Task-Adaptive Tokenization: Enhancing Long-Form Text Generation Efficacy in Mental Health and Beyond

We propose task-adaptive tokenization as a way to adapt the generation pipeline to the specifics of a downstream task and enhance long-form generation in mental health. Inspired by insights from cognitive science, our task-adaptive tokenizer samples variable segmentations from multiple candidate outcomes, with sampling probabilities optimized based on task-specific data. We introduce a strategy for building a specialized vocabulary, along with a vocabulary merging protocol that allows for the integration of task-specific tokens into the pre-trained model's tokenization step. Through extensive experiments on psychological question-answering tasks in both Chinese and English, we find that our task-adaptive tokenization approach brings a significant improvement in generation performance while using up to 60% fewer tokens. Preliminary experiments point to promising results when using our tokenization approach with very large language models.


Introduction
During a time when mental health support is quickly growing (Hock et al., 2012; WHO, 2020), text generation techniques have been identified as potentially useful tools to assist mental health professionals (MHPs) and provide mental health support to those who share their struggles online (Demasi et al., 2020; Liu et al., 2021a; Sabour et al., 2022). With the help of such generation tools, counselors and social workers alike can increase their work efficiency and offer timely and expert feedback to those in need, especially in under-resourced areas where access to mental health services is challenging (Patel et al., 2011; Brenman et al., 2014; Shen et al., 2022a).
The task of Psychological Question-Answering (PsyQA) is to generate a supportive response to a help-seeking post, with an important requirement that the language used in the response aligns with the style used by MHPs (Sun et al., 2021; Welch et al., 2020).¹ The task poses significant challenges due to its domain-specific terminology, text structure, and language style, which are often underrepresented in pre-training data. Recent research has demonstrated that the linguistic patterns of MHPs align with established counseling principles and behaviors (Lahnala et al., 2021), further exacerbating the difficulty of the task, as pre-training data is often sourced from laypeople (Kalla and Smith, 2023). Despite recent progress in large-scale language models (LLMs) (Touvron et al., 2023; Cui et al., 2023; Hu et al., 2022), we find that the few-shot abilities of these models lag behind in mental health applications due to this misalignment with MHP linguistic behaviors and style (see Appx A). This requires tuning models to the downstream domain. However, much of the current work focused on fine-tuning (Shen et al., 2020; Ji et al., 2022) still yields unsatisfying results due to the scarcity of professional data, which is often insufficient to capture all the task-specific intricacies.

¹ Our work is available at github.com/MichiganNLP/task-adaptive_tokenization.
In this paper, we introduce task-adaptive tokenization as a strategy to adapt a model to a downstream task. Rather than focusing the adaptation effort on post-training fine-tuning, which has been the typical approach in recent work (Shen et al., 2020, 2022b; Sun et al., 2021; Lee et al., 2021), we propose to make task-specific adjustments in the way the language is segmented during the tokenization stage (see our motivation elaborated in Sec 3). Thus, we propose task-adaptive tokenization, which is built specifically from the task data and equipped with variable text segmentation, yet can be seamlessly integrated into off-the-shelf language models. To illustrate the potential of task-adaptive tokenization, we show an example in Fig 1. As seen in this example, our tokenization process allows for the inclusion of domain-specific terminology such as [_social_isolation] and [_a_sense_of_purpose] in the vocabulary. A model trained with task-specific tokenization is now able to generate these tokens through learned preference, which we show can lead to significant performance improvements.
We confront three major challenges when designing tailored tokenization strategies. First, the creation of a task-specific tokenization vocabulary must be performed through an automatic process, given the labor-intensive and time-consuming nature of manual selection. Second, integrating this task-specific vocabulary seamlessly with pre-trained models is challenging, as it requires techniques to fuse the task-specific vocabulary with the pre-trained vocabulary and fine-tune the resulting model accordingly. Lastly, we need to address the poor representation of newly added tokens in task-specific vocabularies that were not learned during the pre-training phase.
In this paper, we propose task-adaptive tokenization as a method for enhancing text generation in specialized tasks such as PsyQA. This paper makes three main contributions. (1) Building on insights from cognitive linguistics (Thorndyke, 1977; Wells, 1947), we advocate for using task-specific data and developing variable segmentation for a downstream vocabulary as a pre-step for creating a task-adaptive tokenizer. (2) We construct a protocol for merging task-specific and pre-trained vocabularies, allowing fine-tuning inputs to be sampled from multiple tokenization results.
(3) We propose a simple yet effective initialization mechanism to alleviate the difficulty of learning representations for new tokens unseen during pre-training. Through thorough experiments on the PsyQA task in both Chinese and English, we demonstrate the significant improvements achieved by our task-adaptive tokenization approach. Notably, we achieve these enhancements while using up to 60% fewer tokens to express content of equivalent length. In addition, we show that our tokenization brings significant improvement to a 7B LLaMA model, which suggests that our method is effective regardless of model size and can unlock additional performance even in the current era of LLMs.

The PsyQA Task
The goal of the PsyQA task is to generate a supportive response to a help-seeker's post, where an essential requirement is to imitate the use of language characterized as professional by previous work (Sun et al., 2021; Welch et al., 2020). Figure 3 shows an example of a question-answer pair in this dataset. Posts and responses are often extensive, with help-seekers providing detailed accounts of their experiences, and those offering assistance providing comprehensive views, including emotional comfort, in-depth analysis, and various suggestions.
The formal definition of the task is as follows: given a question Q, a description D, and keywords K, let context C denote an integral description of Q, D, and K; c = (c_1, c_2, ..., c_m) is the sequence obtained by segmenting the context C. We aim to generate a response R (a sequence r = (r_1, r_2, ..., r_n)) corresponding to the context C.

Motivation
Our motivation builds on arguments stemming from cognitive science, where (1) a clear distinction is made between the vocabulary used to interpret language and the vocabulary used for language production; and (2) there is evidence for increased efficiency in speech behavior stemming from individual segmentation granularities. These arguments are further supported by optimization and efficiency goals, which are better achieved in the presence of flexible segmentation.

Receptive vs Productive Vocabulary
Within cognitive linguistic research, a clear distinction is made between "receptive" and "productive" vocabulary (Bogaards and Laufer, 2004; Staehr, 2008) - the former referring to words comprehended while reading, and the latter to words actively used in writing or speaking. A strong productive vocabulary has a direct impact on writing quality (Engber, 1995; Fajrina et al., 2021) and is essential for effective communication and precise articulation, particularly in technical fields where specialized terminology is common (Maskor et al., 2016; Faraj, 2015). We therefore hypothesize that while creating a large-scale vocabulary is essential for training language models (i.e., the "receptive" vocabulary), generation tasks require more emphasis on designing and leveraging task-related vocabulary (i.e., the "productive" vocabulary).
To illustrate this gap in practice, consider the PsyQA task as described earlier; a typical optimization objective used for fine-tuning would be

L(θ) = Σ_{t=1}^{n} log P(r_t | r_{<t}, c; θ),    (1)

where c and r are sequences of tokens, i.e., the segmentations of the input context C and the response R. The input of the function starts from c and r instead of the original texts C and R, due to the common practice of using a vocabulary that determines the mapping relationship between texts and tokens. Thus, vocabulary construction is not necessarily considered in the process of optimization. However, if we do not assume a fixed vocabulary in this process, we obtain the log-likelihood

L(θ) = log Σ_{c ∈ S(C), r ∈ S(R)} P(r | c; θ),    (2)

where S(·) denotes the set of all possible segmentations of a text. As seen in Equation (2), different segmentations of a text can influence the entropy of the training corpus and thus can influence the model's performance.
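As a toy illustration of how segmentation choice affects corpus entropy, the sketch below compares the unigram negative log-likelihood of a tiny corpus under a generic subword vocabulary and a task-adapted one. The vocabularies and the greedy longest-match tokenizer are invented simplifications for illustration, not the paper's actual method:

```python
import math
from collections import Counter

def tokenize(text, vocab):
    """Greedy longest-match segmentation (a simplification of BPE-style decoding).
    Falls back to single characters when no vocabulary piece matches."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens

def corpus_nll(corpus, vocab):
    """Unigram negative log-likelihood of the corpus under a given segmentation."""
    seqs = [tokenize(t, vocab) for t in corpus]
    counts = Counter(tok for seq in seqs for tok in seq)
    total = sum(counts.values())
    return -sum(c * math.log(c / total) for c in counts.values())

corpus = ["social isolation", "a sense of purpose", "social isolation hurts"]
generic = {" ", "so", "cial", "iso", "lation", "sense", "purpose"}
adapted = generic | {"social isolation", "a sense of purpose"}
# the task-adapted vocabulary segments the corpus into fewer, more predictable
# tokens, yielding a lower corpus entropy
```

Under the adapted vocabulary, frequent task phrases become single, high-probability tokens, so the same corpus is both shorter in tokens and lower in entropy.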
In practice, researchers often blindly adopt a pre-existing vocabulary without considering the potential distribution inconsistency between the training data (typically used to generate the vocabulary) and the data of the downstream task, which can hinder downstream performance (Liu et al., 2021c). For example, the data on which the word-piece model was trained to obtain the BERT vocabulary originates from Google's Neural Machine Translation benchmark (Wu et al., 2016). The composition of this vocabulary is designed for machine translation, which may not be ideal for PsyQA or for other tasks, according to Equation (2). This additionally supports our argument that a custom vocabulary informed by the task at hand is needed to achieve the best optimization potential.

Efficient Speaker Behavior
Empirical studies have demonstrated that during language production, individuals tend to pause at various syntactic constituents, such as sentence boundaries or between clauses (Abney, 1992; Gee and Grosjean, 1984; Torrance et al., 2007). This phenomenon, referred to as "pause behavior," has been a popular research topic in cognitive linguistics (Thorndyke, 1977; Wells, 1947). A possible explanation for this behavior is the fact that different individuals produce texts at various granularities, from single letters and words to phrases and even entire sentences. When certain expressions are regularly used, they are stored as whole units in our memories, thereby reducing the cognitive load for future usage.
Building on this argument, we hypothesize that implementing a similar strategy in text generation can equally lead to more efficient behavior. Similar to human speakers, with variable segmentation, we can accommodate tokens at the sub-word level to address the rare word problem (Sennrich et al., 2015; Wu et al., 2016) while also including larger granularity units such as phrases and clauses.
This argument is further supported by previous work demonstrating that a fine-grained segmentation, despite its flexibility, can lead to increased computational cost and degraded token representation (Zouhar et al., 2023; Liu et al., 2021c; Yu et al., 2021; Demeter et al., 2020; Wang et al., 2023). For instance, recent large language models such as GPT3 may require roughly two tokens to represent an eight-letter word (see platform.openai.com/playground). This fragmentation also leads to tokens such as [_soci], [al], or [_iso], which are often shared by many words, leading to underrepresentation and preventing the fine-tuned model from learning the compositionality of generation (Liu et al., 2021c; Dou et al., 2021; Yu et al., 2021). Instead, if we could allow more cohesive information to be represented in a token, including, for instance, task-specific tokens such as [_social_isolation], we could potentially reduce the computational cost and achieve stronger token representation.

Task-adaptive Tokenization
Our approach to task-adaptive tokenization consists of three main steps:
1. Task Vocabulary Construction: first, we compile a task-specific vocabulary (Sec 4.1) by leveraging a subword regularization algorithm.
2. Vocabulary Merging: next, we merge the task-specific vocabulary with the original vocabulary from the pre-trained model (Sec 4.2).
3. Token Mapping: finally, we create new token embeddings by mapping each new token to subwords in the pre-trained vocabulary and averaging the subword embeddings (Sec 4.3).

Task Vocabulary Construction
To construct a task-specific vocabulary that allows for variable segmentation, as described in Section 3, we use subword regularization (Kudo, 2018). Subword regularization optimizes the likelihood of the training corpus over all possible text segmentations and produces a vocabulary that consists of a set of tokens and their corresponding log-likelihood scores. Specifically, this algorithm leverages a regularization coefficient to increase the sampling probability of low-score segmentations during training, so that representations of various segmentations are learned. This allows for sampling a certain segmentation among the various possible segmentations based on the score of the text being segmented. To adapt the original algorithm to our setting, we use task-specific data (i.e., all the response sentences from the QA pairs in the PsyQA task) to train a unigram language model. In addition, contrary to the original algorithm, we do not split sentences into words, as we want to include segmentations of various granularities.
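The sampling step at the core of subword regularization can be sketched as follows. This is a brute-force illustration over a toy scored vocabulary; the actual implementation uses a unigram language model with efficient lattice sampling (e.g., SentencePiece), and the token scores here are invented:

```python
import math
import random

def segmentations(text, vocab, prefix=()):
    """Enumerate every segmentation whose multi-character pieces are in the
    scored vocabulary (single characters are always allowed as a fallback)."""
    if not text:
        yield prefix
        return
    for j in range(1, len(text) + 1):
        piece = text[:j]
        if piece in vocab or j == 1:
            yield from segmentations(text[j:], vocab, prefix + (piece,))

def sample_segmentation(text, vocab, alpha=0.5, rng=random):
    """Sample one segmentation with probability proportional to P(seg)^alpha,
    where log P(seg) is the sum of the tokens' scores (cf. Kudo, 2018).
    Smaller alpha flattens the distribution, surfacing low-score segmentations."""
    segs = list(segmentations(text, vocab))
    char_score = min(vocab.values()) - 1.0  # fallback score for bare characters
    logps = [sum(vocab.get(t, char_score) for t in s) for s in segs]
    weights = [math.exp(alpha * lp) for lp in logps]
    total = sum(weights)
    r, acc = rng.random() * total, 0.0
    for seg, w in zip(segs, weights):
        acc += w
        if acc >= r:
            return seg
    return segs[-1]

# illustrative scored vocabulary (log-likelihood scores are invented)
vocab = {"_a": -2.0, "_sense": -3.5, "_of": -2.5, "_purpose": -4.0,
         "_a_sense_of_purpose": -6.0}
seg = sample_segmentation("_a_sense_of_purpose", vocab, alpha=0.5)
# possible outcomes include ("_a", "_sense", "_of", "_purpose") and
# ("_a_sense_of_purpose",), sampled in proportion to their scores
```

During fine-tuning, drawing a fresh segmentation per example exposes the model to multiple granularities of the same text.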
To illustrate, consider the sentence "a sense of purpose in life" as an example. The original model segments it into subwords such as: [_a] [_sense] [_of] [_purpose] [_in] [_life]. With our modification, the model is also able to produce additional segmentations of larger granularity, such as: [_a_sense] [_of_purpose] [_in_life] and [_a_sense_of_purpose] [_in_life]. Alg 1 shows the task vocabulary construction process, including our modifications to the original algorithm.

Algorithm 1: Task-specific Vocabulary Construction
Data: task corpus D; target vocabulary size N.
Result: a vocabulary of size N, where t_i is the i-th token and l_i is its corresponding score.
The process followed in each step of the algorithm is as follows:
1. Divide all sentences s ∈ D into pieces cut at any length (without pre-splitting into words);
2. Choose the most frequent pieces, together with the union of all characters, to form a large seed vocabulary V_big = {t_1, ..., t_M};
3. Optimize the scores l_i with EM to maximize the likelihood of D, where S(s) is the set of all possible segmentations of a sentence s;
4. Compute the loss_i of each t_i in V_big, i.e., the drop in corpus likelihood if t_i were removed;
5. Sort (t_1, loss_1), ..., (t_M, loss_M) by loss;
6. Prune the lowest-loss tokens, always retaining single characters, until the vocabulary size reaches N.
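A heavily simplified sketch of this construction is given below. Substring frequencies stand in for the EM-estimated scores, pruning is by score rather than likelihood loss, and all sizes are illustrative; the key property kept from Alg 1 is that pieces of any length are candidates and characters are always retained:

```python
import math
from collections import Counter

def build_task_vocab(sentences, seed_size=200, final_size=80, max_piece_len=20):
    """Simplified sketch of Alg 1: seed a large vocabulary with frequent
    substrings of ANY length (no pre-splitting into words), then prune down
    to the target size. The real algorithm estimates scores with EM and prunes
    by likelihood loss (Kudo, 2018)."""
    counts = Counter()
    for s in sentences:
        for i in range(len(s)):
            for j in range(i + 1, min(i + 1 + max_piece_len, len(s) + 1)):
                counts[s[i:j]] += 1
    seed = dict(counts.most_common(seed_size))          # big seed vocabulary V_big
    total = sum(seed.values())
    scored = {t: math.log(c / total) for t, c in seed.items()}
    # keep the highest-scoring pieces
    ranked = sorted(scored, key=scored.get, reverse=True)
    vocab = {t: scored[t] for t in ranked[:final_size]}
    floor = min(scored.values()) - 1.0
    for c in {ch for s in sentences for ch in s}:       # characters guarantee coverage
        vocab.setdefault(c, floor)
    return vocab

vocab = build_task_vocab(["a sense of purpose", "a sense of calm"])
# the resulting vocabulary mixes single characters with multi-character pieces,
# each carrying a log-likelihood-style score for segmentation sampling
```
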

Vocabulary Merging
After creating a task-specific vocabulary, an important challenge is how to incorporate this "alien" vocabulary in the pre-trained models, as such models already have their own vocabulary and corresponding embeddings.To this end, we build a protocol for merging a task-specific vocabulary with a pre-existing vocabulary (referred to as original vocabulary), as shown in Fig 2.
The design of such a merging protocol considers two aspects. First, to inherit the embedding matrix of the pre-trained models, the order of all tokens from the original vocabulary is maintained in the merged vocabulary. In this way, new representations can be added to the embedding matrix by extending the rows of the original one (Rule 4 in Fig 2). Second, special tokens and the tokens from the original vocabulary that are constructed based on unique segmentation algorithms (e.g., WordPiece and BPE) do not have a score for sampling. Thus, we have to assign them an appropriate score based on our guidelines (Rules 1, 2, and 3 in Fig 2). We assign −bigScore · (len(token)+1)/len(token) to tokens qualifying for Rule 2, where bigScore is chosen large enough that the resulting score is lower than the lowest score among task-specific tokens, ensuring that task-specific tokens have a higher priority to be sampled; meanwhile, longer tokens receive a bigger score than their substring tokens, following the BPE/WordPiece design that prioritizes longer segments. Please see token statistics for a merged vocabulary in Appendix B.
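The merging rules above can be sketched as follows. This is a minimal illustration: the token names and the bigScore value are invented, and the exact rule conditions in Fig 2 may differ in detail:

```python
def merge_vocabs(original, task_vocab, big_score):
    """Sketch of the merging protocol. `original` is the ordered list of
    pre-trained tokens (order preserved so the embedding matrix can be reused);
    `task_vocab` maps task tokens to unigram log-likelihood scores; `big_score`
    must exceed the magnitude of every task-token score so original-only tokens
    end up with lower sampling priority."""
    merged = {}
    for tok in original:
        if tok in task_vocab:
            # overlapping tokens keep their task-specific score
            merged[tok] = task_vocab[tok]
        else:
            # original-only tokens get a low score; longer tokens score higher
            merged[tok] = -big_score * (len(tok) + 1) / len(tok)
    for tok, score in task_vocab.items():
        if tok not in merged:
            # new task-specific tokens are appended after the originals
            merged[tok] = score
    return merged

original = ["_so", "_cial", "_help"]
task = {"_help": -3.2, "_social_isolation": -7.5}
merged = merge_vocabs(original, task, big_score=20.0)
```

Because Python dicts preserve insertion order, the first rows of the merged vocabulary line up with the original embedding matrix, and new embeddings are simply appended.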

Token Mapping
To address the possible difficulty of representation learning for new tokens never seen during pre-training, we propose a simple but effective initialization mechanism. For each new token, we acquire its subword embeddings by using the primary pre-trained tokenizer to segment the new token into a set of subwords. The initial embedding of a new token is set to the average of its subwords' embeddings from the original embedding matrix. For overlapping tokens, we leverage the existing embedding representations.

(Fig 2, summary of rules: a score is calculated for non-overlapping tokens from the original vocabulary; overlapping tokens receive the score calculated in the task-specific vocabulary; non-overlapping tokens from the task-specific vocabulary are appended to the end.)
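This initialization can be sketched with a few lines of numpy; the token names, the toy tokenizer, and the embeddings below are invented for illustration:

```python
import numpy as np

def init_new_embeddings(new_tokens, base_tokenize, base_embeddings, base_vocab):
    """Initialize embeddings for unseen task tokens as the average of their
    subword embeddings under the original (pre-trained) tokenizer."""
    rows = []
    for tok in new_tokens:
        ids = [base_vocab[s] for s in base_tokenize(tok)]
        rows.append(base_embeddings[ids].mean(axis=0))
    return np.vstack(rows)

# toy "pre-trained" model: 4 subwords with 3-dimensional embeddings
base_vocab = {"_social": 0, "_isolation": 1, "_a": 2, "_sense": 3}
base_emb = np.arange(12, dtype=float).reshape(4, 3)
# stand-in for the pre-trained tokenizer's segmentation of the new token
tokenize = lambda t: t.replace("_social_isolation", "_social _isolation").split()
new_emb = init_new_embeddings(["_social_isolation"], tokenize, base_emb, base_vocab)
# the new row is the mean of the "_social" and "_isolation" rows
```

In practice, the returned rows are stacked under the original embedding matrix, matching the appended positions of the new tokens in the merged vocabulary.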

Experiments
We assess the effectiveness and efficiency of our task-adaptive tokenization on the PsyQA task through several automatic and manual evaluations.

Datasets
CN PsyQA is a Chinese dataset of psychological QA support, where the answers are provided by well-trained volunteers and counselors (Sun et al., 2021). This dataset contains 22,346 questions and 56,063 answers, where each question relates to one or more ground-truth responses. MHP Reddit is an English dataset of post-response pairs crawled from Reddit, where responders self-identify as mental health professionals in their profiles (Lahnala et al., 2021). Compared to CN PsyQA, this dataset is smaller in size (9,501 QA pairs), and the responses are relatively shorter.

Compared Methods
We compare our task-adaptive tokenizer with the tokenizers from off-the-shelf pre-trained models. We acknowledge that the implementation details of a tokenizer are bound to the pre-trained model type and corpora. We provide details of the tokenizers we compare against in Tab 1, including the associated model type, vocabulary size, and segmentation algorithm. For brevity, we use the notations [LLM_base] and [LLM_TaT] under the PsyQA or MHP datasets to represent the corresponding base and task-adaptive tokenizers for a model named LLM, respectively. Accordingly, we append w/o mapping to the notations to indicate whether the mapping mechanism is used.

Experiment Details
As our backbone models, we adopt the small versions of GPT2 (Radford et al., 2019) and BART (Lewis et al., 2019) to analyze the generative improvements in both decoder-only and encoder-decoder models. Additionally, to study the effectiveness of our approach on larger language models, we include a 7B LLaMA model in our experiments (Touvron et al., 2023). However, due to limited resources, we were only able to conduct experiments on the Chinese version of LLaMA (Cui et al., 2023). For training, we create 8:1:1 splits of the datasets (Sec 5.1) and fine-tune our small backbone models (i.e., GPT2 and BART), while for LLaMA we only train the LoRA adapter (Hu et al., 2022) and the input & output embeddings. Additional details for training and generation are included in Appx C.

Automatic Evaluation
Effectiveness. We use the HuggingFace evaluation tools (Wolf et al., 2019) and report Bleu-1 (B-1), Bleu-3 (B-3), the average of Bleu-1 through Bleu-4 (Bleu) (Papineni et al., 2002), and RougeL (Lin, 2004) for generation on the test set. [Table 2; the Sun et al. (2021) baseline fine-tunes GPT2 with auxiliary support strategy information; † indicates a significant improvement over the base (p-value < 0.05).] Although reference-based metrics have known limitations, we report these results as a reference since they are commonly used by previous research (Gu et al., 2023; Cao and Wang, 2022; Yue et al., 2021). In addition, we believe that the multiple gold responses provided for each question in the CN PsyQA dataset alleviate the shortcomings of reference-based metrics to a degree. We leverage character-level comparisons for Chinese and word-level comparisons for English. To establish a fair comparison, we apply an NLTK word tokenizer (Bird et al., 2009) to all generated responses.
From Tab 2, the task-adaptive tokenizer consistently outperforms the baseline tokenizers, with a maximum increase of 37.1% on Bleu and 74.8% on RougeL.The results demonstrate two important insights.First, task-adaptive tokenization shows a larger increase on Bleu-3 than Bleu-1, indicating that variable segmentation can enhance the expression of task-specific phrases.Second, the increase in RougeL suggests a successful retrieval of task-specific expressions from the dataset.
However, since the results from the automatic evaluation do not indicate large improvements for the mapping mechanism on CN PsyQA, we turn to human evaluations in Sec 5.5, where the results demonstrate that the mapping mechanism is important for the generation quality as perceived by humans. To also gain insights into the weaker effectiveness of task-adaptive tokenization on MHP Reddit, in addition to the human evaluation conducted to validate its effectiveness, in Sec 6 we extend our experimentation by creating a parallel English corpus translated from CN PsyQA. This addition allows us to probe the underlying factors contributing to the limited improvement observed on MHP Reddit: whether the disparity in performance can be attributed to linguistic differences (English versus Chinese) or to disparities in data quality within the MHP Reddit dataset.

Efficiency. To assess generation efficiency, we employ various metrics on the test set, including the average number of centiseconds per generation (#cSec), the average number of tokens per generation (#Tok), response length (Len), generation length per token (Len/#Tok), and generation length per centisecond (Len/#cSec). To calculate response length, we use token-agnostic measures: the number of characters for Chinese and the number of whitespace-split words for English. The token utilization rate is then derived by dividing the response length by the number of tokens.
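The efficiency metrics above can be computed as in the following sketch; the inputs are illustrative, and response length uses character counts as for Chinese (word counts would be used for English):

```python
def efficiency_metrics(responses, tokenized, centiseconds):
    """Compute per-generation averages of the efficiency metrics.
    `responses` are raw output strings, `tokenized` the corresponding token
    lists, and `centiseconds` the per-generation wall time."""
    n = len(responses)
    length = sum(len(r) for r in responses) / n   # characters (Chinese setting)
    n_tok = sum(len(t) for t in tokenized) / n
    c_sec = sum(centiseconds) / n
    return {
        "#cSec": c_sec,
        "#Tok": n_tok,
        "Len": length,
        "Len/#Tok": length / n_tok,               # token utilization rate
        "Len/#cSec": length / c_sec,
    }

# one toy Chinese generation: 5 characters produced as 3 tokens in 10 cSec
m = efficiency_metrics(["你需要休息"], [["你", "需要", "休息"]], [10])
```
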
Tab 3 indicates that by using task-adaptive tokenization, models trained on the Chinese and English datasets use significantly fewer tokens to represent more content. This enhancement is more apparent in Chinese, where the generation length per token increases from 0.79 to 2.00, a more than twofold increase in token utilization rate. However, the results only show a significant improvement in generation speed for the Chinese dataset. We believe this occurs because responses in the MHP Reddit dataset are rather short and thus benefit less from using fewer tokens during generation, which is offset by the increased time consumption of the expanded embedding layer.

Vocabulary Size. We also investigate the influence of different task-specific vocabulary sizes on generation quality (Appx D). The results indicate that an optimal vocabulary size for merging may exist, but the differences are subtle.

Human Evaluation
We recruit five native-speaker professionals each for the human evaluation of the Chinese and English results. Prior to the evaluation, participants were provided with ten QA pairs considered the "gold standard." They were instructed to familiarize themselves with the wording, structure, and language style of these gold responses, serving as a calibration exercise for assessing the professional expression of a response and how closely it aligns with the standard responses. Each participant underwent ten rounds of evaluation under a guideline (see Fig 4). In each round, they were presented with a post and a corresponding response triplet, comprising the responses from GPT2_base, GPT2_TaT w/o mapping, and GPT2_TaT with mapping. The participants were then tasked with ranking the three responses on three aspects: (1) Fluency: the response's fluency and readability; (2) Coherence: the responsiveness of the response to the post's content and its logical consistency; and (3) Professional expression: the proximity of the generated response to the standard responses in terms of wording, structure, and language style.
From the findings presented in Tab 4, the inclusion of a mapping mechanism is crucial for ensuring robust token representations, particularly when dealing with small-scale data (MHP Reddit). Without this mechanism, the generated responses exhibit a significant decline across all three aspects, despite an increase in automatic evaluation scores. Moreover, our tokenization approach with the mapping mechanism outperforms the baseline on CN PsyQA in human evaluation, even though this improvement is not reflected in the automatic evaluation results.

Performance on Large Language Models
We investigate the benefits of our task-adaptive tokenization on the effectiveness and efficiency of generation for recent LLMs. Tab 5 shows the results when using the 7B LLaMA model, as described in Sec 5.3. The RougeL score increases by 15.0% when applying our tokenization, which indicates that our task-adaptive tokenization brings additional, model-agnostic performance benefits.

Further, we investigate whether the disparity in performance between the two datasets arises from linguistic distinctions or data quality concerns. As highlighted in the description of the datasets in Sec 5.1, participants in the MHP Reddit dataset primarily self-identify as professionals in their profiles. Additionally, we observe that many QA pairs in MHP Reddit resemble general chitchat, where task-specific terminology is less prevalent. To address this, we translate 5,118 QA pairs from the CN PsyQA dataset into a parallel corpus, split into an 8:1:1 ratio (training, development, test). With this, we aim to reassess the effectiveness of our proposed tokenization techniques in an English data context. As illustrated in Tab 7, task-adaptive tokenization markedly enhances generation effectiveness across all metrics. Based on these results, we conclude that our proposed tokenization method performs effectively in tasks involving frequent use of domain-specific expressions, as compared to open-domain communication.

Related Work
Segmentation Algorithms and Vocabulary Development. Mapping text into tokens, a key step in the NLP pipeline, has a rich history of algorithms and vocabulary development. Early proposals for segmenting text varied from utilizing linguistic cues at different levels (e.g., morphology or syntax) to statistical methods (Creutz and Lagus, 2006; Luong et al., 2013; Zouhar et al., 2023). Notably, during the era of statistical machine translation, phrase-level translation, which shares a similar idea with variable segmentation, was one of the most promising translation methods of its time (Koehn et al., 2003, 2007). This paradigm enjoyed considerable popularity until the rise of deep learning techniques shifted the focus to subword-level segmentation, given the need to address the challenge of poorly represented rare/sparse words (Sennrich et al., 2015, 2016; Kudo, 2018; Kudo and Richardson, 2018). This approach largely improves the performance of NLP models by leveraging shared subword units and maintaining a compact vocabulary. In parallel, the use of vocabularies transformed with the advent of large language models (LLMs). Previously, each model tended to develop its own vocabulary in isolation (Jean et al., 2015), but recent work directly reuses the vocabulary of pre-trained models to inherit the strong representations acquired through pre-training. Despite existing work (Bagdasaryan et al., 2022; Liu et al., 2021d), customizing vocabularies for specific tasks lost popularity due to the challenges of integrating them into today's pretraining-finetuning paradigm.
Recently, research has been proposed for tokenization quality evaluation (Zouhar et al., 2023) and for token cost analysis of tokenizers across languages (Ahia et al., 2023), indicating increased interest in tokenization improvement. It is worth noting that Liu et al. (2021d) and Sachidananda et al. (2021) also addressed the vocabulary gap between pretraining and finetuning or domain-level post-pretraining; however, their solutions either require an additional model module for token alignment or operate solely at the subword level. In contrast, our work provides a model-agnostic solution that embraces the merits of flexible variable segmentation cherished in earlier research while still retaining the ability to leverage existing pre-trained models.
Generation in Mental Health Support. In recent years, several studies have explored the application of generation techniques for mental health support (Shen et al., 2022b; Lee et al., 2021; Sabour et al., 2023; Hsu et al., 2023; Liu et al., 2021b), including counseling-style dialog generation systems (Shen et al., 2020) and the incorporation of counseling strategies in response generation (Sun et al., 2021). Furthermore, recent work has investigated the use of large language models as expert systems for support strategy counseling (Zhang et al., 2023). However, rather than focusing adaptation effort on fine-tuning or prompting, our study focuses on tokenization, an easily overlooked component in the NLP pipeline. We hypothesize that for tasks in technical writing fields, e.g., mental health, adapting the tokenization to the downstream language style is a potential strategy to unlock additional performance.

Conclusion
In this paper, we proposed task-adaptive tokenization as a way to adapt the generation pipeline to the specifics of a downstream task. We demonstrated the efficiency and improved long-form quality of generation for the domain of mental health, where we specifically addressed the task of psychological question answering. Our tokenizer leverages the specifics of the downstream task data, while still retaining the ability to integrate into existing pre-trained models. Through extensive experiments, we demonstrated the ability of task-adaptive tokenization to enhance both the effectiveness and efficiency of long-form generation.
We believe our work is particularly useful in the era of large language models (LLMs), as the proposed task-adaptive tokenizer can lead to significant improvements while being domain and model agnostic. Based on our findings, we suggest plug-and-play tokenization for LLMs when performing specific generation tasks.

Limitations
Despite the strengths of our proposed task-adaptive tokenization, several limitations remain. In particular, due to limited resources, we were only able to test it on one dataset and one large-scale language model. Future work should consider evaluating the effectiveness of our task-adaptive tokenizer on additional domains and LLMs. The effectiveness of this tokenization should also be verified on additional languages and models of various sizes. Finally, in our experiments, we found that our tokenization does not significantly enhance generation speed in English, which may be because the English vocabulary has less room to increase its granularity compared to a character-based language like Chinese.

Ethics Statement
Several ethical situations should be considered. First, due to the black-box nature of neural networks, we do not recommend that any generation technique, including our proposed method, be used directly by a mental health supporter. Instead, in practice, it could be exposed to well-trained MHPs informed of the pros and cons of using generative AI, and offered as a suggestion tool for professionals. The professionals should be fully informed and trained to be accountable for the content they edit and release. Second, during human evaluations, we informed all the human evaluators that the responses they would see might cause some discomfort, so that they could decline participation in the evaluation. Finally, regarding the use of the PsyQA data, we have received authorization to use the data and conduct research on it ethically and legally through approval from the data publishers.

A Few-shot Performance of LLMs on PsyQA

Due to length restrictions, using more than one PsyQA example in the prompt was not allowed for longer post-response pairs, making few-shot in-context learning for this task impractical.
In addition, in cases where we were able to provide more than one example in the prompt, the model generally produced low-quality, incoherent responses that were inconsistent with the input context (i.e., the user's post). We believe this is expected, as the model would have difficulty distinguishing between the posts in the few-shot examples and the post that requires a generated response.
Misalignment with the language behavior of MHPs. The language style of LLMs is adopted from their pre-training data and human feedback, which may not meet the requirements of the linguistic behaviors of MHPs. In other words, despite the significant performance of recent LLMs, desirable performance can only be achieved by fine-tuning the model on professional data that captures the rich MHP language style and patterns.

B Statistics on tokens at different lengths in a merged vocabulary
Table 8 presents the top-scored tokens at various lengths, showing that the task-adaptive tokenizer successfully includes many task-specific terms and expressions in its vocabulary.
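The statistic reported in Table 8 can be sketched as follows; this is a minimal, hypothetical illustration assuming the merged vocabulary is a plain dict from token strings to log-likelihood scores (the interval bounds and function name are illustrative, not the paper's exact setup):

```python
from collections import defaultdict

def top_tokens_by_length(vocab_logp, intervals=((1, 2), (3, 5), (6, 10)), k=3):
    """Bucket vocabulary entries by token length and return the k
    highest log-likelihood tokens per length interval, as in Table 8.
    Hypothetical sketch; interval bounds are illustrative."""
    buckets = defaultdict(list)
    for tok, logp in vocab_logp.items():
        for lo, hi in intervals:
            if lo <= len(tok) <= hi:
                buckets[(lo, hi)].append((tok, logp))
                break
    # Sort each bucket by descending log-likelihood and truncate to k.
    return {iv: sorted(toks, key=lambda x: -x[1])[:k]
            for iv, toks in buckets.items()}

# Toy merged vocabulary with log-likelihood scores.
vocab = {"a": -1.0, "and": -1.5, "sense": -4.0,
         "isolation": -6.0, "a sense of": -7.0}
stats = top_tokens_by_length(vocab)
```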

C Experiment Details
Except for the experiments on vocabulary size, we adopted a uniform task-specific vocabulary size for task-adaptive tokenizer construction, merging a 10k task-specific vocabulary with the original vocabulary of each pre-trained model.
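A minimal sketch of such a vocabulary merge, assuming both vocabularies are plain dicts from token strings to unigram log-probabilities; the function name and the simple deduplication rule are illustrative, not the paper's exact merging protocol (which is depicted in Figure 2):

```python
def merge_vocabularies(pretrained, task_specific, target_extra=10_000):
    """Merge a task-specific vocabulary into a pre-trained one.

    Task-specific tokens absent from the pre-trained vocabulary are
    added, highest log-probability first, up to `target_extra` new
    entries; overlapping tokens keep their pre-trained entry.
    Hypothetical sketch only.
    """
    merged = dict(pretrained)
    new_tokens = {t: lp for t, lp in task_specific.items() if t not in merged}
    for tok, logp in sorted(new_tokens.items(), key=lambda kv: -kv[1])[:target_extra]:
        merged[tok] = logp
    return merged

# Toy example: two task-specific phrases get added; "and" is deduplicated.
pretrained = {"so": -3.0, "cial": -3.5, "and": -1.2}
task = {"social isolation": -6.1, "and": -1.0, "a sense of purpose": -7.2}
merged = merge_vocabularies(pretrained, task, target_extra=10_000)
```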
During training, the regularization coefficient for sampling a segmentation among the various candidates was 0.5. We used one Nvidia A40 GPU for training GPT and BART, which can load eight samples with a padding length of 1,024.
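The segmentation sampling used during training can be sketched in plain Python as follows; this is a simplified, hypothetical illustration of unigram-style subword regularization with a smoothing coefficient alpha (0.5 above), not the exact implementation, using a toy vocabulary that maps pieces to log-probabilities:

```python
import math
import random

def segmentations(s, vocab):
    """Enumerate all segmentations of `s` into pieces from `vocab`."""
    if not s:
        yield []
        return
    for i in range(1, len(s) + 1):
        piece = s[:i]
        if piece in vocab:
            for rest in segmentations(s[i:], vocab):
                yield [piece] + rest

def sample_segmentation(s, vocab, alpha=0.5, rng=random):
    """Sample one segmentation with probability proportional to
    P(seg)^alpha, where P(seg) is the product of the unigram piece
    probabilities; alpha is the regularization coefficient."""
    segs = list(segmentations(s, vocab))
    weights = [math.exp(alpha * sum(vocab[p] for p in seg)) for seg in segs]
    r = rng.random() * sum(weights)
    for seg, w in zip(segs, weights):
        r -= w
        if r <= 0:
            return seg
    return segs[-1]

# Toy vocabulary of pieces with log-probabilities.
vocab = {"a": -2.0, "b": -2.0, "ab": -1.0}
seg = sample_segmentation("ab", vocab, alpha=0.5)
```

With alpha → 0 the sampling approaches uniform over segmentations; larger alpha concentrates probability on the most likely segmentation.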

D Vocabulary size
We report the generation effectiveness across different task-specific vocabulary sizes in Table 10.

E Case Study
For each generation setting, we present one generated case. The results are in Table 11 and Table 12. From the results on MHP Reddit, we see that, due to the scarcity of the data, generation quality is hurt by the poor representation of newly-added tokens, while fluency, coherence, and professional expression are recovered after applying the mapping mechanism.

Context (Table 12): [QUE] can we talk about dissociation?<|sep|>[DESC] i'm just wondering what other people's experiences with dissociation are, and if there are different types? personally, my way to dissociate is kind of an attempt to evoke/"be" stronger versions of myself (always male versions though - which i find interesting - i've no desire to be male irl, just in my head) when i'm under duress of some sort, usually pain or anxiety. is this unusual? my boyfriend, on the other hand, calls it dissociation when he kind of "zones out", he says his eyes glaze over and he stops doing anything or really thinking anything (also in times of stress).

Figure 1 example: "Social isolation and not having a sense of purpose in life have been linked to mood disorders"
Unique Seq (16 tokens): [ social ] [ iso ] [ la ] [ tion ] [ and ] [ not ] [ ha ] [ ving ] [ a ] [ sense ] [ of ] [ pu ] [ rp ] ...
Seq 1 (9 tokens): [ social ] [ isolation ] [ and ] [ not ] [ ha ] [ ving ] [ a sense of ] [ purpose ] [ in life ]
Seq 2 (5 tokens): [ social isolation ] [ and ] [ not having ] [ a sense of purpose ] [ in life ]

Figure 1: A brief comparison between a task-adaptive and a traditional tokenizer. With task-adaptive tokenization, the same phrase appearing in different training instances may result in different tokenizations. For instance, Seq 1 and Seq 2 illustrate two different token sequences of the same input using a task-adaptive tokenizer.

Figure 2: An overview of the merging protocol.

... a seed vocabulary V_big = {t_1, ..., t_M}, where M ≫ N;
3. Build a unigram model on D. The probability of a segmentation {t_1, ..., t_K} of a sentence s is P(t) = ∏_{i=1}^{K} p(t_i), and the most probable segmentation is t* = argmax_{t ∈ S(s)} P(t), where S(s) is the set of all possible segmentations of s. Apply the EM algorithm to the unigram model with the objective L = Σ_{s ∈ D} log( Σ_{t ∈ S(s)} P(t) );
4. Compute the loss loss_i of each t_i in V_big, where loss_i represents how much the likelihood L is reduced when the piece t_i is removed from the current vocabulary;
5. Sort the tuples (t_1, loss_1), ..., (t_M, loss_M) in descending order of loss;
6. Keep the union of single characters and the pieces with the highest loss scores until the target vocabulary size N is reached, yielding the final vocabulary V = {(t_1, l_1), ..., (t_N, l_N)}, where l_i is the log probability of t_i under the unigram model.
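The final keep-or-prune steps of this vocabulary-building procedure can be sketched as follows; a simplified, hypothetical illustration that assumes the per-piece losses from the likelihood computation are already available as a dict:

```python
def prune_vocabulary(vocab_logp, losses, target_size):
    """Keep the union of single characters and the highest-loss
    multi-character pieces until the target vocabulary size is
    reached. `losses[t]` is how much the corpus likelihood L drops
    if piece t is removed (its computation is not shown here).
    Hypothetical sketch of the pruning step only."""
    chars = {t for t in vocab_logp if len(t) == 1}  # characters are always kept
    multi = sorted((t for t in vocab_logp if len(t) > 1),
                   key=lambda t: -losses[t])        # highest loss first
    kept = set(chars)
    for t in multi:
        if len(kept) >= target_size:
            break
        kept.add(t)
    return {t: vocab_logp[t] for t in kept}

# Toy example: "abb" has the lowest loss and is pruned away.
vocab = {"a": -1.0, "b": -1.0, "ab": -0.5, "ba": -2.0, "abb": -3.0}
losses = {"ab": 5.0, "ba": 1.0, "abb": 0.2}
pruned = prune_vocabulary(vocab, losses, target_size=4)
```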

Table 2: Generation effectiveness. Bleu is calculated by averaging B-1,2,3,4, where B-n denotes the Bleu n-gram precision. R-L is the RougeL score. +pct denotes the percentage of improvement in the Bleu and RougeL scores over the base. * indicates the SOTA results reported by Sun et al. (2021).

Table 3: Efficiency of generation. #cSec and #Tok denote the average number of centiseconds and tokens per generation on the test set, respectively. Length denotes the average length of the generated responses.

Table 4: Human evaluation. An explanation of abbreviations: M for GPT2 TaT +mapping, B for GPT2 base, and NM for GPT2 TaT w/o mapping; F for fluency, C for coherence, and PE for professional expression. Ties are not shown. † denotes a significant win (one-sample sign test, p-value < 0.05).

Table 5: Generation effectiveness on Chinese LLaMA.

Table 6: Length contribution of the three types of tokens generated on both datasets.

Table 7: Generation effectiveness on translated PsyQA. See Table 2 for column and notation definitions.

Table 8: Statistics of the tokens with the highest log-likelihood scores in each length interval. The task-specific vocabulary used for these statistics is built on the MHP datasets.

Table 9: Decoding parameters.

Context (Table 12, continued): are we both correct in calling what we experience "dissociation"? what's your experience? i understand this is kind of a taboo/sensitive topic, i'm just now getting to the point where i can openly talk about it, but i'm desperately curious about it. edit: thanks everyone for sharing, it's a fascinating subject. it's sounding to me like what i experience is different from "normal" dissociation somehow, and i'm going to investigate this further.<|sep|>[KWD] bpd<|sep|>

Table 12: Case study on MHP Reddit. An explanation of abbreviations: M for GPT2 TaT +mapping, B for GPT2 base, and NM for GPT2 TaT w/o mapping.