SideControl: Controlled Open-domain Dialogue Generation via Additive Side Networks

Transformer-based pre-trained language models boost the performance of open-domain dialogue systems. Prior works leverage Transformer-based pre-trained language models to generate texts with desired attributes in two general approaches: (1) gradient-based methods: updating all latent representations of pre-trained models with gradients from attribute models; (2) weighted-decoding methods: re-ranking beam candidates from pre-trained models with attribute functions. However, gradient-based methods incur high computation cost and easily overfit on small training sets, while weighted-decoding methods are inherently constrained by the low-variance, high-bias pre-trained model. In this work, we propose a novel approach to controlling the generation of Transformer-based pre-trained language models: the SideControl framework, which leverages a novel control attributes loss to incorporate useful control signals, and performs well with very limited training samples. We evaluate the proposed method on two benchmark open-domain dialogue datasets; results show that the SideControl framework achieves better controllability, higher generation quality and better sample-efficiency than existing gradient-based and weighted-decoding baselines.


Introduction
With the advance of Transformer-based pre-trained language models (Radford et al., 2019; Raffel et al., 2020; Brown et al., 2020; Zhang et al., 2020), many dialogue systems (Zhang et al., 2020; Roller et al., 2020; Shuster et al., 2020) have shown promising performance in challenging open-domain conversations with humans. However, for controlled dialogue generation, prior works mainly focus on building LSTM-based class-conditional generative models on specific datasets, with task-specific designs on model architecture (Wen et al., 2015; Ke et al., 2018; Chen et al., 2019; See et al., 2019) or policy learning strategy (Kawano et al., 2019; Hsueh and Ma, 2020; Takayama and Arase, 2020; Varshney et al., 2021). In this work, we explore an effective method for controlled generation in Transformer-based dialogue systems, with the goal of adding controllability to state-of-the-art Transformer-based dialogue systems with lower computation cost, less training data and a more flexible control mechanism.
Prior works on controlled text generation for Transformer-based pre-trained language models can be categorized into two general approaches: (1) gradient-based methods and (2) weighted-decoding methods. Gradient-based methods (Dathathri et al., 2019; Goswamy et al., 2020; Lin and Riedl, 2021) propose a plug-and-play language model following p(x|a) ∝ p(a|x)p(x), which plugs an attribute model p(a|x) into a pre-trained language model p(x) to control generation. The gradients from p(a|x) guide the latent representations of the pre-trained model to encode more control attribute information. Weighted-decoding methods (Ghazvininejad et al., 2017; Baheti et al., 2018; Holtzman et al., 2018; Yang and Klein, 2021) modify the sampling weights with attribute functions in beam search at each decoding timestep to control generation. Essentially, the attribute functions re-rank the original beam candidates generated by the pre-trained language model. The main appeal of both gradient-based and weighted-decoding methods is flexibility: users can design attribute models or functions for different controlled generation tasks and apply them to any state-of-the-art pre-trained language model to generate high-quality texts.
However, weighted-decoding methods (Ghazvininejad et al., 2017; Baheti et al., 2018; Holtzman et al., 2018; Yang and Klein, 2021) are limited by the low-variance, high-bias pre-trained language model, since they do not update the pre-trained model. If the pre-trained model yields commonly observed words rather than target attribute words in the beam candidate list, it is difficult for the attribute functions to re-rank and find the target words during generation. Although gradient-based methods (Dathathri et al., 2019; Goswamy et al., 2020; Lin and Riedl, 2021) do not have this limitation, since they update the latent representations of the pre-trained model during inference, the gradient propagation at each decoding timestep involves heavy computation, which results in slow response speed to users. In addition, the controllability of gradient-based methods relies on the attribute model: if the attribute model overfits on a small training set, its gradients will just lead to meaningless updates.
To build an effective and efficient controlled open-domain dialogue system, we propose the SIDECONTROL framework, which treats the pre-trained language model as a feature extractor and trains a light-weight side network to encode complementary information from control attributes. In addition, we introduce a novel control attributes loss to guide the side network during training. As shown in Figure 1, the final output representation is a mixture of a base representation from the pre-trained language model and a side representation from the side network. The mixture coefficient α is learned during training, and is used to balance the prior knowledge from the base network and the task-specific control signals from the side network. From the encoding perspective, the SIDECONTROL framework can not only be applied to any pre-trained language model, but also supports control attributes in diverse formats (e.g. dialogue act, external knowledge document). From the decoding perspective, the SIDECONTROL framework has low computation cost, since it directly samples from its optimized class-conditional language model p(x|a) without additionally updating latent representations during generation. From the sample-efficiency perspective, the SIDECONTROL framework achieves good performance with a few thousand training samples by leveraging the control loss.
We summarize the contributions of this work as follows: 1. we propose a new controlled dialogue generation framework with novel control attributes losses to support different forms of attribute control (e.g. dialogue act, external knowledge document); 2. we conduct empirical experiments to show the sample-efficiency of the SIDECONTROL framework, which can achieve good performance with only 100 ∼ 1000 training samples.

SideNet for Controlled Generation
Firstly, we introduce the SIDECONTROL framework in subsection 2.1, which presents the general idea of using a small side network to coordinate the generation process based on large-scale pre-trained language models (Zhang et al., 2020;Roller et al., 2020;Shuster et al., 2020). Then we provide two realizations of side networks for two types of control attributes: (1) external knowledge document in subsection 2.2, (2) semantic label in subsection 2.3.

General Framework
Given a dialogue context X = x_{1:N}, which contains a fixed number of previous utterances and where N is the total number of tokens in the dialogue context, and a control attribute a which represents the desired controllable attribute(s), the goal is to build a model conditioned on X and a that generates a response Y = y_{1:T} which best approximates the ground-truth human response: p(y_t | y_{1:t-1}, X, a) = softmax(W_vocab h_t), where W_vocab is a learnable output projection and h_t is the last hidden state of the generative model at decoding timestep t. The SIDECONTROL framework consists of a large base network B(·) providing rich feature representations and a small side network S(·) encoding control attribute(s), as illustrated in Figure 1. The base network B(·) can be any pre-trained language model (Zhang et al., 2020; Roller et al., 2020; Shuster et al., 2020). Given the dialogue context x_{1:N} as input to the base network, we take the last hidden states {h_b^t}_{t=1}^T for the response {y_t}_{t=1}^T as our base representations: {h_b^t}_{t=1}^T = B(x_{1:N}). The side network S(·) is a light-weight neural network which encodes the control attribute a into the base representations: {h_s^t}_{t=1}^T = S(a, {h_b^t}_{t=1}^T). Finally, we keep the base representation h_b^t fixed and add the side representation h_s^t upon it to obtain the final combined representation h_t for the current token y_t: h_t = (1 − α) h_b^t + α h_s^t. The mixture coefficient α is learned during training, and balances the useful prior knowledge from the pre-trained language model against the task-specific control attribute signals from the side network. We provide detailed implementations of the side network S(·) and mixture coefficient α in subsection 2.3 and subsection 2.2.
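To make the combination step concrete, here is a minimal numerical sketch. The convex-mixture form h_t = (1 − α)·h_b^t + α·h_s^t, the tanh-free linear output head, and all shapes here are illustrative assumptions, not the paper's exact parameterization:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def next_token_dist(h_b, h_s, alpha, W_vocab):
    # Mix the frozen base representation with the learned side representation,
    # then project to the vocabulary (assumed convex mixture + linear head).
    h = (1.0 - alpha) * h_b + alpha * h_s
    return softmax(W_vocab @ h)

rng = np.random.default_rng(0)
D, V = 16, 32  # hidden size and toy vocabulary size
p = next_token_dist(rng.normal(size=D), rng.normal(size=D), 0.3,
                    rng.normal(size=(V, D)))
```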
The main challenge in this framework is to teach the side network S(·) to provide complementary control-signal information via h_s^t during generation, since the pre-trained language model can already generate fluent responses. To address this challenge, we first intentionally freeze the parameters of the base network B(·) when training the side network; otherwise, we would essentially be training a neural network even deeper than B(·). Second, we introduce the control attributes loss L_control, which is designed to teach the side network to explicitly encode control signals and thereby improve the controllability of the model. The final objective is a combination of the class-conditional language modelling loss L_cclm and the task-specific control attributes loss L_control: L = L_cclm + λ L_control, where λ is a task-specific hyper-parameter. Detailed implementations of L_cclm and L_control are described in subsection 2.2 and subsection 2.3; L_control has a different implementation for each form of control attribute.

Knowledge Document Control
When the control attributes are external knowledge documents, such as persona profiles (Dinan et al., 2020), Wikipedia articles (Dinan et al., 2018), etc., the control attribute is a sequence of tokens a = {k_i}_{i=1}^K, where K is the total number of tokens in the external knowledge document. In this case, we model the knowledge document representation with a single-layer bi-directional LSTM: {h_k^i}_{i=1}^K = BiLSTM(a). The side network is designed to align the controlled knowledge document representation {h_k^i}_{i=1}^K with the base representation h_b^t at each decoding timestep. We compute the cross-attention between {h_k^i}_{i=1}^K and h_b^t following Bahdanau et al. (2014): e_i^t = v^T tanh(W_k h_k^i + W_b h_b^t + b_kb), a^t = softmax(e^t), c_k^t = Σ_{i=1}^K a_i^t h_k^i, where W_k ∈ R^{D×D}, W_b ∈ R^{D×D}, b_kb ∈ R^D and v ∈ R^D are learnable parameters. The attention a^t is a probability distribution over the controlled knowledge document that tells the decoder where to look when generating the next word, and the context vector c_k^t represents what has been read from the controlled knowledge document representation at decoding timestep t. The final side representation h_s^t incorporates the context vector c_k^t into the base representation h_b^t: h_s^t = tanh(W_c [c_k^t; h_b^t] + b_c), where [·;·] denotes the concatenation of c_k^t and h_b^t, and W_c ∈ R^{2D×D} and b_c ∈ R^D are learnable parameters.
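A sketch of the cross-attention and side-representation computation. Standard Bahdanau additive scoring with a score vector v and a tanh projection are assumed where the text does not pin down details; shapes and names are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def knowledge_side_rep(H_k, h_b, W_k, W_b, b_kb, v, W_c, b_c):
    # Bahdanau-style additive attention over knowledge-token states H_k (K, D)
    scores = np.tanh(H_k @ W_k.T + h_b @ W_b.T + b_kb) @ v
    a = softmax(scores)          # attention distribution over the document
    c = a @ H_k                  # context vector c_k^t
    # side representation from the concatenated [c; h_b]
    h_s = np.tanh(W_c @ np.concatenate([c, h_b]) + b_c)
    return a, c, h_s

rng = np.random.default_rng(1)
K, D = 5, 8
a, c, h_s = knowledge_side_rep(
    rng.normal(size=(K, D)), rng.normal(size=D),
    rng.normal(size=(D, D)), rng.normal(size=(D, D)), rng.normal(size=D),
    rng.normal(size=D), rng.normal(size=(D, 2 * D)), rng.normal(size=D))
```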
Since the controlled knowledge document is different per utterance, we implement the mixture coefficient based on the side representation h_s^t and base representation h_b^t at each decoding timestep t: α^t = σ(W_α [h_s^t; h_b^t] + b_α), where [·;·] denotes the concatenation of h_s^t and h_b^t, and W_α ∈ R^{2D×1} and b_α ∈ R are learnable parameters.
In order to encourage the decoder to generate more words from the knowledge document, we adopt the copy mechanism from See et al. (2017) to formulate L_cclm: β^t = σ(W_β [c_k^t; h_b^t] + b_β), p(y_t) = β^t p_vocab(y_t) + (1 − β^t) Σ_{i: k_i = y_t} a_i^t, L_cclm = −Σ_{t=1}^T log p(y_t*), where [·;·] denotes the concatenation of c_k^t and h_b^t, W_β ∈ R^{2D×1} and b_β ∈ R are learnable parameters, p_vocab is computed from h_t in Equation 13, and y_t* is the ground-truth word at decoding timestep t. The term Σ_{i: k_i = y_t} a_i^t sums the attention over occurrences of y_t in the knowledge document at the current decoding timestep t, which assigns higher probability to attended knowledge document words in the final word probability distribution.
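The pointer-generator mixture can be sketched as follows; the function name, token ids and β value are illustrative, not from the paper:

```python
import numpy as np

def copy_distribution(p_vocab, attn, src_ids, beta):
    # Pointer-generator mixture (See et al., 2017): with probability beta,
    # generate from the vocabulary; otherwise copy an attended source token.
    p = beta * p_vocab.copy()
    for a_i, tok in zip(attn, src_ids):
        p[tok] += (1.0 - beta) * a_i
    return p

p_vocab = np.full(5, 0.2)          # uniform toy vocabulary distribution
attn = np.array([0.5, 0.5])        # attention over a 2-token document
p = copy_distribution(p_vocab, attn, src_ids=[1, 3], beta=0.4)
# p[1] = 0.4 * 0.2 + 0.6 * 0.5 = 0.38, and p still sums to 1
```

Note that the copy term only redistributes probability mass onto tokens that actually occur in the knowledge document, which is what biases generation toward document words.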
The control attributes loss for this task is used to encourage generating more non-repetitive words from the knowledge document. We adopt the coverage mechanism from See et al. (2017) to formulate L_control: c^t = Σ_{t'=0}^{t−1} a^{t'}, L_control = Σ_{t=1}^T Σ_{i=1}^K min(a_i^t, c_i^t), where a_i^{t'} is the attention weight of knowledge document word k_i at a previous decoding timestep t'. L_control penalizes the overlap between the current attention distribution and the accumulated previous attention distributions, which prevents the model from repeatedly attending to the same word in the knowledge document. For more details about the copy mechanism and coverage mechanism, please refer to the original paper (See et al., 2017).
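The coverage penalty can be sketched directly from its definition (function name and inputs are illustrative):

```python
import numpy as np

def coverage_loss(attn):
    # attn: (T, K) attention distributions over K knowledge tokens at T steps.
    # Penalize mass placed where cumulative past attention is already high.
    coverage = np.zeros(attn.shape[1])
    loss = 0.0
    for a_t in attn:
        loss += np.minimum(a_t, coverage).sum()
        coverage += a_t
    return loss

# Attending to the same token twice is penalized; spreading attention is not.
repeat = coverage_loss(np.array([[1.0, 0.0], [1.0, 0.0]]))  # -> 1.0
spread = coverage_loss(np.array([[1.0, 0.0], [0.0, 1.0]]))  # -> 0.0
```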

Semantic Label Control
When the control attribute is a semantic label, such as a dialogue act (Li et al., 2017) or an emotion (Rashkin et al., 2019), we implement the side network as a simple feed-forward neural network: h_s^t = tanh(W_d [W_a a; h_b^t] + b_d), where [·;·] denotes the concatenation of W_a a and h_b^t, W_a ∈ R^{1×D} is an embedding matrix that maps the discrete label a to a continuous representation, and W_d ∈ R^{2D×D} and b_d ∈ R^D are learnable parameters. The mixture coefficient α ∈ [0, 1] is a global parameter learned during training, in order to encode both useful prior knowledge from the pre-trained language model and control signals from the semantic label. The class-conditional language modelling loss is the standard negative log-likelihood, L_cclm = −Σ_{t=1}^T log p(y_t* | y_{1:t−1}, X, a), where y_t* is the ground-truth word at decoding timestep t.
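A minimal sketch of the label-conditioned side network, assuming a label-embedding lookup, concatenation with the base hidden state, and a tanh feed-forward layer (the nonlinearity and the table form of the embedding are assumptions):

```python
import numpy as np

def label_side_rep(label_id, h_b, E_a, W_d, b_d):
    # Look up the label embedding, concatenate with the base hidden state,
    # and pass through a single feed-forward layer.
    x = np.concatenate([E_a[label_id], h_b])
    return np.tanh(W_d @ x + b_d)

rng = np.random.default_rng(2)
D, n_labels = 8, 4
h_s = label_side_rep(2, rng.normal(size=D),
                     rng.normal(size=(n_labels, D)),   # label embedding table
                     rng.normal(size=(D, 2 * D)), rng.normal(size=D))
```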
The control attributes loss L_control for this task is used to modify the final latent representations so that the model can generate responses with the target control attribute. However, it is difficult to directly measure how much control attribute information has been encoded into the side representation. Therefore, we approximate it using an independent attribute classifier p(a | h^{1:T}). When training the side network, we keep the attribute classifier fixed and feed the side representations {h_s^t}_{t=1}^T into the classifier. The classifier returns a loss between the current side representations and the target control attribute a*, and optimizing this loss updates the side representation h_s^t towards a higher p(a* | h^{1:T}): L_control = −log p(a* | h_s^{1:T}). Note that the classifier weight W_clf ∈ R^{D×K}, where K is the number of attribute classes, is independently learned on the same training set from the base representations {h_b^t}_{t=1}^T, but is fixed when we update the side network.
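The classifier-guided control loss can be sketched as below, assuming mean-pooling of the side representations before the frozen linear head (the pooling choice is an assumption); in training, gradients would flow into the side representations H_s, not into W_clf:

```python
import numpy as np

def control_loss(H_s, target, W_clf):
    # Mean-pool the T side representations, classify with the frozen linear
    # head, and return the negative log-likelihood of the target attribute.
    logits = W_clf.T @ H_s.mean(axis=0)
    logZ = np.log(np.exp(logits - logits.max()).sum()) + logits.max()
    return -(logits[target] - logZ)

rng = np.random.default_rng(3)
T, D, n_classes = 6, 8, 4
loss = control_loss(rng.normal(size=(T, D)), 1,
                    rng.normal(size=(D, n_classes)))
```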

Evaluation Methods
In this work, we focus on evaluating the controllability and text quality of different controlled generation methods. Additionally, we prefer lower decoding cost and better modularity, in order to apply the proposed method in more possible applications. Therefore, we use the following automatic metrics to evaluate performance. Controllability: this is our main metric, which evaluates whether the proposed method can successfully generate the target control attributes. For base networks, we experiment with DialoGPT (Zhang et al., 2020) and BlenderBot (Roller et al., 2020). Ideally, we expect as good or even better performance when switching the base network from DialoGPT to BlenderBot, since BlenderBot has been trained on a larger dialogue corpus that is likely to provide more informative base representations.

Competitive Baselines
We compare the SIDECONTROL framework with the following competitive baselines: direct fine-tuning of the pre-trained language model, the gradient-based PPLM (Dathathri et al., 2019), and the weighted-decoding FUDGE (Yang and Klein, 2021). We also report results with BlenderBot as an alternative base pre-trained language model, which is used to show the high modularity of our side network.

Knowledge Document Control
In this task, given the previous dialogue context and the external knowledge document for the current speaker, the model will generate one utterance that is relevant both to the context and to the knowledge document. We provide the detailed experiment setups in Appendix B.
Dataset. We use the ConvAI2 dataset (Dinan et al., 2020) for the knowledge document control task. We set the previous 4 utterances as the dialogue context. Each utterance is linked to its corresponding persona profile. Since the test set of ConvAI2 has not been made public, we use the original training set to construct our training set, and split the first 80% original validation set as our validation set and the remaining 20% original validation set as our testing set. In total, we have 153,082 training samples, 38,271 validation samples and 11,590 testing samples.
Performances under Full Data. Table 1 shows that DialoGPT+SideControl outperforms all other baselines in controllability, which validates the effectiveness of the SIDECONTROL framework. For the quality of the generated texts, we find that both FUDGE and PPLM perform worse than the original pre-trained language model, while SIDECONTROL shows improved quality because of the L_cclm optimized during training. We also notice that direct fine-tuning gives the best performance in BLEU-1 and BLEU-2, but worse controllability compared with SIDECONTROL. This is because direct fine-tuning only optimizes the language modelling loss and does not take the control attribute information into account. For the decoding cost, SIDECONTROL is around 6x faster than PPLM during generation, which shows its efficiency during inference. Finally, we find that the performance improvements in controllability and text quality also hold when we apply SIDECONTROL to BlenderBot, which shows the flexible modularity of the side network.
Performances under Small Data. To test the sample-efficiency of the SIDECONTROL framework, we train all baselines on smaller datasets, randomly sampling 100, 1000, 5000 and 10000 training samples from the original training set, and evaluate model performance on the full test set. Figure 2 shows the controllability performance under different training sizes, and we provide detailed text quality results in Appendix E. We find that SIDECONTROL only underperforms PPLM with 100 training samples, since PPLM uses non-parametric bag-of-words features as its attribute model while SIDECONTROL uses a BiLSTM. With 1000 training samples, SIDECONTROL already achieves performance comparable to PPLM. In addition, SIDECONTROL consistently improves as the training size increases.
Ablation Study. To verify the effectiveness of the control loss L_control, we conduct an ablation study with different values of λ in Equation 6. We provide partial results in Table 3 and full results in Appendix D. When λ = 0, the model becomes a vanilla language model that takes no information from the side network, which leads to low controllability. When λ > 0, the model incorporates control attribute information from the side network, which improves controllability. However, incorporating side information leads to a slight increase in model perplexity.

Semantic Label Control
In this task, given the previous dialogue context and the current dialogue act, the model will generate one utterance that is relevant to the context and also satisfies the current dialogue act. We provide the detailed experiment setups in Appendix C.
Dataset. We use the DailyDialog dataset (Li et al., 2017) for the semantic label control task. We set the previous 5 utterances as the dialogue context and follow the standard train/validation/test split of the original dataset to construct our generation dataset. In total, we obtain 35,781 training samples, 3,388 validation samples and 3,123 testing samples.
Performances under Full Data. Table 2 demonstrates that SIDECONTROL has better text quality than FUDGE and PPLM, since we explicitly optimize L_cclm during training. For controllability, PPLM achieves the best performance at a sacrifice of inference efficiency, while SIDECONTROL achieves comparable controllability with around 24x faster decoding.
Finally, the performance improvements in controllability and text quality still hold when we switch the base network from DialoGPT to BlenderBot, which demonstrates that the side network can be flexibly applied to different types of pre-trained language models. Surprisingly, BlenderBot even provides state-of-the-art performance in controllability.
Performances under Small Data. We also compare model performance under different training sizes, following the same setup as the knowledge document control task, and provide detailed text quality results in Appendix F. Figure 3 illustrates that SIDECONTROL achieves better controllability than PPLM when the training size is under 1000. This is because PPLM uses a data-driven classifier as its attribute model in this task, and the classifier overfits on the 100 training samples, which results in poor controllability. Similarly, FUDGE's attribute discriminator overfits on these small training sets, leading to unsatisfactory controllability. Although SIDECONTROL also pre-trains a classifier on the 100 training samples to guide the update of the side representation, its final representation is a combination of the base and side representations. We believe incorporating prior knowledge from the base representation helps SIDECONTROL alleviate the overfitting issue on small training sets.
Ablation Study. We also try different values of λ to study the effect of the control loss L_control, as shown in Table 4. Full ablation results are provided in Appendix D. When λ = 0, the model takes no control attribute signals from the side network during training, which results in low controllability. When λ > 0, the controllability of the model improves, with a slight increase in model perplexity. Both Table 3 and Table 4 verify the effectiveness of the control loss L_control in improving the controllability of pre-trained language models.
Human Evaluation. We randomly sample 50 dialogue contexts and collect the corresponding model-generated responses. For each comparison, we show annotators the same dialogue context, the current dialogue act, and two responses generated by model A and model B respectively. Human annotators are recruited via Amazon Mechanical Turk, and each response receives 5 annotations; in total, we collect 2,250 human annotations. Table 5 shows the results of the text quality evaluation: SIDECONTROL achieves better fluency and context relevancy than PPLM and FUDGE. Table 6 shows the results of the controllability evaluation: SIDECONTROL wins over PPLM and FUDGE in 57% and 54% of comparisons, respectively. Both evaluations show that SIDECONTROL can generate more fluent, context-relevant and attribute-relevant responses than PPLM and FUDGE.

Related Works
There are three major categories of controllable text generation models: class-conditional language models, plug-and-play language models, and weighted decoding. Class-Conditional Language Model. One line of work builds a controllable neural conversation model by leveraging an adversarial learning framework that alternately trains a class-conditional language model and a multi-class discriminator, where the discriminator helps the generative model produce responses with the appropriate dialogue act. However, the control code is modelled as a discrete variable in this work, which limits the controllability capacity of the dialogue model.
Plug-and-Play Language Model. Guiding generation with gradients from additional attribute models is another popular approach. Dathathri et al. (2019) introduce the plug-and-play language model (PPLM), which combines a pre-trained language model p(x) with attribute models p(a|x) to approximate the conditional generative model p(x|a). At each decoding timestep, all hidden representations of the pre-trained language model are shifted with gradients towards a higher p(x|a) ∝ p(a|x)p(x). The attribute models of PPLM are either bag-of-words models or single-layer classifiers, which require much less training data than learning a full conditional generative model. Follow-up works (Goswamy et al., 2020; Lin and Riedl, 2021; Madotto et al., 2020) propose more fine-grained attribute models and generation strategies for specific tasks, such as emotional text generation (Goswamy et al., 2020), story generation (Lin and Riedl, 2021) and conversation generation (Madotto et al., 2020). But since plug-and-play language models have to compute gradients from the attribute model and update hidden representations at each decoding timestep, the generation process is very time-consuming, which leads to high decoding cost.
Weighted Decoding. Weighted decoding runs a more expensive beam search where the sampling probability distribution is altered by desired control attributes, such as topic, sentiment, etc. Ghazvininejad et al. (2017) design a set of style features on controlling topic, sentiment, and repetitive words, and re-compute the beam score of each token with a combination of the original beam score and the style feature score. A recent work (Yang and Klein, 2021) introduces a Future Discriminator for Generation (FUDGE) that trains a binary discriminator for the control attribute prediction and re-scores the probability distribution of the original pre-trained language model with the discriminator prediction via Bayesian factorization. The major limitation of weighted decoding methods is that, if the pre-trained language model is a high-bias estimator, which assigns low probability for desired attribute words and high probability for commonly observed but unrelated words, re-scoring or re-ranking such a "high-biased" distribution cannot guarantee the generation of desired attributes.
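The weighted-decoding step behind FUDGE can be sketched as a per-token rescaling of the language model distribution by the attribute discriminator's predictions. This is a simplification: FUDGE applies the rescoring only to a truncated top-k candidate set, and the probabilities below are toy values:

```python
import numpy as np

def weighted_decode_step(p_lm, p_attr):
    # Bayesian factorization: p(x_t | x_<t, a) ∝ p(a | x_{1:t}) * p(x_t | x_<t)
    p = p_lm * p_attr
    return p / p.sum()

p_lm = np.array([0.5, 0.4, 0.1])    # base LM next-token probabilities
p_attr = np.array([0.1, 0.8, 0.5])  # discriminator: p(attribute | prefix+token)
p = weighted_decode_step(p_lm, p_attr)
# the attribute-likely second token now dominates despite a lower LM score
```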
The SIDECONTROL framework differs from the above methods as follows: (1) the side network only requires access to the last hidden states of the base network, whereas both class-conditional language models (Keskar et al., 2019) and plug-and-play language models (Dathathri et al., 2019) require access to every hidden state of the pre-trained language model, which limits their application to certain pre-trained models; (2) the side network learns a residual on top of the pre-trained language model, which is suitable for small datasets. Directly fine-tuning (Ziegler et al., 2019) large pre-trained language models can cause overfitting on small datasets, and weighted-decoding methods (Ghazvininejad et al., 2017; Yang and Klein, 2021) only modify the final vocabulary distribution of pre-trained models but do not learn model parameters that better adapt to the target task.

Conclusions
In this work, we propose a new method for controlled dialogue generation: adding a small side network that incorporates useful control signals into pre-trained language models. We design a control attributes loss to teach the side network to learn useful control signals. Empirical experiments show that our method is effective even with 100 ∼ 1000 training samples. Besides, our side network supports diverse forms of attribute control and can be flexibly applied to any pre-trained language model, which extends its possible applications to other general controlled text generation tasks.

A Automatic Metrics for Controllability Evaluation
In this section we provide implementation details for how we compute classification accuracy and cosine similarity.

A.1 Dialogue Act Classifier
We train an independent dialogue act classifier to evaluate whether the current generated response matches its conditioning dialogue act. The input to the evaluation dialogue act classifier is a single response, and the output is a prediction of one of the 4 dialogue acts in DailyDialog, i.e. inform, question, directive and commissive.
We construct the training corpus following the standard split of the original DailyDialog dataset, and obtain 87,170 training samples, 8,069 validation samples and 7,740 testing samples. We leverage the BERT model to provide a sequence of word representations and add a single-layer feed-forward neural network to predict the dialogue act of the current sentence. We use AdamW (Loshchilov and Hutter, 2019) with learning rate 0.0001 to train this classifier. We set the batch size to 16 and the total training epochs to 10, and automatically evaluate the model on the validation set every 5000 iterations. We save the model checkpoint with the lowest validation loss as the optimal model. This dialogue act classifier achieves 0.79 accuracy on the test set. Figure 4 shows the confusion matrix of this dialogue act classifier.

A.2 Computation of Cosine Similarity
To measure the similarity between the generated response and the conditioning knowledge document, we compute the cosine similarity between the word embeddings of the generated response and the external knowledge document. The word embeddings are GloVe embeddings (Pennington et al., 2014) pre-trained on Wikipedia 2014 and Gigaword 5; they are 100-dimensional vectors trained on a corpus of 6 billion tokens.
We use the NLTK word tokenizer to tokenize the texts into sets of tokens, and remove stop words based on a pre-defined stop word list from (Bao et al., 2020). Finally, we compute the cosine similarity between the two sets of word vectors.
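A sketch of this metric, assuming each text is represented by the average of its (non-stop-word) word vectors before taking the cosine; the embedding table and tokens here are toy stand-ins for the GloVe vectors:

```python
import numpy as np

def doc_similarity(resp_tokens, doc_tokens, emb, stop_words=frozenset()):
    # Average the word vectors of each text (one plausible reading of
    # "similarity between two sets of word vectors"), then take cosine.
    def avg(tokens):
        vecs = [emb[t] for t in tokens if t in emb and t not in stop_words]
        return np.mean(vecs, axis=0)
    u, v = avg(resp_tokens), avg(doc_tokens)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

emb = {"hiking": np.array([1.0, 0.0]), "music": np.array([0.0, 1.0])}
same = doc_similarity(["hiking"], ["hiking"], emb)  # identical texts -> 1.0
diff = doc_similarity(["hiking"], ["music"], emb)   # orthogonal toy vectors
```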

B Experiment Setups for Knowledge Document Control
We conduct all of our experiments on a single GeForce RTX 2080Ti GPU with 11019 MB of memory.

B.1 Direct Fine-tuning
We directly update all parameters of the pre-trained language model on the ConvAI2 training set, without any side network or control attributes loss. For training, we use AdamW (Loshchilov and Hutter, 2019) with learning rate 0.0001. We set the batch size to 2 and the total training epochs to 10, and automatically evaluate the model on the validation set every 1000 iterations. We save the model checkpoint with the lowest validation loss as the final optimal model. For generation, we follow the setup of FUDGE, which uses top-k sampling with k = 10.

B.2 PPLM
For the implementation of the attribute model, we use the bag-of-words attribute model proposed in the original paper (Dathathri et al., 2019) to encode the external knowledge document. We run the model on the ConvAI2 dataset using the code provided by the original paper: https://github.com/uber-research/PPLM. We set the maximum generation length to 50, the number of gradient update steps to 3, the step size to 0.03, the window length to 5, the number of generated sentences to 1, γ_gm = 0.99 and λ_KL = 0.01.

B.3 FUDGE
For the implementation of the attribute model, we use the bag-of-words attribute model proposed in the original paper (Yang and Klein, 2021) to encode the external knowledge document. We run the model on the ConvAI2 dataset using the code provided by the original paper: https://github.com/yangkevin2/naacl-2021-fudge-controlled-generation.
We set the maximum generation length to 80, the weight on conditioning model to 4.0, consider top 200 outputs from DialoGPT at each decoding timestep before conditioning, and sample from top 10 outputs from DialoGPT at each decoding timestep.

B.4 SideControl
For the implementation of the side network, we use a single-layer bi-directional LSTM which shares the same hidden dimension as the final hidden states of the base network. We tokenize the knowledge document with the same tokenizer as the base network, and share the base network's word embeddings as well. For the training of the side network, we use AdamW (Loshchilov and Hutter, 2019) with learning rate 0.0001. We set the batch size to 4 and the total training epochs to 10, and automatically evaluate the model on the validation set every 100 iterations. For the hyper-parameter λ of the coverage loss in Equation 17, we use grid search on the validation set: searching over λ ∈ {10^-6, 10^-5, 10^-4, 10^-3, 0.01, 0.1}, we find that λ = 10^-5 yields the best performance. For generation, we follow the setup of FUDGE, which uses top-k sampling with k = 10.

C Experiment Setups for Semantic Label Control
We conduct all of our experiments on a single GeForce RTX 2080Ti GPU with 11019 MB of memory.

C.1 Direct Fine-tuning
We directly update all parameters of the pre-trained language model on the DailyDialog training set, without any side network or control attributes loss. For training, we use AdamW (Loshchilov and Hutter, 2019) with learning rate 0.0001. We set the batch size to 2 and the total training epochs to 10, and automatically evaluate the model on the validation set every 1000 iterations. We save the model checkpoint with the lowest validation loss as the final optimal model. For generation, we follow the setup of FUDGE, which uses top-k sampling with k = 10.

C.2 PPLM
For the implementation of the attribute model, we follow the generic discriminator implementation in the original paper (Dathathri et al., 2019), and run the model on the DailyDialog dataset using the code provided by the original paper. We train a dialogue act classifier that takes a single response as input and predicts one of the four dialogue acts. For the training of the classifier, we use Adam (Kingma and Ba, 2017) with learning rate 0.0001, a batch size of 64, and 10 training epochs. For the generation of PPLM, we set the maximum generation length to 50, the number of gradient update steps to 10, the step size to 0.2, the number of generated sentences to 1, γ_gm = 0.95, and λ_KL = 0.01.
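The γ_gm hyper-parameter above controls PPLM's post-norm geometric-mean fusion of the perturbed and unperturbed next-token distributions. A minimal sketch of that fusion step (function name ours; inputs are plain probability lists rather than tensors):

```python
def fuse_geometric_mean(p_modified, p_original, gamma_gm=0.95):
    """Fuse perturbed and original distributions (sketch):
    p ∝ p_mod^γ_gm · p_orig^(1 - γ_gm), renormalized to sum to 1."""
    fused = [pm ** gamma_gm * po ** (1.0 - gamma_gm)
             for pm, po in zip(p_modified, p_original)]
    z = sum(fused)
    return [f / z for f in fused]
```

With γ_gm = 0.95 the perturbed distribution dominates, while the small weight on the original distribution (together with the λ_KL penalty during the gradient updates) keeps generations close to the unmodified language model.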

C.3 FUDGE
For the implementation of the attribute model, we follow the attribute discriminator implementation in the original paper (Yang and Klein, 2021), and run the model on the DailyDialog dataset using the code provided by the original paper. We train a dialogue act discriminator that takes the dialogue context and the current response as input and predicts one of the four dialogue acts. For the training of the discriminator, we use Adam (Kingma and Ba, 2017) with learning rate 2×10^-5, a batch size of 16, and 10 training epochs. For the generation of FUDGE, we set the maximum generation length to 60 and the weight on the conditioning model to 1.0; at each decoding timestep, we consider the top 200 outputs from DialoGPT before conditioning, and sample from the top 10 re-scored outputs.

C.4 SideControl
For the implementation of the side network, we use a single-layer feed-forward neural network that shares the same hidden dimension as the final hidden states of the base network. In addition, we pre-train a dialogue act classifier to compute the control loss in Equation 22. We emphasize that this dialogue act classifier is different from the evaluation classifier: it models the sentence representation from the base network (i.e., DialoGPT) and adds a single-layer feed-forward neural network to predict the dialogue act of the current response. We train this classifier with AdamW (Loshchilov and Hutter, 2019) with learning rate 0.0001 for 10 epochs. We then fix this classifier and train the side network with AdamW and learning rate 0.0001 for another 10 epochs. We evaluate the model on the validation set every 1000 iterations and save the checkpoint with the lowest validation loss. For the hyper-parameter λ of the control loss in Equation 22, we use grid search on the validation set to obtain the optimal value. We search over λ ∈ {1, 10, 100, 10^3, 10^4, 10^5, 10^6} and find that λ = 10^5 yields the best performance on the full training set. For generation, we follow the setup of FUDGE, which uses top-k sampling with k = 10.
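The grid search over λ can be sketched as below. The names `select_lambda` and `train_and_eval` are ours: the latter stands in for training the side network under a total loss of the form L_LM + λ·L_control and returning its best validation loss.

```python
def select_lambda(train_and_eval, candidates=(1, 10, 100, 10**3, 10**4, 10**5, 10**6)):
    """Grid search for the control-loss weight λ (sketch).
    `train_and_eval(lam)` returns a validation loss; lower is better."""
    return min(candidates, key=train_and_eval)
```

Since each candidate trains its own side network, the search cost is the grid size times one side-network training run, which stays cheap because the base network is frozen.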

D Full performances of Ablation Study
We provide performance details for the ablation study in knowledge document control and semantic label control. The full performances of the ablation study in knowledge document control are shown in Table 7.
The full performances of the ablation study in semantic label control are shown in Table 8.

E Full performances of Knowledge Document Control under Different Numbers of Training Samples
For all experiments across different numbers of training samples, we take the hyper-parameter λ = 10^-5 for L_control. Full performances for all models are shown in Table 9, Table 11, Table 13 and Table 15. We also provide some generated samples from the test set for reference, shown in Table 10, Table 12, Table 14, Table 16 and Table 17.

F Full performances of Semantic Label Control under Different Numbers of Training Samples
For the semantic label control task, we find that the optimal hyper-parameter λ for L_control differs across different numbers of training samples. Full performances for all models are shown in Table 18, Table 20, Table 22 and Table 24. We also provide some generated samples from the test set for reference, shown in Table 19, Table 21 and Table 23.

Generated samples for knowledge document control on ConvAI2. The sample tables share the same test instance; only the outputs of DialoGPT-FT and DialoGPT-SideControl differ across training-set sizes (listed in order, separated by "/"):

Knowledge document: "my favorite season is winter." "i won a gold medal in the 1992 olympics." "i love cats and have two cats." "i've a hat collection of over 1000 hats." "i have a weakness for vanilla ice cream."
Dialogue context: "hi how are you doing that sounds great ." "it is , how old are you ? i'm 31 , feeling like i'm getting old lol." "not old you are still young try 50 i'm getting old ." "lol my bones suggest other wise . do you have any animals ?"
Ground Truth: "yes i like and love all animals i have two cats."
DialoGPT-Ori: "my dogs are my favorite, but i'm pretty fond of all dogs so i don't have many."
DialoGPT-FUDGE: "My bones say I'm young too, so I think we're good there for now."
DialoGPT-PPLM: "I think i'm only 35, but i've been in a pretty good relationship so i've probably gotten better."
DialoGPT-FT: "i love my three dogs." / "cats and dogs. you?" / "!!! i've 3 cats but they are all scared of me haha."
DialoGPT-SideControl: "!" / "i don't, they don't get to play." / "i do. i love them. what do you do for fun? goldfish?"

Generated samples for semantic label control on DailyDialog. The sample tables share the same test instance; only the outputs of DialoGPT-FT, DialoGPT-PPLM and DialoGPT-SideControl differ across training-set sizes:

Dialogue context: "Well , have you ever heard of Oulong tea and compressed tea ?" "Oh , yeah . Oulong tea is good for one's health . isn't it ?" "You surely know a lot about Chinese tea." "Sure , I like drinking tea at teahouses." "Oh , so do I."
Target dialogue act: "directive"
Ground Truth: "Why don't we go for one now ?"
DialoGPT-Ori: "I'm sure."
DialoGPT-FUDGE: "I'm not sure if you're being serious or not, but I'm pretty sure that's not what he's saying."

Table 20 (semantic label control under 1000 training samples sampled from DailyDialog, where λ = 10^3 for L_control in DialoGPT-SideControl):
DialoGPT-FT: "yes. I have a large appetite. I like eating. I like cooking. I like shopping. I like to go to the markets."
DialoGPT-PPLM: "I know that tea from somewhere, so it must be Chinese."
DialoGPT-SideControl: "You know your way, I can't be sure."

Table 22 (semantic label control under 5000 training samples sampled from DailyDialog, where λ = 10^4 for L_control in DialoGPT-SideControl):
DialoGPT-FT: "But do you like Chinese tea better than American tea?"
DialoGPT-PPLM: "It's not tea if you mix tea and tea together"
DialoGPT-SideControl: "That's right. We should go to a tea party."

A further sample:
DialoGPT-FT: "you like the Chinese traditional silk?"
DialoGPT-PPLM: "I think it's the Chinese version that's for me"
DialoGPT-SideControl: "You are the second person to make my point!"