Modulating Language Models with Emotions

Generating context-aware language that embodies diverse emotions is an important step towards building empathetic NLP systems. In this paper, we propose a formulation of modulated layer normalization -- a technique inspired by computer vision -- that allows us to use large-scale language models for emotional response generation. In automatic and human evaluation on the MojiTalk dataset, our proposed modulated layer normalization method outperforms prior baseline methods while maintaining diversity, fluency, and coherence. Our method also obtains competitive performance even when using only 10% of the available training data.


Introduction
Building interactive systems that can understand and express human emotions has been a long-term goal of artificial intelligence (Shen and Feng, 2020; Salovey and Sluyter, 1997). Given a context, an intelligent agent ought to be able to generate responses that not only consider the context but also reflect a specified emotion, a task called emotional response generation. One common representation of emotions is through emojis, which often convey the underlying emotions in an utterance (Zhou and Wang, 2018). Table 1 shows an example generation in this formulation.
To tackle this problem, prior work has proposed a number of different models, including variants of sequence-to-sequence (Seq2Seq) models (Serban et al., 2016; Li et al., 2016a), variational autoencoders (VAEs) (Gu et al., 2019; Shen et al., 2017), and adversarial networks (Kong et al., 2019; Yu et al., 2017). Their generated responses are often dull or generic, partially due to the limited training data for diverse emotions. More recent studies have tried to pre-train language models (LMs) on domain-specific data to pivot generation towards a certain direction (Keskar et al., 2019). However, training an LM from scratch can be costly, and collecting sufficient pre-training data for diverse emotions is also challenging, especially for low-resource emotions (Yang et al., 2019a).

Table 1: Example generations of our method for four different emojis. The context is an actual random tweet ("good game start morning off tigers v eagles"), and the emotion is specified by an emoji. The four generated responses are: "good luck to all the eagles", "i m not a tigers fan but we ve got a win", "we ve got to wait for tommorrow for the game", and "hope you enjoyed the match with your team".
In this work, we present a simple and easy-to-deploy technique that enables pre-trained large-scale LMs to generate fine-grained emotional responses. Specifically, we inject emotional signals specified by 64 commonly used emojis via Modulated Layer Normalization (Mod-LN), a technique widely adopted in computer vision whose potential has not yet been well studied in NLP. The main advantages of our method are:
• Instead of designing or re-training models from scratch, our method is plug-and-play. In this work, we show its effectiveness on BERT (Devlin et al., 2019) and GPT-2 (Radford et al., 2019), but one can easily extend our method to other Transformer-based LMs.
• By fully exploiting the transfer learning ability of pre-trained LMs, we achieve comparable emotional response generation performance as prior best-performing work with only 10% of the training data, which is especially beneficial for low-resource scenarios.
arXiv:2108.07886v1 [cs.CL] 17 Aug 2021

Method

Given a context text and a specified emoji as the target emotion, we aim to generate responses that reflect both the emotion associated with the emoji and the semantic information in the context. In this work, we demonstrate how to inject target emotions through a modulation module for layer normalization (§2.1). We also provide data preparation and model adaptation strategies for two typical LMs (BERT and GPT-2) to aid reproduction and extension (§2.2).

Modulated Layer Normalization
Layer normalization (LN) is commonly used in Transformer-based (Vaswani et al., 2017) language models (LMs) (Devlin et al., 2019; Radford et al., 2019; Yang et al., 2019b) to stabilize hidden state dynamics and reduce training time (Ba et al., 2016). In the vanilla implementation (Figure 1(a)), data are normalized by their own mean µ and standard deviation σ, without relying on external inputs. In contrast to vanilla LN, which only regularizes the data itself, Mod-LN introduces an external modulation module shared across the whole dataset, which is independent of individual data samples and modulates the normalization towards an external input c (Figure 1(b)). Specifically, an input hidden state tensor x in layer l is normalized by Mod-LN as

    Mod-LN^(l)(x, c) = MLP_γ^(l)(c) ⊙ (x − µ) / (σ + ε) + MLP_β^(l)(c),

where ε is a smoothing parameter that avoids division by zero, and MLP_γ^(l) and MLP_β^(l) are two trainable modulation modules for layer l. Each module is computed by

    MLP^(l)(c) = W^(l,2) Swish(W^(l,1) c),

where W^(l,1) and W^(l,2) are dense layers belonging to layer l, with weight sizes [64, ½·dim_h] and [½·dim_h, dim_h], respectively. These dense layers connect the 64 emoji classes to the hidden states of the language model, and a bias b is added to the γ branch. We use the Swish activation (Ramachandran et al., 2017), which has been shown to outperform ReLU (Xu et al., 2015) on several challenging datasets.

Though conceptually simple, such MLP-based modules have been shown to be a faster and more efficient alternative to vanilla dot-product self-attention in NLP (Tay et al., 2021) and CV (Tolstikhin et al., 2021). Our work uses MLPs as plug-and-play modulators rather than as a replacement for self-attention, allowing us to shift the hidden states towards a given target emotion.
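The computation above can be sketched in plain Python. This is a toy illustration rather than the authors' implementation: the hidden size is shrunk, the weights are random, and treating the γ output as an offset from an identity scale is our own assumption.

```python
# Toy sketch of Modulated Layer Normalization (Mod-LN): a one-hot emoji
# vector c (64 classes) conditions per-layer scale/shift MLPs.
import math
import random

N_EMOJIS = 64
DIM_H = 8          # toy hidden size; real LMs use 768 or more
EPS = 1e-5

random.seed(0)

def dense(weights, bias, x):
    """y = W x + b, with W as a list of rows."""
    return [sum(w * xj for w, xj in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def swish(v):
    """Swish activation: x * sigmoid(x), applied elementwise."""
    return [x / (1.0 + math.exp(-x)) for x in v]

def make_mlp(d_in, d_hid, d_out):
    """Two dense layers: [d_in -> d_hid] then [d_hid -> d_out]."""
    w1 = [[random.gauss(0, 0.1) for _ in range(d_in)] for _ in range(d_hid)]
    b1 = [0.0] * d_hid
    w2 = [[random.gauss(0, 0.1) for _ in range(d_hid)] for _ in range(d_out)]
    b2 = [0.0] * d_out
    return lambda c: dense(w2, b2, swish(dense(w1, b1, c)))

mlp_gamma = make_mlp(N_EMOJIS, DIM_H // 2, DIM_H)
mlp_beta = make_mlp(N_EMOJIS, DIM_H // 2, DIM_H)

def mod_ln(x, c):
    """Normalize x by its own mean/std, then scale and shift with
    gamma(c) and beta(c) predicted from the emoji condition c."""
    mu = sum(x) / len(x)
    sigma = math.sqrt(sum((xi - mu) ** 2 for xi in x) / len(x))
    gamma = mlp_gamma(c)
    beta = mlp_beta(c)
    # adding 1 keeps the scale near identity at init (our choice)
    return [(1.0 + g) * (xi - mu) / (sigma + EPS) + b
            for g, xi, b in zip(gamma, x, beta)]

c = [0.0] * N_EMOJIS
c[3] = 1.0  # target emotion: emoji class 3
h = [random.gauss(0.0, 1.0) for _ in range(DIM_H)]
out = mod_ln(h, c)
```

Because the modulation MLPs are the only new parameters, the pre-trained LM weights can be left intact, which is what makes the method plug-and-play.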

Data Preparation and Model Adaptation
For the text input, we concatenate the ground-truth context with the corresponding response into a single input for the LMs. We add a pre-defined separator token ([SEP] for BERT and [UNK] for GPT-2) between the context and the response to make the LMs aware of the extent of each part. We also pad both context and response to a maximum sequence length with the padding token.
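A toy sketch of this preparation, using placeholder token strings (here the BERT-style [SEP]; the actual separator differs per model as noted above):

```python
# Concatenate context and response around a separator token, then pad
# to a fixed maximum length. Token strings are illustrative stand-ins,
# not real LM vocabulary ids.
SEP, PAD = "[SEP]", "[PAD]"
MAX_LEN = 12

def build_input(context_tokens, response_tokens, max_len=MAX_LEN):
    seq = context_tokens + [SEP] + response_tokens
    seq = seq[:max_len]                      # truncate if too long
    return seq + [PAD] * (max_len - len(seq))  # pad if too short

x = build_input(["good", "game"], ["good", "luck", "eagles"])
```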
Encoder-Decoder models have been successful in many text-to-text generation tasks, such as question answering (Chen et al., 2017;Seo et al., 2017), news summarization (Chopra et al., 2016;Rush et al., 2015), and style transfer (Li et al., 2018;Liu et al., 2021). For the response generation task, the encoder encodes the context text into a fixed-length vector in latent space, while the decoder decodes the generated response tokens step-by-step, given the encoded context vector and the ground truth token from the previous step; this method is also known as teacher-forcing (Zhang et al., 2019c;Cho et al., 2014).
In this work, we consider leveraging the transfer learning power of large-scale LMs -- using LMs as encoder and decoder -- to better capture the complicated relationship between context and response (Rothe et al., 2020). Auto-regressive LMs (ARLMs), such as GPT-2, are trained to iteratively predict the next token given the past, while Masked Language Models (MLMs), such as BERT, are trained to predict missing tokens given both the preceding and subsequent text. In contrast to the uni-directional attention flow in an ARLM, the attention flow of an MLM is bi-directional; thus, if we directly use an MLM as the decoder, the prediction of tokens in the response will also attend to (i.e., have the context of) future tokens, which could potentially lead to exposure bias (Schmidt, 2019). Inspired by recent text-to-text LMs such as T5 (Raffel et al., 2020) and BART, for the MLM decoder we modify the original bi-directional attention mask to make it uni-directional.
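As a sketch, the uni-directional (causal) mask can be built as a lower-triangular 0/1 matrix over sequence positions, so that position i may attend only to positions j ≤ i; this illustrates the masking rule generically rather than the authors' exact implementation:

```python
# Build a causal attention mask: mask[i][j] == 1 iff position i is
# allowed to attend to position j (i.e., j <= i). Replacing an MLM's
# all-ones bi-directional mask with this matrix makes it uni-directional.
def causal_mask(seq_len):
    return [[1 if j <= i else 0 for j in range(seq_len)]
            for i in range(seq_len)]

m = causal_mask(4)
```

In practice this matrix is added (as 0 / −inf) or multiplied into the attention scores before the softmax, so masked positions receive zero attention weight.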
We experiment with two encoder-decoder models built on MLM and ARLM: 1) BERT-to-BERT: using bi-directional BERT as both encoder and decoder, but forcing the decoder BERT to attend to past context with uni-directional mask, and 2) GPT2-to-GPT2: using uni-directional GPT-2 as both encoder and decoder.

Experimental Setup
Dataset. For all experiments, we use the MojiTalk (Zhou and Wang, 2018) dataset, a large Twitter conversation corpus (N ≈ 700k) whose responses each contain one or more of 64 popular emojis. Following the original paper, we split the corpus into training, validation, and test sets of 596,959, 32,600, and 32,600 conversation pairs, respectively. We fine-tune the two LM-based encoder-decoder models on this dataset and generate responses given contexts and all possible emotions using top-k random decoding (Fan et al., 2018) on a machine with four RTX 2080 GPUs.
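Top-k random decoding can be sketched as follows; this is a generic illustration of the sampling rule (keep the k most probable tokens, renormalize, sample), not tied to the cited implementation:

```python
# Top-k random decoding for one step: restrict the next-token
# distribution to its k highest-probability tokens, renormalize,
# and sample from that subset.
import random

def top_k_sample(probs, k, rng):
    """probs: dict token -> probability; returns one sampled token."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    r = rng.random() * total
    for tok, p in top:
        r -= p
        if r <= 0:
            return tok
    return top[-1][0]  # guard against floating-point underflow

rng = random.Random(0)
# toy next-token distribution (made-up values)
probs = {"good": 0.5, "luck": 0.3, "eagles": 0.15, "the": 0.05}
tok = top_k_sample(probs, k=2, rng=rng)
```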

Evaluation
Good emotional responses should accurately reflect the intended emotion, be diverse, and have coherent language. We thus evaluate three aspects of generated responses: emotion control ( §4.1), response diversity ( §4.2), and coherence and fluency ( §4.3). We also use Amazon Mechanical Turk (MTurk) to run a manual evaluation of emotion control and readability in generated responses ( §4.4).

Emotion Control
First, we evaluate whether the intended emotions are reflected in the responses generated by the various models. We choose DeepMoji (Felbo et al., 2017) as the judgment classifier. DeepMoji was trained on a large-scale emoji dataset containing 1,246 million tweets and 64 distinct emojis and, as far as we know, is the state of the art for 64-emoji classification tasks. Since the meanings of different emojis can overlap with subtle differences, we compute Hits@k (k ∈ {1, 3, 5}) classification accuracy to describe the performance of the models under different criteria. As shown in Table 1, our proposed models outperform R-CVAE by a large margin. Of note, LM-based models are more robust in extreme data-scarcity cases: our models achieve performance comparable to R-CVAE even when using only 10% of the training data. Between BERT and GPT-2, GPT-2 shows superior performance, partially because its weights come from auto-regressive pre-training.
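Hits@k counts a generation as a hit when the target emoji appears among the classifier's top-k predicted classes; a sketch with made-up emotion class names:

```python
# Hits@k accuracy: fraction of examples whose target class appears in
# the classifier's top-k ranked predictions.
def hits_at_k(ranked_preds, targets, k):
    """ranked_preds: per-example list of classes sorted by confidence."""
    hits = sum(1 for ranks, t in zip(ranked_preds, targets) if t in ranks[:k])
    return hits / len(targets)

# toy classifier rankings for three generated responses
preds = [["joy", "love", "sad"],
         ["sad", "joy", "angry"],
         ["angry", "sad", "joy"]]
targets = ["joy", "joy", "joy"]
```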

Generation Diversity
As shown in Table 2, we evaluate the diversity of responses generated by each model in terms of unigram and bigram type-token ratios, average length, and the percentage of stop words in generated responses, with values for the human-generated responses shown for reference. As measured by the type-token ratio for both uni- and bi-grams, our proposed models generate more diverse responses. In addition, compared with R-CVAE, the responses generated by our models are longer and use fewer stop words. This improvement can be attributed to the use of large-scale language models as base models.
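The type-token ratio is simply the number of distinct n-grams divided by the total number of n-grams; a minimal sketch:

```python
# Type-token ratio (TTR) over n-grams: distinct n-grams / total n-grams.
# Higher values indicate more diverse (less repetitive) text.
def type_token_ratio(tokens, n=1):
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(grams)) / len(grams) if grams else 0.0

toks = "good luck to all the eagles good luck".split()
```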

Fluency and Coherence
Moreover, we evaluate the fluency and coherence of the machine-generated text. For fluency, we train a standalone language model on the human-generated responses using KenLM (Heafield, 2011) and measure the perplexity of the generated texts. To evaluate coherence between the context and the generated responses, we compute the similarity between the generated text and human-generated responses using BERTScore (Zhang et al., 2019b), with the human-generated responses as the reference. We configure BERTScore with 24-layer RoBERTa-large, as recommended for English tasks. Table 3 shows these results: for both perplexity and BERTScore, our Mod-LN models outperform R-CVAE in both the 10% and 100% training data cases.
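To make the perplexity metric concrete, here is a hedged stand-in for the KenLM evaluation: KenLM itself fits smoothed n-gram models, while this sketch uses only a unigram model with add-one smoothing to illustrate how perplexity is computed from per-token probabilities:

```python
# Perplexity of a test sequence under a unigram LM with add-one
# smoothing: ppl = exp(-mean log p(token)). Lower is more fluent
# according to the model.
import math
from collections import Counter

def unigram_perplexity(train_tokens, test_tokens):
    counts = Counter(train_tokens)
    vocab = len(counts) + 1        # +1 slot for unseen tokens
    total = len(train_tokens)
    log_prob = sum(math.log((counts[t] + 1) / (total + vocab))
                   for t in test_tokens)
    return math.exp(-log_prob / len(test_tokens))

train = "the cat sat on the mat".split()
```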

Human Evaluation
In total, 120 MTurk participants manually evaluated the emotion control and readability of responses from our proposed models and the original human-generated reference data. The average age of participants was 38.40 years (SD = 12.26, Median = 34.50). More than half (65.8%) of participants were male, and 34.2% were female. The average completion time of each survey was 4.53 minutes. Participants were paid $1 per survey, averaging to more than $13 per hour, significantly above the U.S. federal minimum wage.
Procedure Each participant was assigned to read five randomly selected context-response pairs without being informed of the sources of the responses.
They were asked to rate 1) emotion control: "How well does the emotion conveyed in the response agree with the specified emoji?" (1 = very well to 7 = not at all), and 2) readability: "Please rate the readability of the response on a 7-point scale" (1 = very low to 7 = very high). The readability measure included five items adapted from a previous study (Graefe et al., 2018): well-written, concise, comprehensive, coherent, and clear. Since the five items showed very high agreement (Cronbach's α = .91), we averaged them into a single readability index.
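Cronbach's alpha for the five items follows the standard formula α = k/(k−1) · (1 − Σ var(item_i) / var(total)); a sketch with toy ratings:

```python
# Cronbach's alpha: internal-consistency reliability of k rating items.
# alpha = k/(k-1) * (1 - sum of item variances / variance of item sums).
def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def cronbach_alpha(ratings):
    """ratings: one row per participant, each a list of k item scores."""
    k = len(ratings[0])
    items = list(zip(*ratings))            # per-item score columns
    totals = [sum(row) for row in ratings]  # per-participant sums
    return (k / (k - 1)) * (1 - sum(variance(list(col)) for col in items)
                            / variance(totals))

# two perfectly correlated items -> alpha = 1
alpha = cronbach_alpha([[1, 1], [2, 2], [3, 3]])
```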

Results
Participants' average ratings (µ) and standard errors (SE) are reported in Table 4. Table 4 compares the emotion control and readability of responses from the original data (human reference), the baseline, and our proposed models on a 7-point scale (1: low quality, 7: high quality). We also include the vanilla generative LM, GPT-2, as an ablation reference.
As shown in the table, the standard error of the mean across all annotators is .10, which is very low for a 7-point scale, indicating high agreement among annotators. Responses generated by Mod-LN MLM (BERT), Mod-LN ARLM (GPT-2), and the human-generated references showed no statistically significant differences in emotion control or readability. All were rated significantly higher than plain GPT-2 and R-CVAE on both dimensions (p < .001 for a one-way repeated-measures ANOVA). We also conducted pairwise multiple comparisons as a post-hoc analysis. In terms of emotion control, both of our proposed models and the original reference data were rated significantly better than vanilla GPT-2 (p < .007). For readability, our models, vanilla GPT-2, and the original reference data were all rated significantly more readable than R-CVAE (p < .001).

Related Work
Emotional Text Generation. VAE-based models (Park et al., 2018; Shen et al., 2017; Serban et al., 2017), adversarial networks (Kong et al., 2019; Yu et al., 2017), and reinforcement learning systems (Li et al., 2016b) have dominated sentiment-aware dialogue models. Other methods have been developed using LSTMs (Song et al., 2019) and GRUs. All these methods, however, are built on relatively coarse emotion types, partially due to the limited modeling ability of RNNs. Our model outperforms the current state-of-the-art R-CVAE (Zhou and Wang, 2018) in the same 64-emoji setting.
Modulated Normalization. Though not common in NLP, modulated normalization has been previously used in computer vision. In addition to the work mentioned in the introduction (De Vries et al., 2017), adversarial networks such as CGAN (Miyato and Koyama, 2018), self-attention GAN (Zhang et al., 2019a), and StyleGAN (Karras et al., 2019) have used modulated normalization to inject external signals into their models. In NLP, previous studies have tried to modulate normalization for classification tasks (Houlsby et al., 2019) and multilingual machine translation (Bapna and Firat, 2019); however, both of these methods require architecture-level modifications. Our method, on the other hand, is plug-and-play, requiring minimal modifications to the architecture and thus being easier to deploy across a diverse set of applications.

Conclusions
We have proposed a modulated layer normalization approach to generating responses with varying specified emotions. Our approach allows us to leverage large pre-trained models while remaining simple and easily extendable. In empirical experiments, our approach substantially outperforms prior work and achieves comparable results using only 10% of the available training data, all while maintaining diversity, fluency, and coherence.