Personalized Response Generation with Tensor Factorization

Personalized response generation is essential for more human-like conversations. However, how to model user personalization information when no explicit persona descriptions or demographics are available remains under-investigated. To tackle the data sparsity problem and the huge number of users, we utilize tensor factorization to model users' personalization information with their posting histories. Specifically, we introduce the personalized response embedding for all question-user pairs and form them into a three-mode tensor, decomposed by Tucker decomposition. The personalized response embedding is fed to either the decoder of an LSTM-based Seq2Seq model or a transformer language model to help generate more personalized responses. To evaluate how personalized the generated responses are, we further propose a novel ranking-based metric called Per-Hits@k, which measures how likely the generated responses are to come from the corresponding users. Results on a large-scale conversation dataset show that our proposed tensor factorization based models generate more personalized and higher quality responses compared to baselines.


Introduction
Building human-like conversational systems has received much attention in artificial intelligence communities, and personalized response generation is one essential step towards this goal, as more personalized responses are often associated with increased user engagement (Shum et al., 2018). To this end, we focus on the task of personalized response generation in this work, and argue that incorporating personalization into text generation can benefit many down-stream applications such as social chit-chat chatbots (Zhang et al., 2018) and auto-complete responses like Smart Replies (Kannan et al., 2016).
Prior text generation work on modeling personalization mainly relied on explicitly given persona or demographic information. For instance, one line of work (Zhang et al., 2018; Wolf et al., 2019; Xu et al., 2020) utilized a set of persona sentences to profile users, while another leveraged demographics to model user personalization (Zheng et al., 2019, 2020). Despite their effectiveness, such approaches are limited in real-world scenarios. First, explicit persona or demographic information is often not available. Second, collecting such personalization information is usually costly and time-consuming, and it suffers from either artificially designed persona descriptions from third-party annotators or subjective and unreliable self-reports from users themselves (Stone et al., 1999). Although such explicit personalization information is often unavailable, content that users produce is generally ubiquitous and can indicate their preferences, personal information, styles, and knowledge in a relatively implicit but objective manner. Our work thus utilizes the posts and comments users made to learn latent representations of their personalization information.
Different generation models have been designed to learn user personalization information and impose such representations on text generation. For instance, Li et al. (2016) proposed the Speaker model, based on the Seq2Seq framework, which introduces a trainable speaker embedding for each user and feeds it to the decoder at each decoding step. However, there is often a large number of distinct users, and each user participates in only a few conversations; as a result, the speaker embedding may be under-fitted given the limited data points associated with a user. Another line of research uses generative memory networks (Zhang et al., 2018), which first retrieve the responses most relevant to a user's input as the memory and then encode them into an embedding. The difference between the embedding from a memory network and a speaker embedding is that the former encodes information about both the question and the user, while the latter represents only the user. Nevertheless, the set of observable question-user pairs and their responses is still a small subset of the full user and question sets, leading to the sparsity issue.
Matrix factorization (MF) has been widely used to infer latent relationships between users and items in recommender systems, especially under data sparsity (Bokde et al., 2015). Motivated by this, we propose to model latent interactions between questions and users by looking at who participated in which conversations, and to infer user personalization information from data automatically for personalized response generation. However, whereas the score or rating used in recommender systems denotes a user's preference towards an item, such a scalar is not enough to represent the semantic meaning of a response. We therefore introduce a response vector, the personalized response embedding, to indicate the response content that a user would produce for a given conversation, resulting in a tensor representation over all question-user pairs. Decomposing this tensor (tensor factorization, TF) yields factorized representations for each user, question, and dimension of the response embedding. We propose to augment response generation models with such TF-induced modules, which are model-agnostic and can be applied to many different generation models. Specifically, we introduce a TF-module-based framework on top of an LSTM-based Seq2Seq model and a transformer language model for personalized response generation, trained end-to-end.

Evaluating response generation usually considers content relatedness and language quality, using metrics such as BLEU and perplexity to ensure that generated text is relevant, grammatically correct, and fluent. However, evaluating personalization in personalized response generation is relatively challenging, as effective metrics are lacking.
To this end, we propose a novel evaluation metric, Per-Hits@k, to measure personalization: for a response generated for a user, we first calculate its perplexity under the language models of all users, each instantiated from a pre-trained GPT-2 language model (Radford et al., 2019) fine-tuned on that user's responses, and then check whether its perplexity under the corresponding user's language model ranks in the top-k. Our contributions are:
• a tensor factorization based framework to model personalization for the response generation task;
• a new metric, Per-Hits@k, to evaluate the personalization of generated responses;
• experimental results on a large-scale personalized Reddit dataset showing that our TF-based framework significantly outperforms previous methods in terms of both content generation quality and personalization.

Related Work
Personalized Response Generation Personalization has received much attention in the natural language processing community, including personalized image captioning (Chunseong Park et al., 2017), personalized machine translation (Rabinovich et al., 2017), personalized response generation, personalized intent classification, and personalized slot tagging (Liu et al., 2016). Prior studies formulate response generation as generating an output given an input text, mainly based on either sequence-to-sequence (Seq2Seq) models (Vinyals and Le, 2015) or pre-trained models like GPT-2 (Radford et al., 2019) and BART (Lewis et al., 2019). For personalized response generation, the Speaker model (Li et al., 2016) extended traditional response generation models by assigning each user a trainable speaker ID embedding. Another line of research focuses on leveraging persona descriptions or demographic attributes (Zheng et al., 2020; Qian et al.; Wolf et al., 2019; Luo et al., 2019), building on recent personalized dialogue datasets such as PERSONA-CHAT (Zhang et al., 2018) and PersonalDialog (Zheng et al., 2019). For instance, Xu et al. (2020) utilized predefined user persona descriptions together with their semantically correlated content for generating personalized responses in dialogue systems. Different learning paradigms have also been introduced for personalized response generation, such as reinforcement learning (Mo et al., 2016; Yang et al., 2018; Xu et al., 2020) and transfer learning to benefit from a source domain with sufficient training data (Yang et al., 2017). However, most of the aforementioned approaches require explicit persona or demographic information, which is often unavailable in real-world scenarios. To fill this gap, we propose to learn latent representations of personalized user information from users' posts and to model personalization jointly with traditional generation methods.
Evaluation Metrics for Personalized Response Generation Current automatic evaluation metrics for response generation can be broadly categorized into three classes. (1) Content relatedness measures how related a generated response is to its corresponding ground truth, with representative metrics such as BLEU (Papineni et al., 2002), NIST (Doddington, 2002), and METEOR (Lavie and Agarwal, 2007). The speaker-sensitive responses evaluation model (SSREM) (Bak and Oh, 2020) enhances the relatedness score with a context-response classifier.
(2) Language quality mainly refers to fluency and diversity, where the former is measured via perplexity (Chen et al., 1998) and the latter via distinct-N (Li et al., 2015), which indicates how diverse the generated responses are. (3) Style adherence evaluates how well the language style of the generated responses adheres to the user's own style; example metrics include the average negative log-likelihood (NLL) of a poet's generated lyrics under that poet's specific language model (Vechtomova et al., 2018), stylistic alignment (Syed et al., 2020), which examines language style alignment at the surface, lexical, and syntactic levels, and Hits@1/N (Dinan et al., 2019), which measures how accurately a generated response can be classified to its corresponding user by a classifier. Our proposed Per-Hits@k metric thus belongs to the style adherence class, and is more fine-grained than the average NLL metric (Vechtomova et al., 2018).

Tucker Decomposition
To learn latent associations between users, questions, and responses for personalized response generation, we choose Tucker decomposition, a widely used tensor factorization algorithm. Tucker decomposition (Tucker, 1966) decomposes a given 3-mode tensor X ∈ R^(I×J×K) into a core tensor G ∈ R^(R1×R2×R3) and three factor matrices A^(1) ∈ R^(I×R1), A^(2) ∈ R^(J×R2), A^(3) ∈ R^(K×R3):

X ≈ G ×_1 A^(1) ×_2 A^(2) ×_3 A^(3).

Here, ×_i denotes the mode-i product of a tensor by a matrix (i ∈ {1, 2, 3}). Any element X_(i,j,k) of X can thus be approximated by:

X_(i,j,k) ≈ Σ_{r1=1..R1} Σ_{r2=1..R2} Σ_{r3=1..R3} G_(r1,r2,r3) A^(1)_(i,r1) A^(2)_(j,r2) A^(3)_(k,r3).
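As a concrete illustration, the mode-i products and the element-wise approximation above can be sketched in NumPy (all sizes below are illustrative, not the paper's settings):

```python
import numpy as np

# Illustrative sizes: X is I x J x K, core tensor G is R1 x R2 x R3.
I, J, K = 4, 5, 6
R1, R2, R3 = 2, 3, 2

rng = np.random.default_rng(0)
G = rng.standard_normal((R1, R2, R3))   # core tensor
A1 = rng.standard_normal((I, R1))       # factor matrix, mode 1
A2 = rng.standard_normal((J, R2))       # factor matrix, mode 2
A3 = rng.standard_normal((K, R3))       # factor matrix, mode 3

# X = G x_1 A1 x_2 A2 x_3 A3, written as a single contraction.
X = np.einsum('abc,ia,jb,kc->ijk', G, A1, A2, A3)

# Element-wise form: X[i,j,k] = sum over (r1,r2,r3) of
#   G[r1,r2,r3] * A1[i,r1] * A2[j,r2] * A3[k,r3]
def element(i, j, k):
    total = 0.0
    for r1 in range(R1):
        for r2 in range(R2):
            for r3 in range(R3):
                total += G[r1, r2, r3] * A1[i, r1] * A2[j, r2] * A3[k, r3]
    return total

assert np.isclose(X[1, 2, 3], element(1, 2, 3))
```

The `einsum` contraction and the triple loop compute the same quantity; the former is what one would use in practice.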

LSTM-based Seq2Seq Model
The LSTM-based Seq2Seq model consists of an encoder LSTM, a decoder LSTM, and an attention mechanism (Yao et al., 2015). Suppose the source text is S = (x_1, x_2, ..., x_m) and the target text is T = (x_{m+1}, x_{m+2}, ..., x_N). The encoder LSTM first encodes S into a hidden vector h^e_m and a cell vector c^e_m; the decoder LSTM then takes these as its initial hidden vector h^d_0 and cell vector c^d_0:

h^d_0 = h^e_m,  c^d_0 = c^e_m.

The hidden vector of the decoder at time step t is:

h^d_t = g(h^d_{t-1}, y*_t),

where g is the LSTM cell operation and y*_t is the embedding of the input token at time step t.
Standard Seq2Seq models are not personalized, because there is no mechanism to incorporate user-specific information into their input. The Speaker model (Li et al., 2016) alleviates this by explicitly concatenating a trainable speaker embedding v_j for user j to y*_t. Therefore, the hidden vector of the Speaker model's decoder at time step t is:

h^d_t = g(h^d_{t-1}, [y*_t; v_j]).
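A minimal sketch of the Speaker model's decoder input (embedding sizes here are illustrative; the function name is ours, not from the paper):

```python
import numpy as np

def speaker_decoder_input(y_t, v_j):
    """Decoder input for user j at step t in the Speaker model:
    the token embedding y_t concatenated with the speaker embedding v_j."""
    return np.concatenate([y_t, v_j])

y_t = np.zeros(512)  # token embedding at step t (illustrative size)
v_j = np.ones(30)    # trainable speaker embedding for user j
x_t = speaker_decoder_input(y_t, v_j)
assert x_t.shape == (542,)  # the LSTM cell g consumes this concatenation
```

In a real implementation v_j would be a row of a trainable embedding table, updated by backpropagation together with the rest of the model.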

Transformer Language Model
DialoGPT (Zhang et al., 2019) is a pre-trained conversational response generation model. Based on the architecture of GPT-2 (Radford et al., 2019), DialoGPT is trained on 147M Reddit discussions. For a question-user pair (i, j) with source input S and target response T, DialoGPT generates responses by modeling the conditional probability:

p(T | S) = Π_{t=m+1..N} p(x_t | x_1, ..., x_{t-1}).

Method
We formulate the task of personalized response generation as follows: given a question-user pair (q, u) ∈ S_q × S_u, where S_q and S_u refer to the question set and the user set respectively, generate a response r for this pair, i.e., the response posted by user u for question q. The overall model architecture is described in Figure 1.

Tensor Factorization Module
To enable personalized response generation, we first need to automatically infer personalized signals that users demonstrate in their participation such as questions that they might interact with, as such signatures are often not explicitly available. To this end, we introduce personalized response embedding p i,j , a K-dimensional vector, to represent the latent relationship between a question i and a user j. We then form a tensor using all p i,j over all question-user pairs and factorize this tensor, to learn latent interactions between questions, users, and their responses.
Formally, for a dataset with I = |S_q| questions and J = |S_u| users, we have a tensor P ∈ R^(I×J×K), where P_(i,j,:) = p_i,j denotes each (i, j) pair; the notation P_(i,j,:) refers to a mode-3 fiber (or tube) of the tensor P. P can be further formulated via Tucker decomposition as:

P ≈ G ×_1 Q ×_2 U ×_3 V,

where Q ∈ R^(I×R1), U ∈ R^(J×R2), and V ∈ R^(K×R3) are the factor matrices, and G ∈ R^(R1×R2×R3) is a core tensor. Once these factor matrices and core tensor are determined, the personalized response embedding p_i,j for any question-user pair (i, j) can be calculated as:

p_i,j = V G_(3) (q_i ⊗ u_j)^T,

where G_(3) ∈ R^(R3×R1R2) is the mode-3 matricization of G, q_i and u_j denote the i-th and j-th row vectors of Q and U respectively, and ⊗ is the Kronecker product.
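The computation of p_i,j from the factorized pieces can be sketched in NumPy; the Kronecker-product form and the direct contraction of the Tucker form agree (sizes illustrative):

```python
import numpy as np

R1, R2, R3, K = 2, 3, 2, 4
rng = np.random.default_rng(1)
G = rng.standard_normal((R1, R2, R3))  # core tensor
q_i = rng.standard_normal(R1)          # row i of the question factor Q
u_j = rng.standard_normal(R2)          # row j of the user factor U
V = rng.standard_normal((K, R3))       # response-dimension factor

# Mode-3 matricization G_(3): shape R3 x (R1*R2), with
# G3[c, a*R2 + b] = G[a, b, c] to match np.kron's ordering below.
G3 = np.transpose(G, (2, 0, 1)).reshape(R3, R1 * R2)

# p_ij = V @ G_(3) @ (q_i kron u_j)
p_ij = V @ G3 @ np.kron(q_i, u_j)

# Equivalent direct contraction of the Tucker form
p_check = np.einsum('abc,a,b,kc->k', G, q_i, u_j, V)
assert np.allclose(p_ij, p_check)
```

In training, G, Q, U, and V are parameters learned jointly with the generation model, so p_i,j is produced on the fly for each (i, j) pair in a batch.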
Next, we introduce different mechanisms to incorporate the TF module, in particular p_i,j, into traditional LSTM-based models and transformer language models. This is essential for training better TF modules, since p_i,j cannot be supervised directly as no ground truth is available for it.

LSTM-based Model with TF Module
To utilize the TF module in a standard LSTM-based Seq2Seq model, we propose to incorporate p_i,j into the initial hidden vector and cell vector of the LSTM decoder to help generate more personalized responses, as the personalized response embedding p_i,j is expected to also encode the target response:

h^d_0 = (1 − λ) h^e_m + λ p_i,j,  c^d_0 = (1 − λ) c^e_m + λ p_i,j.  (Eq. 1)

Here λ is a coefficient that balances the information from the LSTM encoder and the personalized response embedding. Note that our TF module is agnostic to encoder-decoder frameworks, and can be applied similarly to any Seq2Seq model, including but not limited to Seq2Seq, the Speaker model (Li et al., 2016), Seq2Seq with memory network (Zhang et al., 2018), and the Speaker model with memory network. Figure 1 describes how the TF module is integrated with an LSTM-based Seq2Seq model. The TF module is randomly initialized and trained together with the Seq2Seq model. This allows the TF module to receive supervision from the output response, and thus to learn the latent interaction between users and questions and produce personalized response embeddings for the decoder.
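A minimal sketch of the balanced decoder initialization in Eq. 1 (a convex combination under the assumption that p_i,j shares the decoder's hidden size, as in our settings; the function name is ours):

```python
import numpy as np

def init_decoder_state(h_enc, c_enc, p_ij, lam=0.2):
    """Initialize the LSTM decoder state as an interpolation of the
    encoder's final state and the personalized response embedding p_ij
    (Eq. 1); lam balances the two information sources."""
    h0 = (1.0 - lam) * h_enc + lam * p_ij
    c0 = (1.0 - lam) * c_enc + lam * p_ij
    return h0, c0

h_enc = np.ones(512)        # encoder's final hidden vector h^e_m
c_enc = np.ones(512)        # encoder's final cell vector c^e_m
p_ij = np.full(512, 2.0)    # personalized response embedding
h0, c0 = init_decoder_state(h_enc, c_enc, p_ij, lam=0.2)
assert np.allclose(h0, 1.2) and np.allclose(c0, 1.2)
```

Setting `lam=0` recovers the standard Seq2Seq initialization, which is why the Seq2Seq baseline is the λ = 0 special case in the ablation study.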

Transformer with TF Module
The recent success of DialoGPT (Zhang et al., 2019) on conversational response generation shows the potential of (pre-trained) transformer language models for the task of response generation. We thus propose to incorporate the TF module into a transformer language model (specifically, DialoGPT) for personalized response generation. Since DialoGPT is a language model rather than a Seq2Seq model, it does not have an encoder-decoder architecture but a single transformer, so we cannot use p_i,j as the initial hidden vector of a decoder as in Eq. 1. Instead, we add the personalized response embedding p_i,j to the input token embedding, token type embedding, and positional embedding to form the input embedding of the DialoGPT model. As shown in Figure 2, the personalized response embedding p_i,j is added to the tokens "<EOS>", "klein" and "bleu" in the input to decode the j-th user's response for the i-th question. The TF module that produces p_i,j is also trained together with the DialoGPT model in an end-to-end fashion.
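The input-embedding construction can be sketched as follows: p_i,j is broadcast-added only to the positions belonging to the response segment, leaving the source tokens untouched (sizes and the response boundary below are illustrative):

```python
import numpy as np

seq_len, d = 6, 8                          # illustrative sequence length / hidden size
rng = np.random.default_rng(2)
tok = rng.standard_normal((seq_len, d))    # token embeddings
typ = rng.standard_normal((seq_len, d))    # token type embeddings
pos = rng.standard_normal((seq_len, d))    # positional embeddings
p_ij = rng.standard_normal(d)              # personalized response embedding

# Response tokens start after the source segment; only they receive p_ij.
response_start = 3
mask = (np.arange(seq_len) >= response_start)[:, None].astype(tok.dtype)

inputs = tok + typ + pos + mask * p_ij     # input embeddings fed to DialoGPT
assert np.allclose(inputs[0], tok[0] + typ[0] + pos[0])         # source token
assert np.allclose(inputs[4], tok[4] + typ[4] + pos[4] + p_ij)  # response token
```

Because the addition happens before the transformer layers, the pre-trained DialoGPT weights are reused unchanged and gradients flow back into the TF module through p_i,j.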

Dataset
To study the task of personalized response generation with no explicit personalization information, we used a personalized Reddit dataset, PER-CHAT, consisting of 200,156 responses that users posted to different questions on r/AskReddit (Wu et al., 2021). Building upon Wu et al. (2021), we kept active users who joined more discussions than average and popular questions that received more comments, resulting in 4,724 users and 39,187 questions. We converted all url links, emails, and digits into the unique tokens "url", "email", and "digit", and normalized replicated words and punctuation to their standard forms. We sampled 3 responses per user for the validation and test sets; the rest were used for training.

Figure 2: Input representation for the DialoGPT model with TF module. The TF module's personalized response embedding p_i,j is added to each response token's word embedding, token type embedding, and positional embedding.

Baselines and Our Models
We introduced several baselines for comparison with our proposed models. (1) DialoGPT: a response generation model based on DialoGPT-medium, provided by Zhang et al. (2019); (2) Seq2Seq: a standard Seq2Seq model with attention and no personalization information; (3) Speaker model: our implementation of the Speaker model (Li et al., 2016). Following Kottur et al. (2017), the speaker embeddings were not initialized randomly but set to the average sentence embeddings of all of a user's historical responses via sentence-BERT (Reimers and Gurevych, 2020), reduced to 30 dimensions by principal component analysis; (4) Memory network: our implementation of the generative memory network (Zhang et al., 2018) based on our Seq2Seq model with attention. We retrieved the top-10 most relevant responses from a user for each question as the memory; (5) Memory+Speaker: the generative memory network (Zhang et al., 2018) combined with the speaker embedding. Our models build on the aforementioned baselines by further incorporating our proposed TF module, i.e., the personalized response embedding from the TF module. DialoGPT+TF is a DialoGPT model with the personalized response embedding added at each time step of the decoding stage, as shown in Figure 2. Seq2Seq+TF, Speaker+TF, Memory+TF, and Memory+Speaker+TF are constructed on top of our baseline models with the personalized response embedding added to the decoder as in Eq. 1.

Evaluation Metrics
We evaluated different models with F1, BLEU, Distinct-N, perplexity (PPL), and our proposed Per-Hits@k. Here, F1 (Dinan et al., 2019) refers to the harmonic mean of precision and recall computed over tokens shared between the generated and ground-truth responses. BLEU (Papineni et al., 2002) was first proposed for machine translation but is also widely used for evaluating response generation. Distinct-N (Li et al., 2015) evaluates lexical diversity; we report distinct unigrams (Distinct-1) and bigrams (Distinct-2). We used perplexity to evaluate the fluency of the generation model.

Per-Hits@k for Personalization Evaluation
To evaluate the personalization in responses generated for a user, one needs a good understanding of that particular user, who may have a very long posting history (500 responses per user on average in our dataset), making it hard for annotators to judge how personalized a generated response is. Besides, not every response from a user reveals their personalization information. Thus, we propose Per-Hits@k, an automatic metric to evaluate the degree of personalization of different generation models. Suppose we have N users and M_i responses generated for user i to be evaluated. We first train a user-specific language model LM_i for each user i on all of their responses in the training set. We then compute the perplexity of the j-th generated response of user i under all users' language models, denoting its perplexity under user n's language model as ppl^n_{i,j}. We rank the perplexity of user i's j-th response over the N user language models (the lower the perplexity, the higher the rank), and denote the rank of its perplexity under user i's own language model LM_i as rank(ppl^i_{i,j}). We define Per-Hits@k as:

Per-Hits@k = ( Σ_{i=1..N} Σ_{j=1..M_i} 1[rank(ppl^i_{i,j}) ≤ k] ) / ( Σ_{i=1..N} M_i ),

where 1[·] is the indicator function. This measures how likely a generated response is to be ranked in the top-k by its corresponding user's language model among the N users. In our implementation, we fine-tuned GPT-2 (small) (Radford et al., 2019) for each user i to instantiate that user's language model LM_i. To ensure the quality of LM_i, we only consider a subset of N = 500 users, chosen as those with the most responses.
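Given a precomputed perplexity table, the metric reduces to a rank-and-count, which can be sketched in a few lines (a minimal sketch; ties are broken in favor of the user's own model here, one of several reasonable conventions):

```python
def per_hits_at_k(ppl, k):
    """Per-Hits@k from a perplexity table.

    ppl[i][j][n] is the perplexity of user i's j-th generated response
    under user n's language model. A response "hits" if its perplexity
    under its own user's LM (n == i) ranks within the top-k (lowest)
    across all N user LMs.
    """
    hits, total = 0, 0
    for i, responses in enumerate(ppl):
        for scores in responses:  # perplexities over the N user LMs
            # rank of user i's own LM: 1 + number of strictly lower perplexities
            rank = 1 + sum(1 for s in scores if s < scores[i])
            hits += rank <= k
            total += 1
    return hits / total

# Two users, one response each, scored by both user LMs.
ppl = [
    [[1.0, 5.0]],  # user 0's response: own LM gives the lowest perplexity (rank 1)
    [[2.0, 4.0]],  # user 1's response: own LM ranks 2nd
]
assert per_hits_at_k(ppl, 1) == 0.5
assert per_hits_at_k(ppl, 2) == 1.0
```

The expensive part in practice is filling the table: each of the Σ M_i responses must be scored by all N = 500 fine-tuned GPT-2 models.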

Implementation Details
We implemented our models with PyTorch (Paszke et al., 2019). For the TF module, the core tensor is of size 50 × 50 × 50; the dimension of the personalized response embedding is 512 for all Seq2Seq-based models with TF module (denoted Seq2Seq-based+TF) and 1024 for the DialoGPT+TF model. For any Seq2Seq-based+TF model, both encoder and decoder have 2 LSTM layers with hidden size 512, while the DialoGPT+TF model is based on the pre-trained medium DialoGPT model with hidden size 1024. Any word appearing more than three times was included in the vocabulary of the Seq2Seq-based+TF models, giving a vocabulary of 30K; the DialoGPT+TF model uses the pre-trained Byte-Pair-Encoding (BPE) tokenizer with a vocabulary of 50,257. The λ coefficient in Eq. 1 was set to 0.2. Adam (Kingma and Ba, 2014) was used as the optimizer, with the learning rate set by grid search to 1e-3 for Speaker+TF and 1e-5 for DialoGPT+TF. Top-k (k = 2) sampling (Fan et al., 2018) was used without any re-scoring techniques to generate responses at test time. We selected models with the highest average Per-Hits@k (k = 1, 2, 3, 4, 5) on the validation set.

Results
As shown in Table 1, we report F1, BLEU, Distinct-N, and Per-Hits@k on the test data. Distinct-N and Per-Hits@k on the ground-truth test data and Per-Hits@k under random ranking are also reported. (In Table 1, significance over the corresponding baseline is tested with a paired t-test for metrics other than Per-Hits@k; significant results (p < 0.05) are marked with *.) Overall, we found that TF-based models significantly improved the personalization metric Per-Hits@k over all baselines, with comparable or even better performance on the other metrics. Specifically, our proposed Seq2Seq+TF model had an average Per-Hits@k score 4 times higher than the Seq2Seq baseline, and the Memory+Speaker+TF model had the highest personalization score. This demonstrates that our proposed TF module can model user personalization well using users' posting histories. Furthermore: 1) Per-Hits@k on the ground-truth data was far below its upper bound of 100% but still much higher than the Per-Hits@k of the generation models, showing the effectiveness of our Per-Hits@k metric for evaluating user personalization. For example, a Per-Hits@1 score of 9.47% indicated that 9.47% of the ground-truth responses were ranked top-1 by their users' language models over the 500 users. One explanation why Per-Hits@1 on ground-truth data was far below 100% might be that responses from a user do not necessarily always reveal their persona. 2) Although neither Seq2Seq nor DialoGPT modeled user personalization explicitly, they had higher-than-random Per-Hits@k.
Since we obtained relatively high Per-Hits@k on the ground-truth test set, we hypothesize that the ground-truth responses ranked highest by Per-Hits@k are more likely to contain user personalization information; in other words, for certain question-user pairs, a user is more likely to respond with personalized content that can be better recognized by their language model.

Core Tensor Rank We varied the rank of the core tensor to study how well the TF module captures user-question interactions. When the rank reaches around 50, there appear to be limited average gains on Per-Hits@k. Thus, we chose a core tensor of shape 50 × 50 × 50 for our TF module.
The Balancer λ We then studied the influence of the λ coefficient in Eq. 1, which balances the question information from the encoder and the personalized response embedding from the TF module. We varied the Seq2Seq+TF model's λ from 0 to 1, as shown in Figure 3(b). Note that Seq2Seq+TF with λ = 0 is the Seq2Seq baseline. We observed that Per-Hits@k increased substantially when λ changed from 0 to 0.1, confirming the effectiveness of our proposed TF module in modeling user personalization. Moreover, the TF module was not sensitive to the hyper-parameter λ, as Per-Hits@k was stable for λ ∈ [0.1, 0.4]. Per-Hits@k decreased when λ was larger than 0.4, suggesting the importance of balancing the encoder and the TF module.
User Factor Matrix To examine whether the TF module has learned user personalization information in the user factor matrix U, we trained Speaker models whose speaker embeddings were initialized in different ways. Specifically, we took the user factor matrix from the Seq2Seq+TF model in Table 1 and compared: 1) random initialization (Random); 2) average sentence embeddings of each user's historical responses (History), as used in our Speaker model baseline; 3) our user embeddings in U (TF-u); and 4) the concatenation of the history embeddings and our user embeddings in U (History+TF-u). The results of the four variants of the Speaker model are shown in Table 3. We found that both History and TF-u initialization improved Per-Hits@k over Random to some extent, suggesting that our TF module has learned some degree of user personalization in its user factor matrix U. Although TF-u yielded a smaller Per-Hits@k improvement over Random, History+TF-u had the best Per-Hits@k, indicating that the personalization information learned by the TF module differs from that in users' posting histories.

Robustness of Personalization Metric
To test the robustness of our Per-Hits@k metric, we replaced the user-specific language models in Per-Hits@k with trigram language models trained using the KenLM toolkit (Heafield et al., 2013). While GPT-2 is a transformer-based language model pre-trained on a large corpus that can be fine-tuned on each user's corpus, KenLM's n-gram language models are instead trained from scratch directly on each user's corpus. We thus had two Per-Hits@k variants: Per-Hits@k-GPT2 (the one used in previous sections) and Per-Hits@k-KenLM. We evaluated both variants for all the models we trained under different settings and plotted all (Per-Hits@k-KenLM, Per-Hits@k-GPT2) pairs for k ∈ {1, 2, 3, 4, 5} in Figure 4. With a correlation of 0.941 between the two variants, we conclude that Per-Hits@k is robust, producing consistent and similar judgements regardless of which language model it uses.

Conclusion and Discussion
This work proposed a tensor factorization module to model user personalization from users' posting histories for the task of personalized response generation, where explicit persona or demographic information is unavailable. To automatically evaluate the personalization of generated responses, we proposed a new evaluation metric called Per-Hits@k. Extensive experiments on a large-scale dataset show that our proposed TF module outperforms previous methods significantly in terms of both content generation quality and the personalization of generated responses. Our ablation studies further demonstrated the effectiveness and robustness of our TF-based generation framework.
One limitation to note for our work is that our tensor factorization based framework to model personalization has only been tested on a corpus derived from Reddit (Wu et al., 2021). We acknowledge that potential user population bias might be introduced in this process. Another limitation of our results lies in dealing with new users, i.e., the cold start problem. Future research could further examine these issues, build upon our work to examine how different types of implicit information such as social knowledge and commonsense might be learned together with these user profiles in this tensor factorization manner, and model personalization in multi-turn dialogue systems.