Personalized Transformer for Explainable Recommendation

Personalization of natural language generation plays a vital role in a large spectrum of tasks, such as explainable recommendation, review summarization and dialog systems. In these tasks, user and item IDs are important identifiers for personalization. Transformer, which has demonstrated strong language modeling capability, is however not personalized and fails to make use of the user and item IDs, since the ID tokens are not even in the same semantic space as the words. To address this problem, we present a PErsonalized Transformer for Explainable Recommendation (PETER), for which we design a simple and effective learning objective that utilizes the IDs to predict the words in the target explanation, so as to endow the IDs with linguistic meanings and to achieve a personalized Transformer. Besides generating explanations, PETER can also make recommendations, which makes it a unified model for the whole recommendation-explanation pipeline. Extensive experiments show that our small unpretrained model outperforms fine-tuned BERT on the generation task, in terms of both effectiveness and efficiency, which highlights the importance and utility of our design.


Introduction
Recent years have witnessed the successful application of natural language generation. Many of these applications in fact require a certain degree of personalization, such as explainable recommendation (Zhang et al., 2014; Li et al., 2020c), review generation (Dong et al., 2017), review summarization, and conversational systems. In these tasks, user and item IDs that distinguish one user/item from the others are crucial to personalization (our code is available at https://github.com/lileipisces/PETER). For example, in recommender systems, different users may care about different item features (e.g., style vs. quality), and different items may have different characteristics (e.g., fashionable vs. comfortable). The goal of explainable recommendation is to provide an explanation to a user for a recommended item, so as to justify how the recommendation might match his/her interests. That is, given a pair of user ID and item ID, the system needs to generate an explanation, such as "the style of the jacket is fashionable" (see the last column of Table 4 for more examples).
Transformer (Vaswani et al., 2017), whose strong language modeling ability has been demonstrated on a variety of tasks (Radford et al., 2018; Devlin et al., 2019; Brown et al., 2020), is however relatively under-explored for personalized natural language generation. Since IDs and words lie in very different semantic spaces, it would be problematic to directly put them together for attention learning: by doing so, the IDs are treated as words, yet the IDs appear far less frequently than the words. For example, a paragraph of review (and thus hundreds of words) on an e-commerce platform corresponds to only a single pair of user ID and item ID. As such, the IDs may be regarded as out-of-vocabulary tokens, to which the model is insensitive. As shown in Fig. 1(a), when generating an explanation for a user-item pair, standard Transformer relies heavily on the special <bos> token instead of the user or the item. This would result in identical explanations over different user-item pairs (see the USR score in Table 2), deviating from our personalization goal.
[Figure 1: Attention visualization of two models when generating an explanation for the same user-item pair (see the first two columns). Both are from the last attention layer, so the target sequences are offset by one position for better illustration. The larger the attention weights, the lighter the cells.]

To address this problem, we bridge IDs and words by designing an elegant task called context prediction, which maps IDs onto the words to be generated by the explanation task. This in some way resembles one's drafting-polishing process, where, by predicting some words, the context prediction
task does the job of drafting. Then, the explanation generation task polishes these words so as to form a readable sentence. Meanwhile, we demonstrate that conducting the recommendation task on the same model is also feasible, so we name it PETER, which stands for PErsonalized Transformer for Explainable Recommendation. As we can see in Fig. 1(b), when PETER generates an explanation for the same user-item pair, it can utilize the information of both the user and the item, which illustrates the effectiveness of our context prediction task. In addition, PETER is flexible enough to incorporate item features that can help to guide its generation. This can be very useful when, for instance, a user proactively asks the system to explain certain feature(s) of a recommendation (Li et al., 2020c), e.g., price. Then, we would expect the model to generate a targeted explanation, such as "great jacket, especially for the price". PETER is a small unpretrained Transformer with only 2 layers, yet it outperforms a fine-tuned BERT (Ni et al., 2019) on most metrics by a large margin, and takes less time to train, as shown in our experiments. This manifests the superiority of our model.
In summary, our key contributions are: • We propose PETER that makes recommendation and generates explanation simultaneously based on user and item IDs for explainable recommendation. To the best of our knowledge, we are the first to enable Transformer with personalized natural language generation.
• We evaluate the generated explanations on not only text quality metrics (such as BLEU and ROUGE), but also metrics that particularly focus on explainability from the angle of item features. Extensive experiments show that our model can outperform state-of-the-art baselines on large datasets.
• Our solution sheds light on a broader scope of fields that also need personalization (e.g., personalized conversational systems). In addition, it points out a way for Transformer to deal with heterogeneous inputs, e.g., text and images in multimodal artificial intelligence.

Related Work
Explainable recommendation (Zhang et al., 2014) has been studied from two major perspectives: human-computer interaction and machine learning. The former (Gedikli et al., 2014; Chen and Wang, 2017; Chen et al., 2019b) investigates how people perceive different styles of explanations, while the latter provides explanations by designing new explainable recommendation algorithms, to which our work is more related. There exist various types of explanation styles, such as pre-defined templates (Zhang et al., 2014; Li et al., 2020a), ranked sentences (Chen et al., 2019d), image visualizations (Chen et al., 2019c), knowledge graph paths (Xian et al., 2019), reasoning rules (Shi et al., 2020; Zhu et al., 2021), etc. Among these, generated natural language explanations (Ni et al., 2019; Li et al., 2020c) have recently received much attention, mainly owing to the advancement of natural language generation technology and the availability of textual data on recommendation platforms such as e-commerce. However, previous works mostly rely on recurrent neural networks (RNN), e.g., LSTM (Hochreiter and Schmidhuber, 1997) and GRU (Cho et al., 2014), leaving the potentially more effective Transformer under-explored, which motivates this work.
Transformer (Vaswani et al., 2017) was first applied to machine translation with an encoder-decoder architecture. Later works (Devlin et al., 2019) show that it remains effective even when the encoder or the decoder is removed, reducing nearly half of the parameters. Under the paradigm of pre-training plus fine-tuning, Transformer's effectiveness has been confirmed on a wide range of tasks, including both natural language understanding and generation (Radford et al., 2018; Devlin et al., 2019; Dong et al., 2019). In particular, it is able to perform novel tasks, e.g., arithmetic, after scaling up both the model and the training data (Radford et al., 2019; Brown et al., 2020). However, such scaling may not be friendly to researchers who do not possess large amounts of computing resources. Instead, our work explores small unpretrained models, as they are computationally cheaper and more flexible when being adapted to new applications, e.g., personalized generation.
Personalized generation usually involves the IDs of users and items. Previous approaches typically adopt a multi-layer perceptron (MLP) to encode the IDs into a context vector, from which an RNN can decode a word sequence. This strategy can be found in many applications, such as review generation (Dong et al., 2017), tip generation and explanation generation (Li et al., 2020c). However, it does not fit Transformer, which relies entirely on self-attention. Probably because a proper solution to deal with heterogeneous inputs (i.e., IDs and words) is yet to be found, previous works with Transformer for personalized generation replace IDs with text segments, such as persona attributes (Zheng et al., 2020), movie titles and item features (Ni et al., 2019), which are in the same semantic space as the word sequence to be generated. In comparison, our solution is to design an effective task that can give the IDs linguistic meanings.

Problem Formulation
The goal of our explanation task is to generate a natural language sentence Ê_{u,i} for a pair of user u and item i to justify why i is recommended to u. Meanwhile, our model PETER can also make recommendations by estimating a rating r̂_{u,i} that predicts u's preference towards i. At the testing stage, only user u and item i are used as inputs for producing both the explanation and the recommendation. When item features F_{u,i} are available, our model is flexible enough to incorporate them by simply concatenating them at the beginning of the explanation. In this case, the features are also needed at the testing stage. In the following, we will discuss both cases.

Methodology
In this section, we present the details of our model PETER. First, we show how to encode different types of tokens in a sequence. Then, we briefly review Transformer and introduce our revised attention masking matrix. At last, we formulate the three tasks, i.e., explanation generation, context prediction and recommendation, and integrate them into a multi-task learning framework.

Input Representation
We first introduce our way to encode heterogeneous inputs into vector representations.

[Figure 3: The attention masking used in our model, which we call PETER masking. The orange box highlights its difference from the Left-to-Right masking.]

As shown in Fig. 2, the input to our model is a sequence, consisting
of user ID u, item ID i, features F_{u,i}, and explanation E_{u,i}. The user and the item serve the purpose of personalization, i.e., aiming to make the generated explanation reflect both the user's interests and the item's attributes. The features can guide the model to talk about certain topics. For instance, a conversational recommender system may explain a recommendation's specialty to the user with the goal of learning more about his/her preferences. Since the features are not always available, in our experiments we test both cases (with and without them). When they are available, the input sequence can be represented as

S = [u, i, f_1, · · · , f_{|F_{u,i}|}, e_1, · · · , e_{|E_{u,i}|}]

where f_1, · · · , f_{|F_{u,i}|} are the features and e_1, · · · , e_{|E_{u,i}|} are the explanation's word sequence. |F_{u,i}| denotes the number of features and |E_{u,i}| the number of words in the explanation.
Clearly there are three types of tokens in the sequence S, i.e., users, items, and words (including features), for which we prepare three sets of randomly initialized token embeddings U, I and V, respectively, besides the positional embeddings P that encode the position of each token in the sequence. Notice that we do not add users and items to the vocabulary V, given that it costs more time to predict a word out of a huge number of IDs (for example, millions of users and items in e-commerce). After performing embedding lookup, we can obtain the sequence's token representation [u, i, f_1, · · · , f_{|F_{u,i}|}, e_1, · · · , e_{|E_{u,i}|}] and its positional representation [p_1, · · · , p_{|S|}], where |S| is the length of the sequence. The input representation of the sequence is the addition of the corresponding token representation and positional representation, denoted as S_0 = [s_{0,1}, · · · , s_{0,|S|}].
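As a concrete toy sketch of this lookup-and-add step, with embeddings represented as plain Python lists and the table names U, I, V, P mirroring the notation above (the helper itself is illustrative, not the paper's code):

```python
def input_representation(user, item, tokens, U, I, V, P):
    """Build S_0 = token embedding + positional embedding, per position.

    U, I map user/item ids to d-dimensional vectors (lists); V maps
    feature/word ids likewise; P[t] is the positional vector for
    position t. Purely an illustrative sketch.
    """
    # Token representation: [u, i, f_1, ..., e_1, ...]
    token_vecs = [U[user], I[item]] + [V[w] for w in tokens]
    # Element-wise addition with the positional representation
    return [[t + p for t, p in zip(vec, P[pos])]
            for pos, vec in enumerate(token_vecs)]
```

In a real implementation each table would be a learned embedding matrix; the point here is only that the three ID/word vocabularies are kept separate while sharing one positional table.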

Transformer and Attention Masking
To enable the three tasks, we show how to modify the attention masking mechanism in Transformer (Vaswani et al., 2017). Transformer consists of L identical layers, each of which is composed of two sub-layers: multi-head self-attention and position-wise feed-forward network. The l-th layer encodes the previous layer's output S_{l−1} into S_l, where l ∈ [1, L]. In the multi-head self-attention sub-layer, the computation of each attention head is also identical, and among the H heads of the l-th layer, the h-th head A_{l,h} is computed as follows:

A_{l,h} = softmax( (Q_{l,h} K_{l,h}^⊤) / √d + M ) V_{l,h}

where Q_{l,h}, K_{l,h} and V_{l,h} are linear projections of S_{l−1}, d is the embedding size, and M is the attention masking matrix. Each element in M controls whether a token in the sequence can attend to another. For example, in the unidirectional left-to-right language model (Radford et al., 2018), the lower triangular part of M is set to 0 and the remaining part to −∞, so as to allow each token to attend to past tokens (including itself), but prevent it from attending to future tokens. We call this Left-to-Right Masking. As our model is not limited to the left-to-right explanation generation task, we modify the masking mechanism to accommodate the other two tasks (i.e., context prediction and recommendation). As shown in Fig. 3, the first two tokens u and i in the sequence can attend to each other, because both the context prediction and recommendation tasks need them. To echo our model, we name it PETER Masking.
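The PETER masking can be sketched as a small helper; the function name and the boolean convention (True = allowed to attend) are illustrative, not taken from the released code:

```python
def peter_mask(seq_len):
    """allowed[i][j] is True when position i may attend to position j.

    Standard left-to-right masking (each position attends to itself and
    the past), plus one extra entry letting the user token (position 0)
    attend to the item token (position 1) -- the orange box in Fig. 3.
    """
    allowed = [[j <= i for j in range(seq_len)] for i in range(seq_len)]
    allowed[0][1] = True  # user may attend to item
    return allowed
```

To plug this into an additive-mask implementation, one would map True to 0 and False to −∞ before adding it inside the softmax.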

Explanation and Recommendation
In the following, we perform the three tasks after obtaining the sequence's final representation S_L = [s_{L,1}, · · · , s_{L,|S|}] from Transformer. The key challenge lies in the personalization of the explanation generation task, for which we design the context prediction task. For both tasks, we apply a linear layer to the final representation of each token to map it into a |V|-sized vector. As an example, after passing through this layer, s_{L,t} becomes c_t:

c_t = softmax(W^v s_{L,t} + b^v)

where W^v ∈ R^{|V|×d} and b^v ∈ R^{|V|} are weight parameters. The vector c_t represents the probability distribution over the vocabulary V, from which a word e with probability c_t^e can be sampled.
Explanation Generation: We adopt the Negative Log-Likelihood (NLL) as the explanation task's loss function, and compute the mean over the user-item pairs in the training set:

L_e = (1 / |T|) Σ_{(u,i)∈T} (1 / |E_{u,i}|) Σ_{t=1}^{|E_{u,i}|} − log c_t^{e_t}

where T denotes the training set. The probability c_t^{e_t} is offset by 2 + |F_{u,i}| positions because the explanation is placed at the end of the sequence, and |F_{u,i}| = 0 when the features are unavailable.
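A minimal sketch of this offset NLL, assuming probs[k][w] holds the model's probability that the token at sequence position k + 1 is word w (the usual shift-by-one language modeling setup); the function name and the list-based representation are hypothetical:

```python
import math

def explanation_nll(probs, explanation_ids, n_features):
    """Mean NLL of the explanation words for one user-item pair.

    Positions 0 and 1 hold the user and item, positions 2 .. 1+n_features
    hold the features, so the first explanation word sits at position
    2 + n_features and its predictor is the position just before it.
    """
    start = 1 + n_features  # predictor position of the first word
    return -sum(math.log(probs[start + t][w])
                for t, w in enumerate(explanation_ids)) / len(explanation_ids)
```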
At the testing stage, along with u, i, and F_{u,i} (if available), we feed the model a special begin-of-sequence token <bos>. From its resulting probability distribution c_{<bos>}, the model can predict a word. For simplicity, among the many decoding methods, we opt for greedy decoding, which picks the word with the largest probability. Then we concatenate this predicted word to the end of the sequence to form a new input sequence for generating the next word. We do this repeatedly until the model produces a special end-of-sequence token <eos>, or the generated explanation Ê_{u,i} reaches a pre-defined length.
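The decoding loop can be sketched generically; step_fn here is a stand-in for a forward pass of the model over the current sequence (in PETER, over [u, i, f_1, ..., prefix]), not the paper's actual API:

```python
def greedy_decode(step_fn, bos_id, eos_id, max_len):
    """Greedy left-to-right decoding sketch.

    step_fn(prefix) returns a list of vocabulary scores for the next
    token. We repeatedly append the argmax until <eos> is produced or
    the pre-defined length is reached.
    """
    seq = [bos_id]
    for _ in range(max_len):
        scores = step_fn(seq)
        next_id = max(range(len(scores)), key=scores.__getitem__)
        seq.append(next_id)
        if next_id == eos_id:
            break
    return seq
```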
Context Prediction: As discussed earlier, when there is only the explanation generation task, Transformer fails to make use of the user ID and item ID, resulting in identical sentences. To address this issue, we design this task to map the IDs onto the words in the explanation, so as to build a connection between them. Since the first two positions (u and i) of the sequence are allowed to attend to each other, both of their final representations absorb the information of the user and the item. Thus, we can use either of them to perform this task. Here, we use the 2nd one for better illustration in Fig. 2. Again, we adopt NLL as the loss function:

L_c = (1 / |T|) Σ_{(u,i)∈T} (1 / |E_{u,i}|) Σ_{t=1}^{|E_{u,i}|} − log c_2^{e_t}

where the difference from the explanation loss L_e is that all predicted words are taken from the 2nd position, which is why they are not sequentially ordered (see Fig. 2).
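A corresponding sketch of the context prediction loss, where every explanation word is scored against the single distribution produced at the 2nd position, so word order plays no role (names are illustrative):

```python
import math

def context_nll(probs_pos2, explanation_ids):
    """NLL of all explanation words under the one distribution emitted
    at the 2nd (item) position -- a bag-of-words objective, unlike the
    sequential explanation loss."""
    return -sum(math.log(probs_pos2[w])
                for w in explanation_ids) / len(explanation_ids)
```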
Rating Prediction: Recommendation can be seen as a prediction problem, where the goal is to predict a score r̂_{u,i} based on the IDs of user u and item i. As both u and i in the sequence can attend to each other, their final representations capture the interaction between them. Next, we map the 1st representation s_{L,1} into a scalar (because the 2nd one is used for context prediction). To this end, we employ a multi-layer perceptron (MLP) with one hidden layer:

r̂_{u,i} = w^r σ(W^r s_{L,1} + b^r_1) + b^r_2

where W^r ∈ R^{d×d}, b^r_1 ∈ R^d, w^r ∈ R^{1×d} and b^r_2 ∈ R are weight parameters, and σ(·) is the sigmoid function. It is thus feasible to do both recommendation and explanation on Transformer. As recommendation is not the key focus of this paper, we leave its improvement to future work. For this task, we use the Mean Square Error (MSE) as the loss function:

L_r = (1 / |T|) Σ_{(u,i)∈T} (r_{u,i} − r̂_{u,i})^2

where r_{u,i} is the ground-truth rating.
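The rating head can be sketched as a one-hidden-layer MLP over plain lists; the parameter names mirror the formula above, but the helper itself is a hypothetical illustration rather than the released implementation:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def rating_head(s_first, W_r, b_hidden, w_out, b_out):
    """r = w_out . sigmoid(W_r s + b_hidden) + b_out.

    s_first is the final representation of the 1st token, s_{L,1};
    weights are nested lists purely for illustration.
    """
    hidden = [sigmoid(sum(wij * sj for wij, sj in zip(row, s_first)) + bi)
              for row, bi in zip(W_r, b_hidden)]
    return sum(wi * hi for wi, hi in zip(w_out, hidden)) + b_out
```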
Multi-task Learning: At last, we integrate the three tasks into a multi-task learning framework whose objective function is defined as:

J = min_Θ (λ_e L_e + λ_c L_c + λ_r L_r)

where Θ denotes all the trainable parameters in the model, and λ_e, λ_c and λ_r are regularization weights that balance the learning of the different tasks. In this way, the model can be trained efficiently in an end-to-end manner.
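The combined objective is a simple weighted sum; the default weights below follow the values reported in the implementation details (1.0, 1.0, 0.1), and the function itself is only a sketch:

```python
def multitask_loss(l_e, l_c, l_r, lam_e=1.0, lam_c=1.0, lam_r=0.1):
    """Weighted sum of explanation, context prediction and rating losses,
    minimized jointly over all trainable parameters."""
    return lam_e * l_e + lam_c * l_c + lam_r * l_r
```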

Datasets
For experimentation, we adopt three publicly available explainable recommendation datasets and their data splits (Li et al., 2020c). During the splitting process, each dataset is randomly divided into training, validation and testing sets in the ratio 8:1:1, five times, and the training set holds at least one record for each user and each item. The three datasets are respectively from TripAdvisor (hotel), Amazon (movies & TV) and Yelp (restaurant). Each record in the datasets is comprised of a user ID, an item ID, a rating, an explanation, and a feature. The explanations are sentences extracted from user reviews. Each explanation contains at least one item feature, e.g., bedroom, which ensures the explanation quality. Statistics of the datasets are shown in Table 1. We can see that Yelp is much larger than the other two, making it closer to the real-world situation where there are millions of users and items.
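A minimal sketch of one such random 8:1:1 split; unlike the actual splits, this version omits the guarantee that every user and item keeps at least one training record:

```python
import random

def split_8_1_1(records, seed=0):
    """One random 8:1:1 train/validation/test split.

    The paper repeats this five times with different seeds and reports
    averaged results; the per-user/per-item coverage constraint is
    deliberately left out of this sketch.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_train = int(0.8 * len(shuffled))
    n_valid = int(0.1 * len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])
```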

Evaluation Metrics
To evaluate the recommendation performance, we adopt two commonly used metrics: Root Mean Square Error (RMSE) and Mean Absolute Error (MAE). As to explanation performance, we measure the generated explanations from two main perspectives: text quality and explainability. For the former, we adopt BLEU (Papineni et al., 2002) from machine translation and ROUGE (Lin, 2004) from text summarization, and report BLEU-1 and BLEU-4, and Precision, Recall and F1 of ROUGE-1 and ROUGE-2. Though widely used, BLEU and ROUGE are not flawless. For example, it is difficult for them to detect the problem of identical sentences generated by Transformer. Such identical sentences make poor explanations, because they are unlikely to explain well the special properties of different recommendations. To quantitatively measure how severe this problem is, we adopt USR, which computes the Unique Sentence Ratio of the generated sentences (Li et al., 2020c).
Text quality, however, is not equal to explainability. In the case of explainable recommendation, users may place more value on an explanation that justifies a recommendation's advantages on certain features (Li et al., 2020c; Chen et al., 2019a). To this end, we adopt three further metrics proposed by Li et al. (2020c): Feature Matching Ratio (FMR), Feature Coverage Ratio (FCR) and Feature Diversity (DIV). FMR measures whether a generated explanation contains the feature in the ground-truth. FCR is computed as the number of distinct features contained in all the generated explanations, divided by the total number of features in the whole dataset. DIV measures the intersection of features between any two generated explanations.
For RMSE, MAE and DIV, lower is better; for the remaining metrics, higher is better.
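The USR, FMR and FCR metrics are straightforward to compute; the sketches below assume explanations are plain strings and features are substrings, which is a simplification of the original tokenized matching:

```python
def usr(sentences):
    """Unique Sentence Ratio: distinct generated sentences / total."""
    return len(set(sentences)) / len(sentences)

def fmr(explanations, gt_features):
    """Feature Matching Ratio: fraction of explanations that mention
    the ground-truth feature of their user-item pair."""
    hits = sum(f in e for e, f in zip(explanations, gt_features))
    return hits / len(explanations)

def fcr(explanations, all_features):
    """Feature Coverage Ratio: distinct features mentioned in any
    generated explanation, over the dataset's total feature count."""
    covered = {f for f in all_features if any(f in e for e in explanations)}
    return len(covered) / len(all_features)
```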

Compared Methods
We introduce baselines, first for explanation and then for recommendation. For the former, we divide the baselines into two groups, depending on whether the feature is used or not.
The following models leverage only user and item IDs to generate explanations (without feature). We denote our model without feature as PETER.
• Transformer (Vaswani et al., 2017) performs the explanation generation task by treating user and item IDs as words. We also tested an encoder-decoder Transformer, where the encoder encodes the IDs for the decoder to decode, but its results turned out to be the same, so we do not report them.
• NRT can predict a rating and generate a tip simultaneously based on user and item IDs. We take the explanations in the datasets as tips. Moreover, we found that the model's problem of generating identical sentences (as reported by Li et al., 2020c) is caused by the L2 regularization in its original design. For a fair comparison, we removed it.
• Att2Seq (Dong et al., 2017) is a review generation approach, and we take the explanations as reviews. This model has an attention module, but we found that it makes the generated content unreadable in this task. To be fair, we removed it as well.
When features are used, we denote our model as PETER+, and compare it with two recent models: • ACMLM (Ni et al., 2019) is a fine-tuned BERT (Devlin et al., 2019), where an attention layer is introduced to encode the features from both the user and the item. By predicting masked tokens, this model can produce diverse sentences.
• NETE (Li et al., 2020c) is a tailored GRU (Cho et al., 2014) that incorporates a given feature into the decoding process to generate template-like explanations. It can also make recommendations.
For recommendation, besides NRT and NETE, we include another two traditional methods:
• PMF (Mnih and Salakhutdinov, 2007) is a standard probabilistic matrix factorization method that characterizes users and items by latent factors.
• SVD++ (Koren, 2008) extends latent factor models by additionally leveraging a user's implicit feedback towards items.

Implementation Details
We train each model on the training set, tune the hyper-parameters on the validation set, and report the performance on the testing set. The results are averaged over the 5 data splits. We adopt the released code of ACMLM and NETE, and implement all the other methods. For NRT, Att2Seq, NETE and our PETER and PETER+, we set the size of the vocabulary to 20,000 by keeping the most frequent words. We do not apply this to Transformer, otherwise users and items (regarded as words) might be filtered out. We set both the number of context words and the length of explanations to 15, because the mean length of explanations is approximately 13 (see Table 1). ACMLM adopts sub-words, so we do not apply the above two steps to it. We reuse the other default settings of the baselines. For Transformer, PETER and PETER+, we set the embedding size d to 512 and the dimension of the feed-forward network to 2,048, following (Vaswani et al., 2017), but the number of layers L and attention heads H are both 2. For our models PETER and PETER+, we set the regularization weights λ_e, λ_c and λ_r to 1.0, 1.0 and 0.1, respectively. We optimize the model via stochastic gradient descent (Robbins and Monro, 1951), and apply gradient clipping (Pascanu et al., 2013).

[Table 4: Case study on TripAdvisor, showing for each user-item pair the ground-truth explanation, the context words predicted by PETER/PETER+, and the generated explanations. For example: ground-truth "the rooms are spacious and the bathroom has a large tub", PETER "the pool area is nice and the gym is very well equipped", PETER+ "the rooms were clean and comfortable"; ground-truth "beautiful lobby and nice bar", PETER "the bathroom was large and the shower was great", PETER+ "the lobby was very nice and the rooms were very comfortable". The predicted context words are mostly common function words (the, and, a, was, ...) mixed with item features.]

Results and Analysis

Quantitative Analysis on Explanations
In Table 2, we compare the performance of explanation generation methods in two groups. We first analyze the models that make use of item features (i.e., ACMLM, NETE and PETER+). Our PETER+ consistently and significantly outperforms ACMLM and NETE on the three datasets in terms of text quality (BLEU and ROUGE). This shows the effectiveness of our model in generating high-quality sentences. Notice that Li et al. (2020b) conducted a user survey and reported that NETE's explanations were perceived as useful by most participants. This suggests that our model's explanations, with their better quality, could also be very useful to real users. Again, in terms of text quality, the performance gap between PETER+ and ACMLM (a fine-tuned BERT) is extremely large, because the latter's generation is achieved by predicting masked tokens, which is quite different from word-by-word generation. This may explain why ACMLM produces diverse sentences (high USR), which, however, is less meaningful when text quality cannot be guaranteed. Furthermore, PETER+ beats both ACMLM and NETE on the explainability metric FMR, which cares about whether a generated explanation mentions the feature in the ground-truth. This is quite useful in real-world applications where the system is asked to explain a particular feature. Regarding the other two explainability metrics FCR and DIV, PETER+ is also very competitive. ACMLM gains better performance in some cases, because at the training stage it is exposed to more features (from both the user and the item), which is unfair to both PETER+ and NETE.
Next, we discuss the results of the models that only leverage user and item IDs for generation. As can be seen, Transformer generates identical explanations on each dataset, resulting in a nearly 0 Unique Sentence Ratio (USR). Owing to the context prediction task, our PETER successfully addresses this issue, producing diverse (comparable USR) and high-quality (best BLEU-4) sentences. In particular, on the largest dataset, Yelp, it achieves the best performance on most of the metrics. This again demonstrates the effectiveness of our model. On Amazon and TripAdvisor, NRT and Att2Seq are very competitive, because we fixed their generation issues (see Section 5.3).
In addition, these two datasets are small and thus the training samples are limited, so our model may underfit, which is why it does not always reach the best performance. Besides explanation performance, we also investigate the efficiency of different Transformer-based models. On the same machine (an NVIDIA Tesla P40) and dataset (TripAdvisor), we compare the training minutes of ACMLM and our PETER+ in Table 3. Compared with ACMLM, our model takes less time to train (2.3 minutes per epoch), since it has only 2 layers and thus fewer parameters. But because it is unpretrained and learned from scratch, it needs more training epochs.

Qualitative Case Study on Explanations
The examples in Table 4 confirm that the context prediction task endows the IDs with linguistic meanings, as well as achieving a certain degree of personalization for natural language generation. Among the commonly used context words, e.g., the, there are some important features (underlined), according to which the model then generates an explanation that talks about them. Admittedly, there is still much room for improving the context prediction task, so as to more accurately predict the features in the ground-truth (e.g., rooms vs. pool in the first example). One alternative is to leverage the features to guide the model's generation. This explains why PETER+ is able to generate an explanation that talks about rooms rather than pool, making it semantically closer to the ground-truth. It thus demonstrates our model's flexibility in incorporating these features.

Recommendation Performance
Table 5 presents the performance comparison of different recommendation methods. On the largest dataset, Yelp, with approximately 1.3 million records, our model PETER performs as well as the three competitive baselines (i.e., SVD++, NRT and NETE), which shows the soundness of our recommendation module. Since our model PETER has more parameters to learn, it may underfit on small datasets. This explains why it does not always perform the best on TripAdvisor and Amazon. When more training data are available to Transformer, the performance usually improves, as evidenced by GPT-2 (Radford et al., 2019) and GPT-3 (Brown et al., 2020). Thus, we can expect our model to perform well in real-world applications, where the training data are far larger than these testing datasets, e.g., the billion-scale user base of platforms such as Amazon.

Ablation Study
In Table 6, we provide an ablation study conducted on the TripAdvisor dataset. After disabling the context prediction task L_c by setting λ_c = 0, both explainability and text quality drop dramatically, and the unique sentence ratio (USR) nearly approaches Transformer's (see Table 2). This confirms the task's effectiveness.
As L_c is highly correlated with the recommendation task L_r via the user and item IDs, the removal of L_c leads to a slight improvement in recommendation performance. We observe the reverse phenomenon when we disable L_r. When PETER masking is replaced by the Left-to-Right masking that prevents the model from accessing the item information, the recommendation performance drops sharply. Overall, PETER reaches an optimal trade-off, where its explainability, text quality and recommendation performance are all reasonably good.

Conclusion
We propose a simple and effective solution to address the personalized generation problem of Transformer, unleashing its language modeling power to generate explanations for recommender systems. Extensive experiments show that the solution is both effective and efficient. It opens up a new way of exploiting Transformer by designing good tasks instead of scaling up model size. There are various applications of personalized generation for which Transformer is still less explored. Our next step is to adopt our solution for personalized question answering systems and personalized conversational agents. We also plan to incorporate item images into the model, so as to generate visual explanations for recommendations, since "a picture is worth a thousand words". Another meaningful extension is to adapt the model to cross-lingual explanation generation, because international platforms, e.g., Amazon, may serve users who speak different languages.