Pre-trained Personalized Review Summarization with Effective Salience Estimation

Personalized review summarization in recommender systems is a challenging task of generating condensed summaries for product reviews while preserving the salient content of reviews. Recently, Pretrained Language Models (PLMs) have become a new paradigm in text generation owing to their strong natural language comprehension ability. However, it is nontrivial to apply PLMs to personalized review summarization directly, since there is rich personalized information (e.g., user preferences and product characteristics) to be considered, which is crucial to the salience estimation of the input review. In this paper, we propose a pre-trained personalized review summarization method, which aims to effectively incorporate the personalized information of users and products into the salience estimation of the input reviews. We design a personalized encoder that identifies the salient contents of the input sequence by jointly considering the semantic and personalized information (i.e., ratings, user and product IDs, and linguistic features), yielding personalized representations for the input reviews and history summaries separately. Moreover, we design an interactive information selection mechanism that further identifies the salient contents of the input reviews and selects relevant information from the history summaries. The results on real-world datasets show that our method performs better than the state-of-the-art baselines and could generate


Introduction
Personalized review summarization aims to generate brief summaries for product reviews while preserving the main contents of the input reviews. It helps users gain a full insight into products quickly and make accurate purchase decisions; hence it is an important task in recommender systems and has attracted increasing attention recently (Ganesan et al., 2010; Di Fabbrizio et al., 2014; Xiong and Litman, 2014; Gerani et al., 2014; Li et al., 2017; Chan et al., 2020). Different from the traditional summary generation task, reviews are generally coupled with a lot of essential information about users and products. Hence, some recent works propose to generate personalized summaries for reviews by considering user preferences and product characteristics (Li et al., 2017; Yang et al., 2018; Li et al., 2019c,a,b; Chan et al., 2020; Xu et al., 2021). For example, Li et al. (2019a) propose a user-aware encoder-decoder framework that considers the user preferences in the encoder to select important information and incorporates the preferences and writing style of users into the decoder to generate personalized summaries.
Figure 1: An example of a product review and the corresponding summary. Different history summaries mention different significant features of the given review and we mark them in different colors.
Recently, pre-trained language models (PLMs) have achieved notable improvements in various text generation tasks including abstractive summarization (Amplayo et al., 2021; Li et al., 2022). However, the exploration of the pre-training paradigm in review summarization is quite preliminary, since applying PLMs to review summarization directly is nontrivial. The existing PLMs ignore the personalized information about users and products, which is crucial for generating personalized summaries of product reviews. First, the salience of the input reviews depends not only on the semantic information but also on the personalized information. For example, different users have different preferences towards product characteristics: in Figure 1, the user of history summary 3) is interested in "quality" while the user of history summary 1) focuses more on "price". Therefore, it is necessary to consider user preferences and product characteristics when calculating the salience of the contents of the input review. Second, history summaries of users and products convey rich text descriptions, which can not only be used to strengthen salient information identification for the input reviews but also be fed into the generation process as additional input. For example, some important aspects of products (e.g., "price", "size", etc.) are usually mentioned by different users; hence history summaries might contribute to the salience calculation of the input reviews.
Therefore, it is essential to fine-tune PLMs effectively so that the model can generate personalized summaries by jointly considering the semantic information of input reviews and the essential attributes of users and products. However, the existing PLMs focus on the text content, and it is challenging to integrate the various kinds of auxiliary information into PLMs selectively and effectively. In this paper, we propose an encoder-decoder Pre-trained Personalized Review Summarization method with effective salience estimation of the rich input information, named PPRS. Specifically, we design two kinds of mechanisms that leverage the user and product information to identify the salient contents of the input reviews and history summaries, so that the model can focus on the content most relevant to the current summary generation.
First, we propose a personalized encoder that learns representations for the input reviews and each history summary separately. Considering that the user and product IDs indicate their intrinsic characteristics, the personalized encoder aligns each word with the corresponding user, product, and rating. In this way, the salience of words in the input reviews is influenced by both semantic and personalized information, yielding personalized representations. Furthermore, we observe that linguistic features are generally associated with user opinions and product characteristics: users utilize adjectives to express their sentiment (e.g., "comfortable", "good" in Figure 1), and aspects of products are usually nouns (e.g., "price", "size" in Figure 1). Therefore, we also aggregate the part-of-speech feature into the personalized encoder to identify the salient content of the input reviews more accurately.
Second, we propose an interactive information selection mechanism that interactively models the input reviews and history summaries to learn more comprehensive representations. On the one hand, considering that history summaries are usually noisy and redundant, we select the relevant information from history summaries by calculating the semantic relatedness between history summaries and the input reviews. On the other hand, we learn the history summaries-aware salience of the input reviews by calculating the semantic similarity between words of the input reviews and history summaries. Finally, we combine the input reviews and history summaries as the input of the decoder to generate coherent and personalized summaries.
Our main contributions are threefold: (1) we propose a PLM-based personalized review summarization method that conducts salience estimation by jointly considering the user and product information; (2) we design two mechanisms to incorporate the personalized information into the generation process, i.e., the personalized encoder and an interactive information selection module; (3) we conduct extensive experiments, and the results show that our method outperforms competitive baselines.

Proposed Method
In this paper, we build our review summarization model on the Transformer (Vaswani et al., 2017) encoder-decoder architecture initialized with T5 (Raffel et al., 2020). In this section, we first introduce the problem formulation and then describe our method from two aspects: the personalized encoder module as shown in Figure 2 and the decoder with interactive information selection as shown in Figure 3.

Problem Formulation
Given a review $X$ and the corresponding personalized information $A = \{u, v, r, S\}$, our method aims to generate a personalized summary $\hat{Y}$, where $u$ is the user ID, $v$ is the product ID, $r$ is the rating given by $u$ to $v$, and $S$ is the set of history summaries. In particular, the history summary set $S$ contains the summaries of the corresponding user $u$ and product $v$. In this paper, the input review is represented as $X = \{w_1, w_2, \dots, w_L\}$, where $w_i$ is the $i$-th word and $L$ is the number of words. Besides, the generated and reference summaries are denoted as $\hat{Y} = \{\hat{y}_1, \dots, \hat{y}_T\}$ and $Y = \{y_1, \dots, y_{T'}\}$ respectively, where $T$ and $T'$ are the numbers of words in the generated and reference summaries respectively.
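To make the formulation concrete, one sample can be held in a small record type. This is only an illustrative sketch: the class and field names (`ReviewSample`, `user_history`, `product_history`) are hypothetical and do not come from the paper, and storing S as a user-side list followed by a product-side list is an assumption made for readability.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ReviewSample:
    """One example: review X plus personalized information A = {u, v, r, S}.

    All names are illustrative assumptions, not the authors' code.
    """
    user_id: int                  # u
    product_id: int               # v
    rating: int                   # r, the rating given by u to v
    review: List[str]             # X = {w_1, ..., w_L}
    reference: List[str]          # reference summary
    user_history: List[List[str]] = field(default_factory=list)     # summaries written by u
    product_history: List[List[str]] = field(default_factory=list)  # summaries written about v

    @property
    def history(self) -> List[List[str]]:
        """S: history summaries of the user followed by those of the product."""
        return self.user_history + self.product_history


sample = ReviewSample(
    user_id=7, product_id=42, rating=5,
    review="great price and fits well".split(),
    reference="great price".split(),
    user_history=[["good", "value"]],
    product_history=[["fits", "well"], ["nice", "size"]],
)
```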

Personalized Encoder
In this section, we introduce the personalized encoder, which learns a comprehensive representation by jointly considering the semantic information and various attributes of the corresponding user and product. As shown in Figure 2, the input review and each history summary are encoded separately; hence we take the input review X as an example to introduce the personalized encoder.
In contrast to traditional summarization, review summarization needs to consider user preferences, writing style, and product characteristics in order to select the salient content of the input reviews accurately for different users and products. Besides, the rating reflects the sentiment tendency of users toward the current product, and hence can be utilized to identify the useful content of input reviews. Therefore, we propose to align each word with the rating, user ID, and product ID, aiming to learn a more comprehensive representation of the input review. Additionally, linguistic features are important for identifying the salient content of input texts: adjectives typically reflect users' opinions (e.g., "good", "bad", etc.), and nouns generally reflect product characteristics (e.g., "speed", "price", etc.). Therefore, we propose to incorporate the part-of-speech feature into the encoder by considering the part-of-speech of each word in the embedding layer. Finally, the embedding $e$ of each word is computed as follows:
$$e = e_t + e_p + e_{pos} + e_r + e_u + e_v,$$
where $e_t, e_p \in \mathbb{R}^{d_e}$ and $e_{pos}, e_r, e_u, e_v \in \mathbb{R}^{d_a}$ are the token, position, part-of-speech, rating, user ID, and product ID embeddings respectively. Subsequently, the input review is fed into Transformer encoder layers (Vaswani et al., 2017). Specifically, the encoder consists of stacked identical layers, where each layer has two sub-layers: a self-attention network and a fully connected feed-forward network. The encoder can learn a comprehensive representation of the long input sequence by jointly considering semantic and personalized features during the calculation in each encoder layer.
As a result, our method can select the salient content of the input review based not only on semantic information but also on user and product characteristics. After the personalized encoder, we obtain the input review representations.
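The embedding composition of the personalized encoder can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the lookup tables are randomly initialised stand-ins for learned embeddings, the table sizes and the function name `personalized_embedding` are hypothetical, and ratings are assumed to be integers from 1 to 5. A single shared dimension is used because the element-wise sum requires the six embeddings to have the same size.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumptions). One shared dimension d lets the six embeddings
# be summed element-wise.
d = 8
vocab_size, max_len, n_pos_tags, n_ratings, n_users, n_products = 50, 16, 5, 5, 10, 10

# Randomly initialised lookup tables stand in for learned embeddings.
E_tok = rng.normal(size=(vocab_size, d))
E_position = rng.normal(size=(max_len, d))
E_pos_tag = rng.normal(size=(n_pos_tags, d))
E_rating = rng.normal(size=(n_ratings, d))
E_user = rng.normal(size=(n_users, d))
E_product = rng.normal(size=(n_products, d))


def personalized_embedding(token_ids, pos_tag_ids, rating, user_id, product_id):
    """Return an (L, d) matrix with e = e_t + e_p + e_pos + e_r + e_u + e_v per word."""
    token_ids = np.asarray(token_ids)
    pos_tag_ids = np.asarray(pos_tag_ids)
    positions = np.arange(len(token_ids))
    # Word-level terms differ from word to word.
    word_level = E_tok[token_ids] + E_position[positions] + E_pos_tag[pos_tag_ids]
    # Rating, user and product vectors are shared by every word of the review,
    # so broadcasting adds the same (d,) vector to each row.
    review_level = E_rating[rating - 1] + E_user[user_id] + E_product[product_id]
    return word_level + review_level


emb = personalized_embedding(
    token_ids=[3, 17, 9], pos_tag_ids=[0, 2, 1], rating=5, user_id=4, product_id=7
)
print(emb.shape)  # (3, 8)
```

Because the attribute vectors are added to every word before the first encoder layer, self-attention sees user, product, and rating information at each position, which is the word-level alignment described above.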

Interactive Information Selection
Based on the learned representations of the input reviews and history summaries, in this section, we propose an interactive information selection module to interactively model the input reviews and history summaries. As shown in Figure 3, this module intends to further identify the salient content of the input review with respect to the history summaries, while selecting the information in the history summaries that is relevant to the current summary generation.
Intuitively, some content of history summaries is less relevant to the main point of the input review. For example, different users focus on different aspects of the current product; hence different history summaries of a product have different relevance to the current summary generation. Therefore, we design a relevance attention mechanism that utilizes the input review as the query to select the relevant content from history summaries, where $W^Q_X$, $W^K_S$, and $W^V_S$ are the learnable parameters of this attention. History summaries that are semantically more similar to the input review thus receive more attention, and the attended result is treated as an auxiliary feature to strengthen summary generation.
For the input review, the personalized encoder has captured the internal salience of the review by modeling the relatedness between its words via the self-attention mechanism. Beyond that, history summaries contain rich descriptions of user and product characteristics, which convey more semantic information than the user and product IDs. More specifically, history summaries of users reflect users' writing styles and purchasing preferences, and history summaries of products describe the main aspects that users are interested in. Therefore, we design another salience attention mechanism to capture the history summaries-aware salience of the input review, where S is the concatenation of all history summaries and $W^Q_S$, $W^K_X$, and $W^V_X$ are learnable parameters. In this way, our method can identify the salient content of the input review more effectively.
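The two attention mechanisms can be illustrated with one generic cross-attention routine applied in both directions. The exact formulas are not reproduced in this text, so the sketch below assumes standard single-head scaled dot-product attention (Vaswani et al., 2017). The names `H_X`, `H_S`, `cross_attention`, and `new_proj` are hypothetical, and the choice of which sequence supplies queries versus keys and values is inferred from the subscripts of the parameter names.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(query_states, memory_states, W_q, W_k, W_v):
    """Single-head scaled dot-product attention from one sequence to another."""
    Q = query_states @ W_q    # (n_query, d)
    K = memory_states @ W_k   # (n_memory, d)
    V = memory_states @ W_v   # (n_memory, d)
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))  # each row sums to 1
    return weights @ V, weights


rng = np.random.default_rng(0)
d = 8
H_X = rng.normal(size=(6, d))   # encoded input review, 6 words (toy values)
H_S = rng.normal(size=(10, d))  # encoded history summaries, concatenated, 10 words


def new_proj():
    """Random stand-in for a learnable d x d projection matrix."""
    return rng.normal(size=(d, d)) / np.sqrt(d)


# Relevance attention: the review queries the history summaries
# (parameters W^Q_X, W^K_S, W^V_S in the text).
relevant_history, rel_weights = cross_attention(H_X, H_S, new_proj(), new_proj(), new_proj())

# Salience attention: the history summaries query the review
# (parameters W^Q_S, W^K_X, W^V_X in the text).
salient_review, sal_weights = cross_attention(H_S, H_X, new_proj(), new_proj(), new_proj())

print(rel_weights.shape, sal_weights.shape)  # (6, 10) (10, 6)
```

Each row of `rel_weights` shows how strongly one review word attends to each history word, so history content unrelated to the review receives little weight.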
Finally, the concatenation of the input review and history summaries is fed into the pre-trained Transformer decoder, which generates the target summary Ŷ word by word. The decoder also consists of stacked identical layers; besides the two sub-layers of the encoder layer, each decoder layer has an additional encoder-decoder attention sub-layer that aligns the generation states with the input sequences.

Model Training
For the summary generation task, we use the negative log-likelihood (NLLLoss) as the loss function to train the model:
$$\mathcal{L}(\phi) = -\sum_{t=1}^{T} \log P(\hat{y}_t),$$
where $T$ is the length of the generated review summary, $P(\hat{y}_t)$ is the probability distribution of the $t$-th word, and $\phi$ denotes the model parameters.
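As a worked example of the negative log-likelihood objective, the following sketch sums the negative log-probabilities that a toy decoder assigns to the target words. The function name `nll_loss` and the numbers are illustrative assumptions; a real implementation would operate on batched logits from the model.

```python
import math


def nll_loss(step_distributions, target_ids):
    """Sum of -log P(target word) over the summary positions."""
    return -sum(math.log(dist[t]) for dist, t in zip(step_distributions, target_ids))


# Toy decoder outputs: one probability distribution over a 3-word vocabulary
# per generated position (T = 2).
probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
]
targets = [0, 1]  # vocabulary indices of the target words

loss = nll_loss(probs, targets)  # -(log 0.7 + log 0.8), about 0.58
```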

Datasets and Experimental Settings

Datasets
In this section, we introduce the dataset statistics and hyperparameter settings of the experiments. To validate the effectiveness of our method, we conduct extensive experiments on three real-world datasets from Amazon: Movies and TV, Sports and Outdoors, and Home and Kitchen. Each sample of the dataset contains the user ID, product ID, rating, review, and summary text. Following previous work (Ma et al., 2018), we randomly select 1000 samples each for the test and validation sets and treat the other samples in the dataset as the training set. In this paper, we only keep the reviews given by active users to popular products, where each user and each product has at least K history reviews, with K = 5 for the Sports dataset, K = 10 for the Home dataset, and K = 20 for the Movie dataset. In the experiments, we utilize M = 20 history summaries. For reviews that have more than M history summaries, we select the top-M history summaries that share the most words with the input review. The maximum lengths of reviews and summaries are set to L = 200 and T = 15 respectively. The dataset statistics are listed in Table 1.
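The top-M selection of history summaries can be sketched as a simple ranking by word overlap. The paper does not specify tokenization, whether repeated words count more than once, or how ties are broken, so the choices below (whitespace tokens, distinct-word overlap, ties kept in original order) and the function name `select_top_m` are assumptions.

```python
def select_top_m(review_tokens, history_summaries, m=20):
    """Keep the m history summaries sharing the most distinct words with the review."""
    review_vocab = set(review_tokens)

    def overlap(summary):
        return len(review_vocab & set(summary))

    # sorted() is stable, so summaries with equal overlap keep their original order.
    return sorted(history_summaries, key=overlap, reverse=True)[:m]


review = "the price is great and the size fits".split()
history = [
    ["nice", "color"],
    ["great", "price"],
    ["size", "fits", "well"],
    ["fast", "shipping"],
]
top = select_top_m(review, history, m=2)
print(top)  # [['great', 'price'], ['size', 'fits', 'well']]
```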

Baselines
In this section, we compare our method with several state-of-the-art review summarization methods.
(1) the methods without user and product
(3) the methods based on the pre-trained language model: we compare our method with the original T5 (Raffel et al., 2020) and with T5-FT, which fine-tunes T5 on the recommendation datasets by generating summaries from product reviews.

Implementation Details
The hyper-parameters of our model are tuned on the validation set. The dimension of the hidden state d_e and the attribute embedding (e.g., user ID) size d_a are both set to 512. We use t5-small (https://huggingface.co/t5-small) to initialize the encoder and decoder parameters. We utilize the AdamW (Loshchilov and Hutter, 2019) algorithm to optimize our model with a learning rate of 0.0004 and a batch size of 32. For baselines, we use the open-source code for HSSC, Dual-view, memAttr, and Transformer, and we implement S2S-att, PGN, USN, and TRNS ourselves, keeping the same settings as the original papers. As for metrics, we use the widely used ROUGE (Lin, 2004) (https://github.com/chakki-works/sumeval) to evaluate the performance of our model on summary generation, including ROUGE-1, ROUGE-2, and ROUGE-L. Finally, the experiment platform is a GeForce GTX 1080Ti with 128GB memory; we independently repeat each experiment 5 times and present the average performance.
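For readers unfamiliar with ROUGE-N, the following sketch computes n-gram overlap precision, recall, and F1 between a generated and a reference summary. It is a simplified illustration and not the sumeval implementation: it assumes whitespace tokens and applies no stemming or stop-word handling, and the function name `rouge_n` is hypothetical.

```python
from collections import Counter


def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, F1) of n-gram overlap between two token lists."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


generated = "great price and size".split()
reference = "great price".split()
p, r, f = rouge_n(generated, reference, n=1)
# 2 of the 4 generated unigrams match, and both reference unigrams are covered:
# precision = 0.5, recall = 1.0, F1 = 2/3
```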

Performance Evaluation
The results are listed in Table 2, from which we have the following observations. First, our method outperforms methods without the user and product information (e.g., Transformer) by a large margin. The main reason is that user and product features are crucial for generating high-quality summaries of product reviews in the recommendation scenario. Among these methods, PLM-based methods (i.e., T5 and T5-FT) achieve better performance than other models. This is because they have a strong text understanding ability obtained from the pre-training process, which helps identify the salient contents of the input review more accurately.
Second, our method achieves better performance than other methods that also leverage personalized information of users and products. This is because our method conducts more effective salience estimation by jointly considering semantic and personalized information in the encoder and in information selection, which further boosts summary generation. It should be noted that methods fusing history texts and discrete attributes (e.g., TRNS, memAttr) perform better than methods based only on discrete attributes (e.g., HSSC, Dual-view). A possible reason is that history texts convey more semantic information about user writing style and product characteristics, which are clues for generating personalized summaries.
Third, our method performs better than the PLM-based baselines (i.e., T5 and T5-FT). T5 ignores the domain knowledge in recommendation, resulting in poor performance, while T5-FT achieves better performance by learning the domain knowledge through fine-tuning on the reviews dataset. However, T5-FT still performs worse than our method since it ignores the user preferences and product characteristics, which play a crucial role in review summarization. Our method incorporates this information into the salience calculation of the input reviews effectively and feeds the relevant history summaries into the decoder, both of which contribute to its improvement. These results indicate that our method fine-tunes the pre-trained language model more effectively.

Ablation Study
To verify the effectiveness of the important components of our method, we conduct an ablation study by removing each component separately. Specifically, (1) "w/o H" denotes removing the history summaries information in the decoder module; (2) "w/o H" denotes removing the salience attention in the interactive information selection module; (3) "w/o S" denotes removing the interactive information selection module; (4) "w/o E" denotes removing the newly introduced embeddings (i.e., rating, user ID, product ID, and part-of-speech) in the personalized encoder module. The results are shown in Table 3, and we have the following observations. Removing any component makes the performance decline. First, without the discrete attributes and linguistic features, our method cannot effectively identify the salient content of the input reviews that reflects user preferences and product characteristics, which loses personalized information in the generation process and leads to worse performance. Second, without the information selection module and history summaries, our method identifies the salient content of input reviews less accurately and loses the important information from the history summaries, hurting the performance on summary generation. In addition, all variants outperform T5-FT, which directly fine-tunes T5 on the review dataset without considering the personalized features. In all, these results validate the effectiveness of these important components.

Discussion
In this section, we conduct experiments to analyze the influence of different strategies which incorporate user preferences and product characteristics into the summary generation process.
Firstly, we design two variants to explore the effectiveness of different mechanisms that utilize the history summaries to expand the capacity of models. (1) "PPRS-CD" encodes the input review and history summaries separately, then feeds the concatenation of the learned semantic representations into the decoder. (2) "PPRS-CE" directly feeds the concatenation of the input review and history summaries into the encoder-decoder framework to produce summaries. The results are listed in Figure 4. We can see that "PPRS-CD" performs worse than "PPRS" after replacing the interactive information selection with a concatenation operation. This is because history summaries generally contain some irrelevant content, which might confuse the decoder about the input text and hurt the quality of the generated summaries. "PPRS-CE" also achieves worse performance than "PPRS" after conducting information fusion in the encoder module. A possible reason is that the encoder cannot distinguish the input reviews from the history summaries and thus fails to learn accurate semantic representations for each of them, resulting in the performance decline.
Secondly, we design two variants to explore different strategies for utilizing the discrete personalized information, i.e., rating, user ID, and product ID. (1) "PPRS-AG" utilizes the rating, user ID, and product ID embeddings as a gate to select the relevant information from the text (i.e., input review and history summaries) representations after the encoder. (2) "PPRS-AW" treats the rating, user ID, and product ID as special words and adds their embeddings to the beginning of the reviews and history summaries. The results are listed in Figure 5.
We can observe that the performance of summary generation declines when the gate mechanism is applied to conduct information selection in "PPRS-AG". This is because "PPRS-AG" cannot conduct salience estimation well without deep interaction between the user ID and the input text, so the decoder cannot generate accurate summaries. Besides, "PPRS-AW" performs worst among the compared methods. The main reason is that it cannot effectively identify the salient words in terms of user preferences and product characteristics in the encoder module, which prevents "PPRS-AW" from learning comprehensive representations of the input review and history summaries. In contrast, our method aligns each word with these discrete attributes in the embedding layer, which incorporates the essential characteristics of users and products into the text encoding process effectively and further boosts the generation process.

Human Evaluation
In this section, we perform a human evaluation to further evaluate the performance of our model. Specifically, we define three metrics: (1) Informativeness evaluates whether the generated summaries convey the main content of the input review. (2) Accuracy evaluates whether the generated summaries are consistent with the sentiment tendency reflected in the input review. (3) Readability evaluates whether the generated summaries are grammatically correct and easy to understand. It is difficult to develop automatic evaluation methods for these metrics. Hence, we randomly sample 100 cases and invite 5 human volunteers to read and rate all generated summaries, where 1 means "very bad" and 5 means "very good". The scores are averaged across all volunteers and cases. The results are listed in Table 4, and we have the following observations. First, our method outperforms other methods on Informativeness and Accuracy. This is because our method incorporates user preferences and product characteristics into the salience calculation of input reviews more effectively, which helps capture the main content and keep the sentiment consistent with the input review. Second, the summaries generated by our method are more readable than those of others (e.g., T5-FT). The main reason is that our method not only has rich grammar knowledge and text generation ability obtained from the pre-training process, but also considers user writing style by taking history summaries as auxiliary features. In summary, these results show that our method can generate high-quality summaries.
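The score aggregation described above, averaging over all volunteers and cases for each metric, can be sketched as follows. The ratings are toy numbers, not the paper's data, and the layout `ratings[metric][case][volunteer]` and the function name `average_scores` are assumptions.

```python
from statistics import mean

# ratings[metric][case][volunteer] on the 1 to 5 scale (toy values).
ratings = {
    "Informativeness": [[4, 5, 3], [3, 4, 4]],
    "Accuracy": [[5, 5, 4], [4, 4, 5]],
    "Readability": [[4, 4, 4], [5, 3, 4]],
}


def average_scores(ratings):
    """Average every metric over all cases and all volunteers."""
    return {
        metric: mean(score for case in cases for score in case)
        for metric, cases in ratings.items()
    }


scores = average_scores(ratings)
# Accuracy: (5 + 5 + 4 + 4 + 4 + 5) / 6 = 4.5
```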

Case Study
In this section, we conduct a case study and list several summaries generated by our method in Table 6. We have the following observations. First, the generated summaries preserve the main contents of the input reviews and have the same sentiment tendency. For example, in the first case, the generated summary and the review both mention "training ammo" and convey a positive opinion (i.e., "Recommended") towards the product. Second, the generated summaries are semantically similar to the corresponding reference summaries; for example, they both mention "Doesn't work for recoil" in the second case. Third, the generated summaries reflect user preferences and product characteristics. For example, in the third case, the generated summary contains two main features of the product (i.e., "little" and "great hunting") and indicates that the corresponding user cares more about the "quality". In all, our method can generate coherent and personalized summaries for product reviews.

Related Work

Personalized Review Summarization
Personalized review summarization is an important task in recommender systems, which aims to generate brief summaries for product reviews. Different from previous text summarization (Gehrmann et al., 2018; Li et al., 2020; Zhang et al., 2019), product reviews usually have various personalized information (e.g., rating, user and product IDs, history text, etc.), which plays a crucial role in summary generation (Yang et al., 2018; Dong et al., 2017).
Recently, some approaches (Ganesan et al., 2010; Xiong and Litman, 2014; Carenini et al., 2013; Di Fabbrizio et al., 2014; Liu and Wan, 2019; Li et al., 2019b,c; Chan et al., 2020) have been proposed for review summarization. Some methods incorporate the discrete attributes (e.g., rating, user and product IDs) into salient information selection. Li et al. (2019a) design a selective mechanism that utilizes user embedding to select user-preference words and generates a personalized summary by incorporating user-specific vocabulary. In addition, some methods also leverage aspect information to enhance review summarization (Yang et al., 2018; Tian et al., 2019). However, most of them ignore the joint consideration of the discrete attributes and history text. Therefore, Liu et al. (2019) calculate the semantic similarity between the input review and history reviews to aggregate history summaries into context vectors, which are then utilized to generate summaries. Xu et al. (2021) conduct deep interaction between input reviews and history summaries to infer the important parts among history summaries and generate personalized summaries by reasoning over the user-specific memory.

Pre-trained Language Model
Recently, pre-trained language models (PLMs) have advanced the performance of various NLP tasks, such as sentiment analysis (Yu et al., 2021; Wu and Shi, 2022) and text summarization (Liu and Lapata, 2019; Xiao et al., 2020; Oved and Levy, 2021). Liu and Lapata (2019) propose to conduct summary generation in both extractive and abstractive modeling paradigms by utilizing BERT (Kenton and Toutanova, 2019) as an encoder to learn text representations. Oved and Levy (2021) generate opinion summaries for products by aggregating a set of reviews for the given product and significantly reduce the self-inconsistencies between multiple history reviews. However, these methods might perform poorly in personalized review summarization, since they ignore the rich characteristics of users and products, which are important for generating high-quality summaries of reviews. In this paper, we propose to fine-tune PLMs to conduct more effective salience estimation for input reviews by jointly considering semantic information and personalized features of users and products.

Conclusion
In this paper, we propose a novel review summarization method based on pre-trained language models. The core of our method is fine-tuning the pre-trained language model while considering the user preferences and product characteristics. In particular, we design a personalized encoder to learn representations for the input reviews and each history summary separately by incorporating the user and product characteristics into the encoder module. Additionally, we propose an interactive information selection module to further identify the salient content of the input review and select the relevant information from history summaries. Experimental results show that our method achieves better performance than competitive baselines.

Limitations
Our method has some limitations that we would like to explore in the future. Firstly, our method is based on PLMs, which require large GPU resources for training and inference. We would like to adopt knowledge distillation to reduce the number of model parameters while keeping the performance as much as possible. Secondly, the summary generation process still lacks sufficient controllability even though we incorporate various features of users and products into the salience estimation and the auxiliary inputs of the decoder. In the future, we will explore aggregating the characteristics of users and products into the decoder layers to make the generation process more controllable.

Figure 2: The personalized encoder framework of our method.

Figure 3: The decoder framework of our method. It first conducts interactive information selection for the input review and history summaries, then generates personalized summaries based on the selected content.

Figure 4: Performance of different mechanisms to utilize history summaries to expand the capacity of models.

Figure 5: Performance of different mechanisms to leverage the discrete attributes (i.e., IDs and rating).

Figure 6: Examples of generated and reference reviews.

Table 3: Ablation experiments on the Sports dataset.

Table 4: Human evaluation on the Sports dataset.

References
Yang Liu and Mirella Lapata. 2019. Text summarization with pretrained encoders. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3730-3740.
Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In International Conference on Learning Representations.
Shuming Ma, Xu Sun, Junyang Lin, and Xuancheng Ren. 2018. A hierarchical end-to-end model for jointly improving text summarization and sentiment classification. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pages 4251-4257.