Exploiting Rich Textual User-Product Context for Improving Personalized Sentiment Analysis

User and product information associated with a review is useful for sentiment polarity prediction. Typical approaches incorporating such information focus on modeling users and products as implicitly learned representation vectors. Most do not exploit the potential of historical reviews; those that do either require unnecessary modifications to the model architecture or do not make full use of user-product associations. The contribution of this work is twofold: i) a method to explicitly employ historical reviews belonging to the same user/product when initializing representations, and ii) efficient incorporation of textual associations between users and products via a user-product cross-context module. Experiments on the IMDb, Yelp-2013 and Yelp-2014 English benchmarks with BERT, SpanBERT and Longformer pretrained language models show that our approach substantially outperforms the previous state of the art.


Introduction
It has been repeatedly shown that the user and product information associated with reviews is helpful for sentiment polarity prediction (Tang et al., 2015; Chen et al., 2016; Ma et al., 2017). Just as the same user is expected to have a consistent narrative style and vocabulary, the reviews belonging to the same product are expected to exhibit similar vocabulary for specific terms. Most previous work models user and product identities as representation vectors which are implicitly learned during training and only focuses on the interactions between either the user or the product and the review text (Dou, 2017; Long et al., 2018; Amplayo, 2019; Zhang et al., 2021; Amplayo et al., 2022). This brings with it two major shortcomings: i) the associations between users and products are not fully exploited, and ii) the text of historical reviews is not used.
To tackle the first shortcoming, Amplayo et al. (2018) propose to incorporate similar user and product representations for review sentiment classification.

Figure 1: Our proposed idea of representing users and products with their historical reviews and incorporating the associations between users and products.
However, their approach ignores the associations between users and products. To tackle the second shortcoming, Lyu et al. (2020) propose to explicitly use historical reviews in the training process. However, their approach needs to incrementally store review representations during training, which results in a more complex model architecture, where the magnitude of the user and product matrices is difficult to control when the number of reviews grows very large. As shown in Figure 1, we propose two simple strategies to address the aforementioned issues. Firstly, we use pre-trained language models (PLMs) to pre-compute the representations of all historical reviews belonging to the same user or product. These historical review representations are then used to initialize user (or product) representations by average pooling over all tokens and then average pooling over all reviews. This allows historical review text to inform user and product preferences, which we believe is potentially more advantageous than implicitly learned representations. Time and memory costs are minimized compared to Lyu et al. (2020) since the representations of historical reviews are average pooled and the pre-computation is performed only once.
Secondly, we propose a user-product cross-context module which interacts on four dimensions: user-to-user, product-to-product, user-to-product and product-to-user. The former two are used to obtain similar user (or product) information, which is useful when a user (or product) has limited reviews. The latter two are used to model the product preference of a user (what kind of products do they like and what kind of ratings would they give to similar products?) and the user preference associated with a product (what kinds of users like such products and what kinds of ratings would they give to this product?). We test our approach on three benchmark English datasets: IMDb, Yelp-2013 and Yelp-2014. Our approach yields consistent improvements across several PLMs (BERT, SpanBERT, Longformer) and achieves substantial improvements over the previous state of the art.

Methodology
An overview of our approach is shown in Figure 2. We first feed the review text $D$ into a PLM encoder to obtain its representation $H_D$. $H_D$ is then fed into a user-product cross-context module consisting of multiple attention functions, together with the corresponding user embedding and product embedding. The output is used to obtain the distribution over all sentiment labels. The architecture design is novel in two ways: 1) the user and product embedding matrices are initialized using representations of historical reviews of the corresponding users/products; 2) a user-product cross-context module works in conjunction with 1) to model textual associations between users and products.

Incorporating Textual Information of Historical Reviews
For the purpose of making use of the textual information of historical reviews, we initialize all user and product embedding vectors using the representations of their historical reviews. Specifically, assume that we have a set of users $U = \{u_1, \dots, u_N\}$ and products $P = \{p_1, \dots, p_M\}$.
Each user $u_i$ and product $p_j$ has a corresponding set of historical reviews, e.g. $u_i = \{D^{u_i}_1, \dots, D^{u_i}_{n_i}\}$. For a given user $u_i$, we first feed $D^{u_i}_1$ into the transformer encoder to obtain its representation $H^{u_i}_{D_1} \in \mathbb{R}^{L \times h}$, and average pool over its tokens to obtain a document vector:

$$\bar{H}^{u_i}_{D_1} = \frac{1}{T^{u_i}_{D_1}} \sum_{t=1}^{T^{u_i}_{D_1}} H^{u_i}_{D_1}[t], \qquad (1)$$

where $\bar{H}^{u_i}_{D_1} \in \mathbb{R}^{1 \times h}$, $L$ is the maximum sequence length, $h$ is the hidden size of the transformer encoder, and $T^{u_i}_{D_1}$ is the total number of tokens in $D^{u_i}_1$ excluding special tokens. That is, we sum the representations of all tokens in $D^{u_i}_1$ and average them to obtain the document vector $\bar{H}^{u_i}_{D_1}$. The same procedure is used to generate the document vectors of all documents in $u_i = \{D^{u_i}_1, \dots, D^{u_i}_{n_i}\}$. Finally, we obtain the representation of $u_i$ by:

$$E_{u_i} = \frac{1}{n_i} \sum_{k=1}^{n_i} \bar{H}^{u_i}_{D_k}, \qquad (2)$$

where $E_{u_i} \in \mathbb{R}^{1 \times h}$ is the initial representation of user $u_i$. The same process is applied to generate the representations of all other users as well as all products, giving the user and product embedding matrices $E_U \in \mathbb{R}^{N \times h}$ and $E_P \in \mathbb{R}^{M \times h}$. Moreover, in order to control the magnitude of $E_U$ and $E_P$, we propose a scaling heuristic:

$$E_U \leftarrow \frac{\text{F-NORM}(E)}{\text{F-NORM}(E_U)} \, E_U, \qquad (3)$$

where F-NORM is the Frobenius norm and $E$ is a matrix of the same size as $E_U$ whose elements $E_{i,j}$ are drawn from a normal distribution $\mathcal{N}(0, 1)$. The same process is applied to $E_P$.
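As a concrete illustration of Equations (1)-(3), the following is a minimal PyTorch/Hugging Face sketch of how a user embedding could be pre-computed from that user's historical reviews and how the Frobenius-norm scaling heuristic could be applied. The function and variable names (and the choice of bert-base-uncased) are illustrative assumptions, not the authors' released code.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Minimal sketch (not the authors' code): pre-compute a user embedding from that
# user's historical reviews, then rescale the embedding matrix to a target norm.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

@torch.no_grad()
def review_vector(text: str) -> torch.Tensor:
    """Average-pool token representations of one review, excluding special tokens."""
    enc = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    hidden = encoder(**enc).last_hidden_state[0]               # (L, h)
    special = tokenizer.get_special_tokens_mask(
        enc["input_ids"][0].tolist(), already_has_special_tokens=True
    )
    keep = ~torch.tensor(special, dtype=torch.bool)
    return hidden[keep].mean(dim=0)                            # (h,)

def user_embedding(historical_reviews: list[str]) -> torch.Tensor:
    """Average the document vectors of all historical reviews of one user (Eq. 2)."""
    return torch.stack([review_vector(d) for d in historical_reviews]).mean(dim=0)

def frobenius_rescale(E: torch.Tensor) -> torch.Tensor:
    """Scale E so its Frobenius norm matches that of a same-sized N(0, 1) matrix (Eq. 3)."""
    target = torch.randn_like(E).norm(p="fro")
    return E * (target / E.norm(p="fro"))
```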

User-Product Information Integration
Having enriched user and product representations with historical reviews, we propose a user-product cross-context module for the purpose of garnering sentiment clues from textual associations between users and products. We use MULTI-HEAD ATTENTION (Vaswani et al., 2017) in four attention operations: user-to-user, product-to-product, user-to-product and product-to-user. Specifically, for MULTI-HEAD ATTENTION$(Q, K, V)$, we use the user representation $E_{u_i}$ or product representation $E_{p_j}$ as $Q$, and the user matrix $E_U$ or product matrix $E_P$ as $K$ and $V$. For example, we obtain the user-to-user attention output by:

$$E^{uu}_{u_i} = \mathrm{Attn}_{uu}(E_{u_i}, E_U, E_U). \qquad (4)$$

We follow the same schema to obtain $E^{pp}_{p_j}$, $E^{up}_{u_i}$ and $E^{pu}_{p_j}$. Additionally, we employ two MULTI-HEAD ATTENTION operations between $E_{u_i}$/$E_{p_j}$ (query) and $H_D$ (key and value); the corresponding outputs are $E^D_{u_i}$ and $E^D_{p_j}$. We then combine the output of the user-product cross-context module with $H_{cls}$ to form the final representation. In $\mathrm{Attn}_{uu}$ and $\mathrm{Attn}_{pp}$, we add attention masks to prevent $E_{u_i}$ and $E_{p_j}$ from attending to themselves; thus we also incorporate $E_{u_i}$ and $E_{p_j}$ themselves as their self-attentive representations. The final representation $H_d$ is fed into the classification layer to obtain the sentiment label distribution. During training, we use cross-entropy to calculate the loss between our model predictions and the gold labels.
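For illustration, the following is a minimal PyTorch-style sketch of the cross-context interactions described above. It is not the authors' implementation: a single shared attention module is used for all six operations, the masking that prevents $E_{u_i}$/$E_{p_j}$ from attending to themselves is omitted, and the outputs are assumed to be fused with $H_{cls}$ by simple concatenation.

```python
import torch
import torch.nn as nn

class CrossContextModule(nn.Module):
    """Sketch of the user-product cross-context interactions (fusion by
    concatenation and a single shared attention module are assumptions)."""
    def __init__(self, hidden: int, n_heads: int = 8, n_labels: int = 5):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, n_heads, batch_first=True)
        self.classifier = nn.Linear(hidden * 7, n_labels)

    def forward(self, h_cls, h_doc, e_u, e_p, E_U, E_P):
        # h_cls: (B, h)  h_doc: (B, L, h)  e_u, e_p: (B, h)
        # E_U: (N, h) user matrix, E_P: (M, h) product matrix
        q_u, q_p = e_u.unsqueeze(1), e_p.unsqueeze(1)
        users = E_U.unsqueeze(0).expand(e_u.size(0), -1, -1)
        prods = E_P.unsqueeze(0).expand(e_p.size(0), -1, -1)

        uu, _ = self.attn(q_u, users, users)   # user-to-user
        pp, _ = self.attn(q_p, prods, prods)   # product-to-product
        up, _ = self.attn(q_u, prods, prods)   # user-to-product
        pu, _ = self.attn(q_p, users, users)   # product-to-user
        du, _ = self.attn(q_u, h_doc, h_doc)   # user attends to review tokens
        dp, _ = self.attn(q_p, h_doc, h_doc)   # product attends to review tokens

        feats = torch.cat(
            [h_cls] + [x.squeeze(1) for x in (uu, pp, up, pu, du, dp)], dim=-1
        )
        return self.classifier(feats)          # logits over sentiment labels
```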

Datasets
Our experiments are conducted on three benchmark English document-level sentiment analysis datasets: IMDb, Yelp-2013 and Yelp-2014 (Tang et al., 2015). Statistics of the three datasets are shown in Appendix A.1. All three are fine-grained sentiment analysis datasets: Yelp-2013 and Yelp-2014 have 5 classes, IMDb has 10. Each review is accompanied by its corresponding anonymized user ID and product ID.

Experimental Setup
The pre-trained language models we employ in our experiments are BERT (Devlin et al., 2019), SpanBERT (Joshi et al., 2020) and Longformer (Beltagy et al., 2020). We use the implementations from Huggingface (Wolf et al., 2019). The hyperparameters are selected empirically based on performance on the dev set, and we adopt an early stopping strategy. The maximum sequence length is set to 512 for all models. For evaluation, we employ two metrics: Accuracy and RMSE (Root Mean Square Error).
More training details are available in Appendix A.2.

Results
Results on the dev sets of IMDb, Yelp-2013 and Yelp-2014 for the BERT, SpanBERT and Longformer PLMs are shown in Table 1. We compare our approach to a vanilla user and product attention baseline where 1) the user and product representation matrices are randomly initialized, and 2) we simply employ multi-head attention between user/product and document representations without the user-product cross-context module. Our approach achieves consistent improvements over the baseline with all PLMs on all three datasets. For example, with BERT-base our approach improves over the baseline by 4.3 accuracy points on IMDb, 1.6 on Yelp-2013 and 1.7 on Yelp-2014. Moreover, our approach gives further improvements for large PLMs such as Longformer-large: 4.8 accuracy points on IMDb, 2.8 on Yelp-2013 and 2.1 on Yelp-2014. The improvements over the baseline are statistically significant (p < 0.01).
In Table 2, we compare our approach to previous approaches on the test sets of IMDb, Yelp-2013 and Yelp-2014. These include pre-BERT neural models, RRP-UPM (Yuan et al., 2019) and CHIM (Amplayo, 2019), and state-of-the-art models based on BERT: IUPC (Lyu et al., 2020), MA-BERT (Zhang et al., 2021) and Injectors (Amplayo et al., 2022). We use BERT-base for a fair comparison with IUPC, MA-BERT and Injectors, which all use BERT-base. Our model obtains the best performance on IMDb, Yelp-2013 and Yelp-2014, achieving absolute improvements in accuracy of 0.1, 1.2 and 0.9 respectively, and improvements in RMSE of 0.011, 0.018 and 0.010 respectively.

Ablation Study
Results of an ablation analysis are shown in Table 3. The first row gives the results of a BERT model without user and product information. The next three rows correspond to: 1) User-Product Information, where we use the same method as the baseline vanilla attention model in Table 1 to inject user-product information; 2) Textual Information, our proposed approach of using historical reviews to initialize user and product representations; 3) User-Product Cross-Context, our proposed module incorporating the associations between users and products. The results show, firstly, that user and product information is highly useful for sentiment classification and, secondly, that both the textual information of historical reviews and the user-product cross-context module improve sentiment classification.

Varying Number of Reviews
We investigate model performance with different numbers of reviews belonging to the same user/product. We randomly sample a proportion of each user's reviews (from 10% to 100%) and use the sampled training data, where each user only has part of their total reviews (e.g. 10%), to train sentiment classification models; a sketch of this sampling procedure is given below. We conduct experiments on Yelp-2013 and IMDb using IUPC (Lyu et al., 2020), MA-BERT (Zhang et al., 2021) and our approach. The results are shown in Figure 3, where the x-axis represents the proportion of reviews used in each experiment. When the proportion of reviews lies between 10% and 50%, our approach obtains superior performance compared to MA-BERT and IUPC, while the performance gain decreases when users have more reviews. The results show the advantage of our approach in a low-review scenario for users.
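The following is a rough sketch of the per-user sampling procedure described above; the field names and the per-user minimum of one review are assumptions rather than details given in the paper.

```python
import random
from collections import defaultdict

def sample_per_user(train_reviews, fraction, seed=42):
    """Keep roughly `fraction` of each user's training reviews (at least one).
    Each review is assumed to be a dict with a 'user_id' key."""
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for r in train_reviews:
        by_user[r["user_id"]].append(r)
    sampled = []
    for reviews in by_user.values():
        k = max(1, round(len(reviews) * fraction))
        sampled.extend(rng.sample(reviews, k))
    return sampled

# e.g. build training subsets with 10%, 20%, ..., 100% of each user's reviews
# subsets = {f: sample_per_user(train_reviews, f) for f in [i / 10 for i in range(1, 11)]}
```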

Scaling Factor for User/Product Matrix
We conduct experiments with different scaling factors (see Equation 3) on the dev sets of Yelp-2013 and IMDb using BERT-base and SpanBERT-base. We apply the same scaling factor to both the user and product matrices. The results are shown in Figure 4, where we use scaling factors ranging from 0.05 to 1.5 with intervals of 0.05. The results show that our proposed scaling heuristic based on the Frobenius norm (green dashed lines in Figure 4) yields competitive performance: the best accuracy according to the blue dashed line. Although the RMSE of the Frobenius norm heuristic is not always optimal, it is still relatively low compared to most of the other scaling factors (except the RMSE of SpanBERT-base on IMDb). Moreover, the Frobenius norm heuristic reduces the effort needed to tune the scaling factor: the optimal scaling factor varies across models and datasets, whereas the heuristic consistently provides a competitive dynamic scaling factor.

Conclusion and Future Work
In order to make the best use of user and product information in sentiment classification, we propose a text-driven approach: 1) explicitly utilizing historical reviews to initialize user and product representations, and 2) modeling associations between users and products with an additional user-product cross-context module. Experiments conducted on three English benchmark datasets (IMDb, Yelp-2013 and Yelp-2014) demonstrate that our approach substantially outperforms previous state-of-the-art approaches and is effective across several PLMs. For future work, we aim to apply our approach to more tasks where there is a need to learn representations for various types of attributes, and to explore other compositionality methods for generating user/product representations.

Limitations
The method introduced in this paper applies to a specific type of sentiment analysis task, where the item to be analysed is a review, the author of the review and the product/service being reviewed are known and uniquely identified, and the author (user) and product information is available for all reviews in the training set. While our approach is expected to perform well on languages beyond English, the experimental results do not necessarily support this, since our evaluation is only carried out on English data.


A.1 Datasets

Our experiments are conducted on three benchmark English document-level sentiment analysis datasets: IMDb, Yelp-2013 and Yelp-2014 (Tang et al., 2015). Statistics of the three datasets are shown in Table 4.

A.2 Hyperparameters
The metrics are calculated using the scripts in Pedregosa et al. (2011). All experiments are conducted on Nvidia GeForce RTX 3090 GPUs. We show the Learning Rate and Batch Size used to train our models on all datasets in Table 6.
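For reference, a minimal sketch of how the two metrics can be computed with scikit-learn (the library referenced above); this is illustrative rather than the exact evaluation script.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

def evaluate(gold_labels, predicted_labels):
    """Accuracy over the discrete classes and RMSE over the label values."""
    acc = accuracy_score(gold_labels, predicted_labels)
    rmse = np.sqrt(mean_squared_error(gold_labels, predicted_labels))
    return acc, rmse

# e.g. acc, rmse = evaluate([4, 2, 5], [4, 3, 5])
```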

A.3 Training Objective
We use cross-entropy to calculate the loss between our model predictions and the gold labels:

$$\mathcal{L} = -\sum_{i=1}^{n} \sum_{j=1}^{m} y_{i,j} \log p(y_{i,j} \mid D_i, u_i, p_i),$$

where $n$ is the number of samples, $m$ is the number of classes, and $y_{i,j}$ represents the gold probability of the $i$-th sample belonging to class $j$: $y_{i,j}$ is 1 only if the $i$-th sample belongs to class $j$ and 0 otherwise. $p(y_{i,j} \mid D_i, u_i, p_i)$ is the probability that the $i$-th sample belongs to class $j$ as predicted by our model.
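With integer gold labels, this objective corresponds to standard cross-entropy over the classifier logits; a minimal PyTorch illustration (shapes and names assumed):

```python
import torch
import torch.nn.functional as F

# logits: (n, m) unnormalized scores from the classification layer
# gold:   (n,)   integer class indices
logits = torch.randn(4, 5)
gold = torch.tensor([0, 3, 2, 4])
loss = F.cross_entropy(logits, gold)  # averages -log p(y_i | D_i, u_i, p_i) over samples
```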

A.5 Examples
Some cases sampled from the dev set of Yelp-2013 and the corresponding predictions from vanilla BERT without user and product information, IUPC (Lyu et al., 2020), MA-BERT (Zhang et al., 2021) and our model are shown in Table 8.

Figure 4: Effect of varying the scaling factor for the user/product matrices on the dev sets of Yelp-2013 (left) and IMDb (right), with BERT-base (top) and SpanBERT-base (bottom). The left and right y-axes in each subplot represent Accuracy and RMSE respectively. The x-axis represents the scaling factor. The vertical green dashed line is the scaling factor from the Frobenius norm heuristic. The blue and orange horizontal dashed lines are the accuracy and RMSE produced by the Frobenius norm heuristic, respectively.

Table 1: Results of our approach with various PLMs on the dev sets of IMDb, Yelp-2013 and Yelp-2014. We show the results of the baseline vanilla attention model for each PLM as well as the results of the same PLM with our proposed approach. We report the average of five runs with two metrics, Accuracy (↑) and RMSE (↓).

Table 2: Experimental results on the test sets of IMDb, Yelp-2013 and Yelp-2014. We report the average results of five runs with two metrics, Accuracy (↑) and RMSE (↓). The best performance is in bold.

Table 4: Number of documents per split and average document length for IMDb, Yelp-2013 and Yelp-2014.

The IMDb dataset has the longest documents, with an average length of approximately 395 words. The average number of reviews per user and product is shown in Table 5.

Table 5: Number of users and products, with the average number of documents per user and per product, in IMDb, Yelp-2013 and Yelp-2014.