Improving Document-Level Sentiment Analysis with User and Product Context

Past work that improves document-level sentiment analysis by encoding user and product information has been limited to considering only the text of the current review. We investigate incorporating additional review text available at the time of sentiment prediction that may prove meaningful for guiding prediction. Firstly, we incorporate all available historical review text belonging to the author of the review in question. Secondly, we investigate the inclusion of historical reviews associated with the current product (written by other users). We achieve this by explicitly storing representations of reviews written by the same user and about the same product, forcing the model to memorize all reviews for one particular user and product. Additionally, we drop the hierarchical architecture used in previous work to enable words in the text to directly attend to each other. Experimental results on the IMDB, Yelp 2013 and Yelp 2014 datasets show an improvement over the state of the art of more than 2 percentage points in the best case.


Introduction
Document-level sentiment analysis aims to predict the sentiment polarity of text that often takes the form of product or service reviews. Tang et al. (2015) demonstrated that modelling the individual who has written the review, as well as the product being reviewed, is worthwhile for polarity prediction, and this has led to exploratory work on how best to combine review text with user/product information in a neural architecture (Chen et al., 2016; Ma et al., 2017; Dou, 2017; Long et al., 2018; Amplayo, 2019; Amplayo et al., 2018). A feature common amongst past studies is that user and product IDs are modelled as embedding vectors whose parameters are learned during training. We take this idea a step further and represent users and products using the text of all the reviews belonging to a single user or product (see Fig. 1, left).
There are two reasons to incorporate review text into user/product modelling. Firstly, the reviews from a given user will reflect their word choices when conveying sentiment. For example, a typical user might use words such as fantastic or excellent with correspondingly high ratings but another user could use the same words sarcastically with a low rating. Similarly, a group of users writing a review of the same product may use the same or similar opinionated words to refer to that product. Secondly, learning meaningful user and product embeddings that are only updated by back propagation is difficult when a user or product only has a small number of reviews, whereas one may still be able to glean something useful from the text of even a small number of reviews.
A naive approach might compute representations of all the reviews of a given user or product each time we have a new training sample, but this would be too expensive. We instead propose the following incremental approach. With each new training sample, we obtain the review text representation, with BERT (Devlin et al., 2019) as our encoder, and use it together with the user and product vectors to obtain a user-biased document representation and a product-biased document representation, which are then employed to obtain sentiment polarity. We then add the user-biased and product-biased document representations to the corresponding user and product vectors, so that they are ready for the next sample. In doing so, we incrementally store and update representations of reviews for a given user and product. Unlike Ma et al. (2017), who use a hierarchical structure in which sentence representations are first computed before being combined into a document representation, we let the words in the text directly attend to each other. The architecture we propose is depicted on the right-hand side of Fig. 1 and is explained in more detail in Section 2.
We compare performance with a range of systems, and the results show that our approach improves on state-of-the-art results for all three benchmark datasets (IMDB, Yelp-2013 and Yelp-2014). We also compare to a version of our own system which does not use the review text representations to encode user and product information. While it performs competitively with other systems, demonstrating the efficacy of our basic architecture, it does not work as well as our proposed system, particularly for reviews associated with users or products that have only a small number of reviews.

Methodology
An overview of our model architecture is shown in Figure 1 (right). The input to our model consists of $d$, $u$ and $p$, which are the document, the user id and the product id respectively. $u$ and $p$ are both mapped to embedding vectors, $E_u, E_p \in \mathbb{R}^{h}$. $d$ is fed into the BERT encoder to generate a document representation $H_d \in \mathbb{R}^{L \times h}$, where $L$ is the length of the document after tokenization. We then inject $E_u$ and $E_p$ to get the user-product biased document representation $H_{biased} \in \mathbb{R}^{h}$. Finally, we feed $H_{biased}$ into a linear layer followed by a softmax layer to get the distribution of the sentiment label $y$. We use cross-entropy to calculate the loss between the predictions and the ground-truth labels.
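The final classification step described above (linear layer, softmax, cross-entropy) can be illustrated with a small dependency-free sketch. The hidden size of 3, the three-class label set, and all weight values are toy assumptions chosen for illustration; they are not the trained model's parameters.

```python
import math

def linear(x, weight, bias):
    """Affine map; weight has one row per class, each of length len(x)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, gold):
    """Negative log-probability assigned to the gold class index."""
    return -math.log(probs[gold])

# Toy biased document representation (stands in for H_biased, h = 3).
h_biased = [0.5, -1.0, 2.0]
# Toy classifier parameters for three sentiment classes (assumed values).
weight = [[0.1, 0.0, 0.2],
          [0.0, 0.3, -0.1],
          [0.2, 0.1, 0.4]]
bias = [0.0, 0.0, 0.0]

probs = softmax(linear(h_biased, weight, bias))
predicted = max(range(len(probs)), key=probs.__getitem__)
loss = cross_entropy(probs, gold=2)
```

In the actual system the input to this step would be the $h$-dimensional vector produced by BERT and the user/product injection, and the number of classes would match the rating scale of the dataset.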
Injecting user and product preferences We adopt stacked multi-head attention (Vaswani et al., 2017) to model the connections between the current document and the user/product vectors, which in this work correspond to all historical reviews composed by the user or about the product to date. In the attention function $\mathrm{Attention}(Q, K, V)$, $E_u$ and $E_p$ are regarded as queries, and $H_d$ as keys and values. We compute the user-specific document representation, $C^t_u$, and the product-specific document representation, $C^t_p$, as follows:

$$C^t_u = \mathrm{StackedAttention}(E_u, H_d, H_d), \qquad C^t_p = \mathrm{StackedAttention}(E_p, H_d, H_d) \tag{1}$$

where $t$ is the number of layers of the attention function. We adopt a gating mechanism to obtain importance vectors, $z_u$ and $z_p$, which control the contribution of the user-specific and product-specific document representations to the output classification. Finally, we obtain the biased document representation $H_{biased}$ by combining $H_{cls}$ with the gated representations $z_u \odot C^t_u$ and $z_p \odot C^t_p$, where $H_{cls} \in \mathbb{R}^{h}$ is the final hidden vector of the [CLS] token (Devlin et al., 2019) and $\odot$ is the element-wise product.
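To make the injection step concrete, the following is a heavily simplified sketch. It is not the authors' implementation and makes several assumptions: it uses a single attention head with no learned projections, "stacking" simply feeds each layer's output back in as the next query, the gate is a hypothetical element-wise sigmoid with scalar weights, and the final combination is an assumed additive form. All function names and toy values are illustrative.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys_values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, row)) / scale
              for row in keys_values]
    weights = softmax(scores)
    return [sum(w * row[j] for w, row in zip(weights, keys_values))
            for j in range(len(query))]

def stacked_attend(query, keys_values, t):
    """Apply attention t times; each output becomes the next query (assumed)."""
    c = query
    for _ in range(t):
        c = attend(c, keys_values)
    return c

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gate(c, h_cls, w_c=1.0, w_h=1.0, b=0.0):
    """Hypothetical gate: z = sigmoid(w_c * c + w_h * h_cls + b), element-wise."""
    return [sigmoid(w_c * ci + w_h * hi + b) for ci, hi in zip(c, h_cls)]

# Toy inputs: three token vectors (H_d), [CLS] vector, user/product vectors.
h_d = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h_cls = [0.5, 0.5]
e_u, e_p = [1.0, 0.0], [0.0, 1.0]

c_u = stacked_attend(e_u, h_d, t=2)   # user-specific document representation
c_p = stacked_attend(e_p, h_d, t=2)   # product-specific document representation
z_u, z_p = gate(c_u, h_cls), gate(c_p, h_cls)

# Assumed additive combination of [CLS] with the gated representations.
h_biased = [h + zu * cu + zp * cp
            for h, zu, cu, zp, cp in zip(h_cls, z_u, c_u, z_p, c_p)]
```

Because the user vector points along the first dimension, the user-specific representation puts more weight on tokens that are strong in that dimension, and the product vector does the same for the second dimension. The gates then scale how much of each representation is added to the [CLS] vector.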
Updating the user and product matrix To implement our idea of using all reviews composed by $u$ and all reviews about $p$, we incrementally add the current user-specific and product-specific document representations to the corresponding entries in the embedding matrix at each step during training. The added representations are scaled by $\lambda_u$ and $\lambda_p$ respectively, which are both learnable real numbers that control the degree to which the representation of the current document should be employed.
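The incremental update can be sketched as a small keyed store. This is a minimal illustration under stated assumptions: the class name `ContextMemory` is hypothetical, the scaling factor is a fixed constant here (in the paper it is learnable), vectors start at zero rather than as trained embeddings, and the update is a plain scaled addition with no further transformation.

```python
class ContextMemory:
    """Stores one running vector per user (or product) id."""

    def __init__(self, dim, lam=0.5):
        self.dim = dim
        self.lam = lam          # fixed here; a learnable scalar in the paper
        self.vectors = {}

    def get(self, key):
        # Unseen ids start from a zero vector (an assumption of this sketch).
        return self.vectors.setdefault(key, [0.0] * self.dim)

    def update(self, key, doc_repr):
        # Add the scaled document representation to the stored vector.
        current = self.get(key)
        self.vectors[key] = [e + self.lam * c
                             for e, c in zip(current, doc_repr)]

memory = ContextMemory(dim=2, lam=0.5)
memory.update("user_1", [1.0, 2.0])   # first review by user_1
memory.update("user_1", [2.0, 0.0])   # second review by user_1
```

After the two updates the vector for `user_1` reflects both reviews, so the next review by that user is scored against an accumulated summary of their history rather than against an embedding shaped by gradients alone.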

Experimental Setup
Our experiments are conducted on the IMDB, Yelp-13 and Yelp-14 benchmark datasets, statistics of which are shown in Table 1. We use the BERT-base model from HuggingFace (Wolf et al., 2019). We train our model with a learning rate chosen from {8e-6, 3e-5, 5e-5} and a weight decay rate chosen from {0, 1e-1, 1e-2, 1e-3}. The optimizer we use is AdamW (Loshchilov and Hutter, 2019). In our experiments, the number of attention layers $t$ is set to 5. The maximum sequence length for BERT is 512. We select the hyper-parameters achieving the best results on the dev set for evaluation on the test set. Evaluation metrics (Accuracy and RMSE) are calculated using scripts from Scikit-learn (Pedregosa et al., 2011).
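The hyper-parameter selection described above can be sketched as follows, assuming an exhaustive search over both value sets (the text does not state whether every combination was tried). The scoring function `toy_dev_score` is a placeholder that stands in for a real train-and-evaluate run; it does not reflect any actual result.

```python
from itertools import product

learning_rates = [8e-6, 3e-5, 5e-5]
weight_decays = [0.0, 1e-1, 1e-2, 1e-3]

# Every (learning rate, weight decay) combination: 3 x 4 = 12 configurations.
grid = [{"lr": lr, "weight_decay": wd}
        for lr, wd in product(learning_rates, weight_decays)]

def select_best(configs, dev_score):
    """Return the configuration with the highest dev-set score."""
    return max(configs, key=dev_score)

def toy_dev_score(config):
    # Placeholder scorer for illustration only; a real run would train the
    # model with `config` and return its accuracy on the dev set.
    return -abs(config["lr"] - 3e-5) - config["weight_decay"]

best = select_best(grid, toy_dev_score)
```

Only the configuration returned by `select_best` would then be evaluated on the test set, which keeps the test data out of the selection loop.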

Results
Our experimental results are shown in Table 2. Our proposed model is named IUPC (Incorporating User-Product Context). The first two rows are baseline models: BERT VANILLA, the basic BERT model that uses only the review text without user and product information, and IUPC W/O UPDATE, which is the same as our proposed model except that we do not update the user and product embedding matrix by incrementally adding the new review representations. The third row shows our proposed model. We also compare with results from the NLP-progress leaderboard for the following models:
CHIM (Amplayo, 2019) adopts a chunk-wise matrix representation for user/product attributes and injects user/product information in different locations.
CMA (Ma et al., 2017) encodes the document using a hierarchical LSTM and injects user and product information hierarchically.
DUPMN (Long et al., 2018) encodes the document using a hierarchical LSTM and adopts two memory networks, one for user information and another for product information.
HCSC (Amplayo et al., 2018) uses a combination of CNN and Bi-LSTM as the document encoder and injects user/product information with bias-attention.
HUAPA (Wu et al., 2018) adopts two hierarchical models to get user-specific and product-specific document representations respectively.
NSC (Chen et al., 2016) uses a hierarchical LSTM encoder that incorporates user/product attributes with word-level and sentence-level attention.
RRP-UPM (Yuan et al., 2019) uses two memory networks in addition to the user/product embeddings to get refined representations of user/product information.
UPDMN (Dou, 2017)
Our model achieves the best classification accuracy and RMSE on Yelp-2013 and Yelp-2014, and the best RMSE on IMDB. It outperforms previous state-of-the-art results by 1.5 accuracy points and 0.042 RMSE on Yelp-2013, by 2.1 accuracy points and 0.029 RMSE on Yelp-2014, and by 0.01 RMSE on IMDB. Moreover, it outperforms the two baselines, BERT VANILLA and IUPC W/O UPDATE, in both classification accuracy and RMSE on all three datasets. The classification accuracy of our model on IMDB is lower than that of most previous models. We suspect this is because BERT does not handle longer documents well: its input length is capped, and the average document length in the IMDB dataset is much greater than in the other two datasets. However, it is worth noting that our model achieves the lowest RMSE on IMDB, which means that its predictions are, on average, closer to the gold labels.
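The observation that a model can have lower accuracy yet a better RMSE follows directly from how the two metrics are defined. The sketch below uses plain Python rather than the Scikit-learn scripts used in the experiments, and the ratings are made-up toy values on a 1 to 10 scale, not data from any of the benchmarks.

```python
import math

def accuracy(gold, pred):
    """Fraction of predictions that match the gold label exactly."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def rmse(gold, pred):
    """Root mean squared error between gold and predicted ratings."""
    return math.sqrt(sum((g - p) ** 2 for g, p in zip(gold, pred)) / len(gold))

gold = [1, 5, 8, 10]
near_misses = [2, 6, 7, 9]   # never exactly right, always off by one
far_misses = [1, 5, 8, 1]    # right three times, badly wrong once

near_scores = (accuracy(gold, near_misses), rmse(gold, near_misses))
far_scores = (accuracy(gold, far_misses), rmse(gold, far_misses))
```

Here `near_misses` scores an accuracy of 0.0 with an RMSE of 1.0, while `far_misses` scores an accuracy of 0.75 with an RMSE of 4.5. Accuracy rewards exact matches only, whereas RMSE penalizes the size of each error, so a system whose mistakes are small can trail on accuracy and still lead on RMSE.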

Analysis
We analyse the results for reviews whose user or product do not have many reviews in the training set and compare our model's performance to the IUPC W/O UPDATE baseline for one dataset (Yelp-2013 dev).
We select only reviews where the number of reviews by that user or for that product falls below one of three thresholds: 40%, 60% and 80%, where % stands for the number of reviews for a given user/product relative to the average number of reviews for all users/products. Table 3 shows that our model performs better than IUPC W/O UPDATE when there are only a small number of previous reviews available for a given product/user. In other words, when a user or product does not have many reviews, its IUPC W/O UPDATE embedding, which is only updated by gradient descent, cannot capture user/product preference as well as our model, which explicitly takes advantage of historical review text in its user/product representations.
Table 3: Analysis of three lower-resource scenarios, where % denotes a threshold filter corresponding to the proportion of reviews available relative to the average number in the dataset, on Yelp-2013 (dev).
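The threshold filter can be sketched as below. This is one reading of the description, with assumptions: review counts are taken from the training set, the filter is applied to one field at a time (user or product), and the function name `low_resource_subset` and the dictionary layout of a review are hypothetical.

```python
from collections import Counter

def low_resource_subset(train_reviews, eval_reviews, key, ratio):
    """Keep eval reviews whose `key` (e.g. "user") has fewer training
    reviews than `ratio` times the average count per distinct key."""
    counts = Counter(review[key] for review in train_reviews)
    average = sum(counts.values()) / len(counts)
    # Counter returns 0 for ids never seen in training, so they are kept.
    return [review for review in eval_reviews
            if counts[review[key]] < ratio * average]

# Toy data: user "a" has 6 training reviews, "b" has 1, "c" has 2 (average 3).
train = [{"user": "a"}] * 6 + [{"user": "b"}] + [{"user": "c"}] * 2
dev = [{"user": "a"}, {"user": "b"}, {"user": "c"}]

subset_40 = low_resource_subset(train, dev, key="user", ratio=0.4)
subset_80 = low_resource_subset(train, dev, key="user", ratio=0.8)
```

With an average of 3 reviews per user, the 40% threshold is 1.2 and keeps only the review by `b`, while the 80% threshold is 2.4 and keeps the reviews by `b` and `c`. The same function applies to products by passing a different `key`.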
In order to get a better idea of where there is room for improvement for IUPC, we examine the 43 Yelp-13 dev set cases in which the predicted label differs from the gold label by more than two points. There are a handful of cases of sarcasm, e.g. "that lovely tempe waste/tap water taste in the food", but the most noteworthy phenomenon is mixed sentiment, e.g. "tacos were good the soup was not tasty", or the more subtle "brave the scary parking and lack of ambiance". It is not always clear from the reviews which aspect of the service the rating is directed towards. This suggests that aspect-based sentiment analysis (Pontiki et al., 2014) might be useful here, and training an IUPC model for this task is a possible avenue for future work.

Conclusion
In this paper, we propose a neural sentiment analysis architecture that explicitly utilizes all past reviews from a given user or product to improve sentiment polarity classification at the document level. Our experimental results on the IMDB, Yelp-13 and Yelp-14 datasets demonstrate that incorporating this additional context is effective, particularly for the Yelp datasets. The code used to run the experiments is available for use by the research community.