Recommend for a Reason: Unlocking the Power of Unsupervised Aspect-Sentiment Co-Extraction

Compliments and concerns in reviews are valuable for understanding users' shopping interests and their opinions on specific aspects of certain items. Existing review-based recommenders favor large and complex language encoders that can only learn latent and uninterpretable text representations. They lack explicit user-attention and item-property modeling, which, however, could provide valuable information beyond the ability to recommend items. Therefore, we propose a tightly coupled two-stage approach comprising an Aspect-Sentiment Pair Extractor (ASPE) and an Attention-Property-aware Rating Estimator (APRE). Unsupervised ASPE mines Aspect-Sentiment pairs (AS-pairs), and APRE predicts ratings using AS-pairs as concrete aspect-level evidence. Extensive experiments on seven real-world Amazon Review Datasets demonstrate that ASPE effectively extracts AS-pairs, which enable APRE to deliver superior accuracy over the leading baselines.


Introduction
Reviews and ratings are valuable assets for the recommender systems of e-commerce websites since they directly describe the users' subjective feelings about their purchases. Learning user preferences from such feedback is straightforward and efficacious. Previous research on review-based recommendation has been fruitful (Chin et al., 2018; Chen et al., 2018; Bauman et al., 2017; Liu et al., 2019). Cutting-edge natural language processing (NLP) techniques are applied to extract the latent user sentiments, item properties, and the complicated interactions between the two.
However, existing approaches have shortcomings that leave room for improvement. First, they dismiss the phenomenon that users may hold different attentions toward various properties of the merchandise. An item property is the combination of an aspect of the item and the characteristic associated with it. Users may show strong attention to certain properties but indifference to others. The attended advantageous or disadvantageous properties can dominate a user's attitude and, consequently, decide their generosity in rating.
Table 1 exemplifies the impact of user attitude using three real reviews for a headset. Three aspects are covered: microphone quality, comfortableness, and sound quality. The microphone quality is controversial: R2 and R3 criticize it but R1 praises it. The sole disagreement between R1 and R2, on the microphone (the major concern of R2), results in the divergence of their ratings (5 stars vs. 3 stars). However, R3 overlooks that disadvantage and grades highly (5 stars) for the superior comfortableness indicated by the metaphor of a "pillow".
Second, understanding user motivations at the level of granular item properties provides valuable information beyond the ability to recommend items. Doing so requires aspect-based NLP techniques that extract explicit and definitive aspects. However, existing aspect-based models mainly use latent or implicit aspects (Chin et al., 2018) whose real semantics are unjustifiable. Similar to Latent Dirichlet Allocation (LDA; Blei et al., 2003), the semantics of the derived aspects (topics) mutually overlap (Huang et al., 2020b). These models undermine the distinctiveness of the resulting aspects and lead to uninterpretable and sometimes counterintuitive results. The root of the problem is the lack of large review corpora with aspect and sentiment annotations: the existing ones are either too small or too domain-specific (Wang and Pan, 2018) to be applied to general use cases. Progress on sentiment term extraction (Dai and Song, 2019; Tian et al., 2020; Chen et al., 2020a) takes advantage of neural networks and linguistic knowledge and makes it partially possible to use unsupervised term annotation to tackle the lack-of-large-corpus issue.
Table 1: Example reviews of a headset with three aspects, namely microphone quality, comfort level, and sound quality, highlighted. The extracted sentiments follow each review.

R2 [3 stars]: "I love the comfort, sound, and style but the mic is complete junk!" — extracted: complete junk (angry); love; love.

R3 [5 stars]: ". . . But this one feels like a pillow, there's nothing wrong with the audio and it does the job. . . . con is that the included microphone is pretty bad." — extracted: pretty bad (unsatisfied); like a pillow (enjoyable); nothing wrong.

R1 vs. R2: Different users react differently (microphone quality) to the same item due to distinct personal attentions and, consequently, give divergent ratings. R1 vs. R3: A user can still rate an item highly due to special attention on particular aspects (comfort level) regardless of certain unsatisfactory or indifferent properties (microphone and sound qualities).

In this paper, we seek to understand how reviews and ratings are affected by users' perception of item properties in a fine-grained way and discuss how to utilize these findings transparently and effectively in rating prediction. We propose a two-stage recommender with an unsupervised Aspect-Sentiment Pair Extractor (ASPE) and an Attention-Property-aware Rating Estimator (APRE). ASPE extracts (aspect, sentiment) pairs (AS-pairs) from reviews. The pairs are fed into APRE as explicit carriers of user attention and item properties, indicating both the frequencies and the sentiments of aspect mentions. APRE encodes the text with a contextualized encoder and processes the implicit text features and the annotated AS-pairs with a dual-channel rating regressor. Together, ASPE and APRE extract explicit aspect-based attentions and properties and achieve strong rating prediction performance.
Aspect-level user attitude differs from user preference. The user attitudes produced by the interactions of user attentions and item properties are sophisticated and granular sentiments that serve as rationales for interpretation (see Sections 4.4 and A.3.5). Preferences, on the contrary, are coarse sentiments such as like, dislike, or neutral. Preference-based models may infer that R1 and R3 are written by headset lovers because of the high ratings. Attitude-based methods instead understand that it is the comfortableness that matters to R3 rather than the item being a headset. Aspect-level attitude modeling is thus more accurate, informative, and personalized than preference modeling.
Note. Due to the page limit, some supportive materials, marked by "†", are presented in the Supplementary Materials. We strongly recommend that readers check out these materials. The source code of our work is available on GitHub at https://github.com/zyli93/ASPE-APRE.

Related Work
Our work is related to four lines of literature at the intersection of ABSA and recommender systems.

Aspect-based Sentiment Analysis
Aspect-based sentiment analysis (ABSA) (Xu et al., 2020; Wang et al., 2018) predicts sentiments toward aspects mentioned in the text. Natural language is modeled by graphs in (Zhang et al., 2019; Wang et al., 2020), such as Pointwise Mutual Information (PMI) graphs and dependency graphs. Phan and Ogunbona (2020) and Tang et al. (2020) utilize contextualized language encoding to capture the context of aspect terms. Chen et al. (2020b) focus on the consistency of the emotion surrounding the aspects, and Du et al. (2020) equip pre-trained BERT with domain awareness of sentiments. Our work is informed by this progress, utilizing PMI, dependency trees, and BERT for syntactic feature extraction and language encoding.

Aspect or Sentiment Terms Extraction
Aspect and sentiment term extraction is a prerequisite of ABSA. However, manually annotating training data, which requires the hard labor of experts, is only feasible on small datasets in particular domains such as Laptop and Restaurant (Pontiki et al., 2014, 2015), which are overused in ABSA.
Recently, RINANTE (Dai and Song, 2019) and SDRN (Chen et al., 2020a) automatically extract both kinds of terms using rule-guided data augmentation and double-channel opinion-relation co-extraction, respectively. However, these supervised approaches are too domain-specific to generalize to out-of-domain or open-domain corpora. Conducting domain adaptation from small labeled corpora to unlabeled open corpora only produces suboptimal results (Wang and Pan, 2018). SKEP (Tian et al., 2020) exploits an unsupervised PMI+seed strategy to coarsely label sentimentally polarized tokens as sentiment terms, showing that the unsupervised method is advantageous when annotated corpora are insufficient in the domain of interest.
Compared to the above models, our ASPE has two merits: it is (1) unsupervised and hence free from expensive data labeling, and (2) generalizable to different domains by combining three different labeling methods.

Aspect-based Recommendation
Aspect-based recommendation is a relevant task with a major difference: the specific terms indicating sentiments are not extracted; only the aspects are needed (Hou et al., 2019; Guan et al., 2019; Huang et al., 2020a; Chin et al., 2018). Some disadvantages are summarized as follows. First, the aspect extraction tools are usually outdated and inaccurate, such as LDA (Hou et al., 2019), TF-IDF (Guan et al., 2019), and word embedding-based similarity (Huang et al., 2020a). Second, the representation of sentiment is scalar-based, which is coarser than the embedding-based representation used in our work.

Rating Prediction
Rating prediction is an important task in recommendation. Related approaches utilize text mining algorithms to build user and item representations and predict ratings (Kim et al., 2016; Zheng et al., 2017; Chen et al., 2018; Chin et al., 2018; Liu et al., 2019; Bauman et al., 2017). However, the text features learned are latent and unable to provide explicit hints for explaining user interests.

Problem Formulation
Review-based rating prediction involves two major entities: users and items. A user u writes a review r_{u,t} for an item t and rates it with a score s_{u,t}. Let R_u denote all reviews written by u and R_t denote all reviews received by t. A rating regressor takes in a tuple of a review-and-rate event (u, t) and the review sets R_u and R_t to estimate the rating score s_{u,t}.

Unsupervised ASPE
We combine three separate methods to label AS-pairs without the need for supervision, namely PMI-based, neural network-based (NN-based), and language knowledge- or lexicon-based methods. The framework is visualized in Figure 1.

[Figure 1: The ASPE framework. Sentiment terms (ST) from the lexicon-, PMI-, and NN-based methods are combined; extracted AS-pairs, e.g., (Aspect 1, Sentiment 1) and (Aspect 2, Sentiment 2), are highlighted in green.]

Sentiment Terms Extraction
PMI-based method. Pointwise Mutual Information (PMI) originates from information theory and has been adapted to NLP (Zhang et al., 2019; Tian et al., 2020) to measure statistical word associations in corpora. It determines the sentiment polarities of words using a small number of carefully selected positive and negative seeds (s^+ and s^-) (Tian et al., 2020). It first extracts candidate sentiment terms satisfying the part-of-speech patterns of Turney (2002) and then measures the polarity of each candidate term w by

Polarity(w) = \sum_{s \in s^+} PMI(w, s) - \sum_{s \in s^-} PMI(w, s).

Given a sliding-window-based context sampler ctx, the PMI(·, ·) between words is defined by

PMI(w_1, w_2) = \log \frac{p(w_1, w_2)}{p(w_1)\, p(w_2)},

where p(·), the probability estimated by token counts, is defined by p(w_1, w_2) = |{ctx | w_1, w_2 ∈ ctx}| / (total #ctx) and p(w_1) = |{ctx | w_1 ∈ ctx}| / (total #ctx). Afterward, we collect the top-q sentiment tokens with strong polarities, both positive and negative, as ST_PMI.
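The seed-based PMI scoring can be sketched in a few lines of pure Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the sliding-window context sampler, the part-of-speech candidate filter, and the top-q selection are assumed to happen elsewhere, and the seed words are assumed to occur in the corpus.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_polarities(contexts, pos_seeds, neg_seeds):
    """Score candidate words by PMI against sentiment seed sets.

    `contexts` is a list of token windows produced by a sliding-window
    sampler; polarity(w) = sum_{s in s+} PMI(w, s) - sum_{s in s-} PMI(w, s).
    Seeds are assumed to appear in at least one context."""
    n = len(contexts)
    single = Counter()   # |{ctx | w in ctx}|
    joint = Counter()    # |{ctx | w1, w2 in ctx}|
    for ctx in contexts:
        words = set(ctx)
        single.update(words)
        joint.update(frozenset(p) for p in combinations(sorted(words), 2))

    def pmi(w1, w2):
        p_joint = joint[frozenset((w1, w2))] / n
        if p_joint == 0:
            return 0.0  # unseen pair: no association evidence
        return math.log(p_joint / ((single[w1] / n) * (single[w2] / n)))

    scores = {}
    for w in single:
        if w in pos_seeds or w in neg_seeds:
            continue
        scores[w] = (sum(pmi(w, s) for s in pos_seeds)
                     - sum(pmi(w, s) for s in neg_seeds))
    return scores
```

For instance, with windows where "sturdy" co-occurs with the seed "good" and "flimsy" with the seed "bad", the former receives a positive polarity and the latter a negative one; words with no sentiment association score near zero.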
NN-based method. As discussed in Section 2, co-extraction models (Dai and Song, 2019) can accurately label AS-pairs only in their training domain. For sentiment terms with consistent semantics across domains, such as good and great, NN methods can still provide robust extraction recall. In this work, we take a pretrained SDRN (Chen et al., 2020a) as the NN-based method to generate ST_NN. The pretrained SDRN is considered an off-the-shelf tool, similar to pretrained BERT, and is irrelevant to our rating prediction data. Therefore, we argue ASPE is unsupervised for open-domain rating prediction.
Knowledge-based method. The PMI- and NN-based methods have shortcomings: the PMI-based method depends on seed selection, and the accuracy of the NN-based method deteriorates when the application domain is distant from the training data. As compensation, we integrate a sentiment lexicon ST_Lex compiled by linguists, since expert knowledge is widely used in unsupervised learning. Examples of linguistic lexicons include SentiWordNet (Baccianella et al., 2010) and the Opinion Lexicon (Hu and Liu, 2004); the latter is used in this work.
Building the sentiment term set. The three sentiment term subsets are unioned to build the overall sentiment set used in AS-pair generation: ST = ST_PMI ∪ ST_NN ∪ ST_Lex. The three sets compensate for each other's discrepancies and expand the coverage of terms, as shown in Table 10 †.

Syntactic AS-pairs Extraction
To extract AS-pairs, we first label AS-pair candidates using dependency parsing and then filter out non-sentiment-carrying candidates using ST. Dependency parsing extracts the syntactic relations between words. Some nouns are considered potential aspects and are modified by adjectives through two types of dependency relations shown in Figure 2: amod and nsubj+acomp. The pairs of nouns and their modifying adjectives compose the AS-pair candidates. Similar techniques are widely used in unsupervised aspect extraction models (Tulkens and van Cranenburgh, 2020; Dai and Song, 2019). AS-pair candidates are noisy since not all adjectives in them bear sentiment inclination. ST comes into use to filter out non-sentiment-carrying AS-pair candidates whose adjective is not in ST; the remaining candidates form the AS-pair set. Admittedly, the dependency-based extraction of (noun, adj.) pairs is suboptimal and misses some aspect or sentiment terms; an implicit module is designed to remedy this issue. Open-domain AS-pair co-extraction is blocked by the lack of public labeled data and is left for future work.
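The two dependency rules and the ST filter can be sketched as follows. This is a hedged, self-contained sketch: in practice a dependency parser (e.g., spaCy or Stanza) supplies the parse, so here we mimic its output with plain dicts carrying `text`, `pos`, `dep`, and `head` (the index of the syntactic head); the dict schema and `ItemTok` handling are illustrative assumptions.

```python
def extract_as_pairs(parsed_tokens, sentiment_terms,
                     item_pronouns=frozenset({"it", "they"})):
    """Extract (aspect, sentiment) pair candidates from one parsed
    sentence and keep only those whose adjective is in ST.

    Rule 1 (amod):        adjective directly modifies a noun.
    Rule 2 (nsubj+acomp): a noun subject and an adjectival complement
                          share the same verb head; a pronominal subject
                          referring to the item becomes ItemTok."""
    candidates = []
    for tok in parsed_tokens:
        if tok["pos"] != "ADJ":
            continue
        head = parsed_tokens[tok["head"]]
        if tok["dep"] == "amod" and head["pos"] == "NOUN":
            candidates.append((head["text"], tok["text"]))
        elif tok["dep"] == "acomp":  # e.g., "the mic is terrible"
            for other in parsed_tokens:
                if other["dep"] == "nsubj" and other["head"] == tok["head"]:
                    aspect = ("ItemTok"
                              if other["text"].lower() in item_pronouns
                              else other["text"])
                    candidates.append((aspect, tok["text"]))
    # ST filter: drop candidates whose adjective carries no sentiment
    return [(a, s) for a, s in candidates if s in sentiment_terms]
```

On "great sound" the amod rule fires, and on "it is terrible" the nsubj+acomp rule fires with the pronoun mapped to ItemTok; an adjective outside ST would be discarded at the final filter.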
We introduce ItemTok as a special aspect token for the nsubj+acomp rule where the nsubj is a pronoun referring to the item, such as it or they. Infrequent aspect terms with fewer than c occurrences are ignored to reduce sparsity. We use WordNet synsets (Miller, 1995) to merge synonymous aspects; the aspect with the most synonyms is selected as the representative of that aspect set.

Discussion. ASPE differs from Aspect Extraction (AE) (Tulkens and van Cranenburgh, 2020; Luo et al., 2019; Wei et al., 2020; Ma et al., 2019; Angelidis and Lapata, 2018; Xu et al., 2018; Shu et al., 2017; He et al., 2017a), which extracts aspects only and infers sentiment polarities in {pos, neg, (neu)}. AS-pair co-extraction, in contrast, offers more diversified emotional signals than the bipolar sentiment measurement of AE.
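The frequency filtering and synonym merging step can be sketched as below. This is an assumption-laden illustration: in the paper the synonym sets come from WordNet, while here `synsets` is a hypothetical precomputed map from each term to its synonym set, and the representative is chosen as the member with the largest synonym set, mirroring the rule above.

```python
def merge_aspects(aspect_counts, synsets, min_count=5):
    """Merge synonymous aspect terms and drop infrequent ones.

    `aspect_counts` maps aspect term -> occurrence count;
    `synsets` maps term -> synonym set (e.g., derived from WordNet).
    Returns (representative, sorted member list) per merged group."""
    # drop aspects with fewer than `min_count` occurrences
    frequent = {a: c for a, c in aspect_counts.items() if c >= min_count}
    groups, seen = [], set()
    for a in frequent:
        if a in seen:
            continue
        # group a with its synonyms that survived the frequency filter
        members = {a} | (synsets.get(a, set()) & frequent.keys())
        seen |= members
        # the member with the most synonyms represents the group
        rep = max(members, key=lambda m: len(synsets.get(m, set())))
        groups.append((rep, sorted(members)))
    return groups
```

For example, frequent terms "price" and "cost" collapse into one group represented by "price", while a rare term like "smell" is dropped before merging.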

APRE
APRE, depicted in Figure 3, predicts ratings given reviews and the corresponding AS-pairs. It first encodes language into embeddings, then learns explicit and implicit features, and finally computes the score regression. One distinctive feature of APRE is that it explicitly models the aspect information by incorporating a d_a-dimensional aspect representation a_i ∈ R^{d_a} in each side of the substructures for review encoding. Let A^{(u)} = {a_1^{(u)}, ..., a_k^{(u)}} denote the k aspect embeddings for users and A^{(t)} those for items, where k is decided by the number of unique aspects in the AS-pair set.

Language encoding
The reviews are encoded into low-dimensional token embedding sequences by a fixed pre-trained BERT (Devlin et al., 2019), a powerful transformer-based contextualized language encoder. For each review r in R_u or R_t, the resulting encoding H^0 ∈ R^{(|r|+2)×d_e} consists of (|r| + 2) d_e-dimensional contextualized vectors:

H^0 = BERT([CLS] ⊕ r ⊕ [SEP]).

[Figure 3: The APRE user-reviews encoder (the item tower is identical). Legend: aspect annotations; shared aspect representations; review representations; sentiment annotations/embeddings; computational modules/operations. Features pass through review-wise and aspect-wise aggregation.]
[CLS] and [SEP] are two special tokens indicating the start and the separators of sentences. We use a trainable linear transformation, h^1_i = W_ad^T h^0_i + b_ad, to adapt the BERT output representation H^0 to our task as H^1, where W_ad ∈ R^{d_e×d_f}, b_ad ∈ R^{d_f}, and d_f is the transformed dimension of internal features. BERT encodes token semantics based upon context, which resolves the polysemy of certain sentiment terms, e.g., "cheap" is positive for price but negative for quality. This step adapts the sentiment encoding for attention-property modeling.
Explicit aspect-level attitude modeling. For each aspect a among the k total aspects, we pull out the contextualized representations of all sentiment words that modify a and aggregate them into a single embedding of aspect a in review r:

h_{u,r}^{(a)} = \sum_j h^1_j, for w_j ∈ ST ∩ r such that w_j modifies a.
An observation by Chen et al. (2020b) suggests that users tend to use semantically consistent words for the same aspect in reviews. Therefore, sum-pooling nicely handles both the sentiments and the frequencies of term mentions. Aspects not mentioned by r have h_{u,r}^{(a)} = 0. To completely picture user u's attention to all aspects, we aggregate all reviews from u, i.e., R_u, using review-wise aggregation weighted by α_{u,r}^{(a)}:

g_u^{(a)} = \sum_{r ∈ R_u} α_{u,r}^{(a)} h_{u,r}^{(a)}.

We thereby obtain the user attention representation for aspect a, g_u^{(a)}. The item-tower architecture is omitted in Figure 3 since item property modeling shares an identical computing procedure; it generates the item property representations g_t^{(a)} of G_t. Mutual attention (Liu et al., 2019; Tay et al., 2018; Dong et al., 2020) is not utilized since the generation of the user attention encodings G_u is independent of the item properties, and vice versa.
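The two aggregation steps, sum-pooling sentiment-word embeddings within a review and attention-weighted pooling across reviews, can be sketched in pure Python. Note that the parameterization of the attention scores α is not specified in this excerpt; the sketch assumes softmax-normalized scalar scores, which is one common choice, not necessarily the authors' exact formulation.

```python
import math

def sum_pool(vectors, dim):
    """Sum-pool the sentiment-word embeddings for one aspect in one
    review; an aspect not mentioned in the review gets the zero vector."""
    out = [0.0] * dim
    for v in vectors:
        out = [x + y for x, y in zip(out, v)]
    return out

def aggregate_reviews(per_review_embs, scores):
    """Review-wise aggregation g = sum_r alpha_r * h_r, where the
    attention weights alpha are a softmax over per-review scores."""
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    alphas = [e / z for e in exps]
    dim = len(per_review_embs[0])
    g = [0.0] * dim
    for a, h in zip(alphas, per_review_embs):
        g = [gi + a * hi for gi, hi in zip(g, h)]
    return g
```

With equal scores the aggregation reduces to an average of the per-review aspect embeddings, while larger scores let an attended review dominate the user's representation for that aspect.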
Implicit review representation. Existing works discussed in Section 2 acknowledge that implicit semantic modeling is critical because some emotions are conveyed without explicit sentiment word mentions. For example, "But this one feels like a pillow . . ." in R3 of Table 1 does not contain any sentiment tokens but expresses strong satisfaction with the comfortableness, which would be missed by the extractive annotation-based ASPE.
In APRE, we combine a global feature h^1_{[CLS]}, a local context feature h_cnn ∈ R^{n_c} learned by a convolutional neural network (CNN) with output channel size n_c, kernel size n_k, and max pooling, and two token-level features, the average and max pooling of H^1, to build a comprehensive multi-granularity review representation:

v_{u,r} = [h^1_{[CLS]}; h_cnn; avg(H^1); max(H^1)].

We apply review-wise aggregation without aspects for the latent review embedding:

v_u = \sum_{r ∈ R_u} β_{u,r} v_{u,r},

where β_{u,r} is the counterpart of α_{u,r}^{(a)} in the implicit channel, computed with a trainable parameter w_im ∈ R^{d_im}, and d_im = 3d_f + n_c. Using similar steps, we also obtain v_t for the item implicit embeddings.
Rating regression and optimization. The implicit features v_u and v_t and the explicit features G_u and G_t compose the input to the rating predictor, which estimates the score by

ŝ_{u,t} = F_im([v_u; v_t]) + b_u + b_t + ⟨γ, F_ex([G_u; G_t])⟩,

where F_im : R^{2d_im} → R and F_ex : R^{2d_f×k} → R^k are multi-layer fully-connected neural networks with ReLU activation and dropout to avoid overfitting. They model the user-attention and item-property interactions in the implicit and explicit channels, respectively. ⟨·, ·⟩ denotes the inner product; γ ∈ R^k and b_u, b_t ∈ R are trainable parameters. The objective for the trainable parameter set Θ, with an L_2 regularization weighted by λ, is

J(Θ) = \sum_{(u,t)} (ŝ_{u,t} - s_{u,t})^2 + λ ‖Θ‖_2^2,

which is optimized by back-propagation methods such as Adam (Kingma and Ba, 2014).
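As a minimal numeric sketch of how the two channels combine (not the trained model: the outputs of F_im and F_ex are assumed precomputed scalars and vectors here), the final score and the per-aspect contributions used later in the case study look like this:

```python
def predict_rating(f_im, f_ex, gamma, b_u, b_t):
    """Combine the channels: implicit scalar output, user/item biases,
    and the inner product of gamma with the k-dimensional explicit
    output, i.e. s_hat = F_im + b_u + b_t + <gamma, F_ex>."""
    explicit = sum(g * f for g, f in zip(gamma, f_ex))
    return f_im + b_u + b_t + explicit

def aspect_contributions(f_ex, gamma):
    """Per-aspect contribution gamma_i * F_ex(.)_i, the quantity
    inspected in the interpretation case study."""
    return [g * f for g, f in zip(gamma, f_ex)]
```

Because the explicit term is a plain inner product, the prediction decomposes exactly into per-aspect summands, which is what makes the aspect-level interpretation possible.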

Experimental Setup
Datasets. We use seven datasets from the Amazon Review Datasets (He and McAuley, 2016), namely Automotive (AM), Digital Music (DM), Musical Instruments (MI), Pet Supplies (PS), Sports and Outdoors (SO), Toys and Games (TG), and Tools and Home Improvement (TH). Their statistics are shown in Table 2.
We use an 8:1:1 train, validation, and test split for all experiments. Users and items with fewer than 5 reviews, and reviews with fewer than 5 words, are removed to reduce data sparsity.
Baseline models. Thirteen baselines in the traditional and deep learning categories are compared with the proposed framework. The pre-deep-learning traditional approaches predict ratings solely based upon entity IDs. Table 3 introduces their basic profiles, which are extended in Section A.3.3 †. Specially, AHN-B refers to AHN using pretrained BERT as the input embedding encoder; it is included to test the impact of the input encoders.
Evaluation metric. We use Mean Square Error (MSE) for performance evaluation. Given a test set R_test, the MSE is defined by

MSE = \frac{1}{|R_test|} \sum_{(u,t) ∈ R_test} (ŝ_{u,t} - s_{u,t})^2.

Reproducibility. We provide instructions to reproduce the AS-pair extraction of ASPE and the rating prediction of the baselines and APRE in Section A.3.1 †. The source code of our models is publicly available on GitHub.
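The standard MSE metric over the test set is a one-liner; a small sketch for completeness:

```python
def mse(predictions, ratings):
    """Mean Square Error over a test set:
    MSE = (1/|R_test|) * sum (s_hat - s)^2."""
    assert len(predictions) == len(ratings) and len(ratings) > 0
    return sum((p - r) ** 2 for p, r in zip(predictions, ratings)) / len(ratings)
```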

AS-pair Extraction of ASPE
We present the extraction performance of the unsupervised ASPE. The frequency distributions of the extracted AS-pairs in Figure 5 follow the trend of Zipf's Law with a deviation common to natural languages (Li, 1992), meaning that ASPE performs consistently across domains. We show the qualitative results of term extraction separately.
Sentiment terms. Generally, the AS-pair statistics given in Table 9 † for the different datasets are quantitatively consistent with the data statistics in Table 2 regardless of domain. Figure 4 is a Venn diagram showing the sources of the sentiment terms extracted by ASPE from AM. All three methods are efficacious and contribute uniquely, which is also verified by Table 10 † in Section A.3.2 †.

Aspect terms. Table 4 presents the most frequent aspect terms of all datasets. ItemTok ranks at the top as users tend to describe overall feelings about items. Domain-specific terms (e.g., car in AM) and general terms (e.g., price, quality, and size) are intermingled, illustrating the comprehensive coverage and high accuracy of ASPE.

Rating Prediction of APRE
Comparisons with baselines. For the task of review-based rating prediction, even a small percentage improvement is considered significant (Chin et al., 2018; Tay et al., 2018). According to Table 5, our model outperforms all baseline models, including AHN-B, on all datasets, by a minimum of 1.337% on MI and a maximum of 4.061% on TG, which are significant improvements. This demonstrates (1) the superior capability of our model to make accurate rating predictions in different domains (ours vs. the rest), and (2) that the performance improvement is NOT merely due to the use of BERT (ours vs. AHN-B), since AHN-B underperforms our model. Section A.3.4 † reports the hyper-parameter sensitivities to changes in the internal feature dimensions (d_a, d_f, and n_c), the CNN kernel size n_k, and the L_2 regularization weight λ.
Efficiency. A brief run-time analysis of APRE is given in Table 6. The model runs fast when all data fit in GPU memory, as for AM and MI, which demonstrates the efficiency of our model and leaves room for improving the run time on datasets that cannot fit in GPU memory. The efficiency of ASPE is less critical since it only runs once per dataset.

Case Study for Interpretation
Finally, we showcase an interpretation procedure for the rating estimation of an instance in AM: how does APRE predict u*'s rating for a smart driving assistant t* using the output AS-pairs of ASPE?
We select seven example aspect categories with all review snippets mentioning those categories. Each category is a set of similar aspect terms, e.g., {look, design} and {beep, sound}. Without loss of generality, we refer to the categories as aspects. Table 7 presents the aspects and review snippets written by u* and received by t*, with AS-pair annotations. Three aspects, {battery, install, look}, are shared (yellow rows). Each side has two unique aspects never mentioned by the reviews of the other side: {materials, smell} for u* (green rows) and {price, sound} for t* (blue rows). APRE measures the aspect-level contributions of the user-attention and item-property interactions via the last term of the ŝ_{u,t} prediction, i.e., ⟨γ, F_ex([G_u; G_t])⟩. The contribution of the i-th aspect is calculated as the i-th dimension of γ times the i-th value of F_ex([G_u; G_t]), shown in Table 8. The top two rows summarize the attentions of u* and the properties of t*. Inferred Impact states the interactional effects of user attentions and item properties based on our assumption that attended aspects bear stronger impacts on the final prediction. On the overlapping aspects, the inferior battery property produces the only negative score (-0.008), whereas the advantages in install and look create positive scores (0.019 and 0.015), consistent with the inferred impact. Other aspects, unknown either to the user's attentions or to the item's properties, contribute relatively less: t*'s unappealing price accounts for the small score of 0.009, and the mixed property of sound accounts for the 0.006.
This case study demonstrates the usefulness of the numbers that add up to ŝ_{u,t}. Although small in scale, they carry significant information about the valued or disliked aspects in u*'s perception of t*. This decomposition is an effective way to interpret model predictions at an aspect-level granularity, a capability that the baseline models do not enjoy.
In Section A.3.5 †, another case study indicates that an imperfect item property receiving no user attention only marginally affects the rating, even though the aspect is mentioned in the user's reviews.

Conclusion
In this work, we propose a tightly coupled two-stage review-based rating predictor consisting of an Aspect-Sentiment Pair Extractor (ASPE) and an Attention-Property-aware Rating Estimator (APRE). ASPE extracts aspect-sentiment pairs (AS-pairs) from reviews, and APRE learns explicit user attentions and item properties as well as implicit sentence semantics to predict the rating. Extensive quantitative and qualitative experimental results demonstrate that ASPE accurately and comprehensively extracts AS-pairs without using domain-specific training data, and that APRE outperforms the state-of-the-art recommender frameworks and explains the prediction results by taking advantage of the extracted AS-pairs.
Several challenges are left open, such as fully or weakly supervised open-domain AS-pair extraction and an end-to-end design for joint AS-pair extraction and rating prediction. We leave these problems for future work.

Table 8: Attention and property summaries, inferred impacts, and the learned aspect-level contributions γ_i F_ex(·)_i (×10^-2): 1.0, 0.8, -0.8, 1.9, 1.5, 0.9, 0.6.

Acknowledgement
We would like to thank the reviewers for their helpful comments. The work was partially supported by NSF DGE-1829071 and NSF IIS-2106859.

Broader Impact Statement
This paper proposes a rating prediction model with great potential to be widely applied to review-based recommender systems due to its high accuracy. At the same time, it alleviates the unjustifiability issue of black-box neural networks by suggesting which aspects of an item a user may feel satisfied or dissatisfied with. The recommender system can thus better understand the rationale behind users' reviews, so that the merits of items can be carried forward while the defects can be fixed. To the best of our knowledge, this is the first work that addresses both rating prediction and rationale understanding using NLP techniques.
We then address generalizability and deployment. The reported experiments are conducted on different English-language domains with distinct review styles and diverse user populations. We observe that our model performs consistently, which supports its generalizability. Ranging from smaller to larger datasets, we have not noticed any potential deployment issues. Rather, we note that stronger computational resources can greatly speed up training and inference and scale up the problem size while keeping the major execution pipeline unchanged.
In terms of potential harms and misuses, we believe they involve two perspectives: (1) the harm of generating inaccurate or suboptimal results from this recommender; and (2) the risk of misuse (attack) of this model to reveal user identities. For point (1), the potential risk of suboptimal results has little impact on the major functions of online shopping websites, since recommenders are only in charge of suggestive content. For point (2), our model does not involve user or item ID modeling. Also, we aggregate the user reviews in the representation space so that user identity is hard to infer through reverse-engineering attacks. In all, we believe our model poses little risk of causing dysfunction of online shopping platforms or leakage of user identities.

A.3.2 ASPE: Additional Experimental Results of AS-pair Extraction
We present in Table 9 the statistics of the extracted AS-pairs of the corpora, which are quantitatively consistent with the data statistics in Table 2. We provide Table 10 as ancillary to the Venn diagram in Figure 4 and the corresponding conclusion in Section 4.2. Table 10 illustrates the contributions of the three distinct sentiment term extraction methods discussed in Section 3.2, namely the PMI-based, neural network-based, and lexicon-based methods. All three methods extract useful sentiment-carrying words in the Automotive domain, and none of their contributions overwhelms the others, which strongly supports the necessity of combining unsupervised extraction methods in the domain-general usage scenario. Altogether, they provide comprehensive coverage of the sentiment terms in AM.

A.3.3 APRE: Information of Baselines
We introduce the baseline models mentioned in Table 3, including the sources of their implementations and the key parameter settings. For fairness of comparison, we only compare against models that have open-source implementations.
MF, WRMF, FM, and NeuMF. Matrix factorization views the user-item ratings as a matrix with missing values; by factorizing the matrix with the known values, it recovers the missing values as predictions. Weighted Regularized MF (Hu et al., 2008) assigns different weights to the values in the matrix. Factorization machines (Rendle, 2010) consider additional second-order feature interactions of users and items. Neural MF (He et al., 2017b) is a combination of generalized MF (GMF) and a multi-layer perceptron (MLP). Hyper-parameter settings: the number of factors is 200; the regularization weight is 0.0001; we run 50 epochs with a learning rate of 0.01, except on MI, which uses a learning rate of 0.02 for MF and FM; the dropout of NeuMF is set to 0.2.
ConvMF. A CNN-based model proposed by Kim et al. (2016) that utilizes a convolutional neural network (CNN) for feature encoding of text embeddings. Hyper-parameter settings: the regularization factor is 10 for the user model and 100 for the item model; the dropout rate is 0.2.
ANR. The Aspect-based Neural Recommender (Chin et al., 2018) first proposed aspect-level representations of reviews, but its aspects are completely latent, without constraints or definitions on their semantics. Hyper-parameter settings: L_2 regularization is 1 × 10^-6; the learning rate is 0.002; the dropout rate is 0.5; we use 300-dimensional pretrained Google News word embeddings.
DeepCoNN. DeepCoNN (Zheng et al., 2017) separately encodes user reviews and item reviews with complex neural networks. Hyper-parameter settings: the learning rate is 0.002 and the dropout rate is 0.5; the word embeddings are the same as for ANR.
NARRE. A model similar to DeepCoNN, enhanced by an attention mechanism (Chen et al., 2018): attentional weights are assigned to each review to measure its importance. Hyper-parameter settings: the L_2 regularization weight is 0.001; the learning rate is 0.002; the dropout rate is 0.5; the word embeddings are the same as for ANR.

D-Attn15 Dual attention-based model (Seo et al., 2017) that utilizes CNNs as text encoders and builds local and global attention (dual attention) for user and item reviews. Hyper-parameter settings: In accordance with the paper, we used 100-dimensional word embeddings. The number of factors is 200. The dropout rate is 0.5. The learning rate and regularization weight are both 0.001.

MPCN Multi-Pointer Co-Attention Network (Tay et al., 2018) selects the important reviews via pointer networks to build the user profile for the current item. Hyper-parameter settings are the same as for D-Attn, except that the dropout rate is 0.2.
DAML DAML (Liu et al., 2019) forces the encoders of user and item reviews to interchange information in a fusion layer with local and mutual attention, so that the encoders can mutually guide representation generation. Hyper-parameter settings are the same as for MPCN.
AHN Asymmetrical Hierarchical Networks (Dong et al., 2020)16 guide user representation generation using item-side asymmetric attentive modules, so that only relevant targets are significant. Experiments are reproduced following the settings in the paper.

A.3.4 APRE: Additional Analyses on Hyper-parameter Sensitivity
Continuing Section 3.3, the search ranges and sensitivity of the feature dimensions (d_a, d_f, n_c), the CNN kernel size n_k, and the regularization weight λ are exhibited in Figure 7. We always set d_f = d_a = n_c for the consistency of internal feature dimensions. For (d_f, d_a, n_c) in Figure 7a, we choose values from [50, 100, 150, 200]. The epoch numbers are stable as well. Figure 7c demonstrates how λ affects the performance. As λ becomes larger, the "resistance" against loss minimization increases, so the number of training epochs increases. However, there are no clear trends of performance fluctuation, meaning that the sensitivity to the L2-reg weight is insignificant. Finally, we evaluate the effect of adding non-linearity to the embedding adaptation function (EAF) mentioned in Section 3.3, which transforms H_0 to H_1 by h^1_i = σ(W_ad^T h^0_i + b_ad). We try LeakyReLU, tanh, and identity functions for σ(·) and report the performances in Figure 7d. Without non-linear layers, APRE achieves the best results, whereas non-linearity speeds up the training.
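The EAF comparison can be sketched as follows, assuming a single affine map applied row-wise to the token embeddings; `adapt_embeddings` is a hypothetical name introduced here for illustration, covering only the three σ(·) choices compared in Figure 7d.

```python
import numpy as np

def adapt_embeddings(H0, W_ad, b_ad, sigma="identity"):
    """Embedding adaptation function h^1_i = sigma(W_ad^T h^0_i + b_ad),
    applied row-wise to H0 of shape (n_tokens, d_in).

    W_ad  : (d_in, d_out) adaptation weight
    b_ad  : (d_out,) bias
    sigma : one of {"identity", "tanh", "leaky_relu"}
    """
    H1 = H0 @ W_ad + b_ad
    if sigma == "tanh":
        return np.tanh(H1)
    if sigma == "leaky_relu":
        return np.where(H1 > 0, H1, 0.01 * H1)  # slope 0.01 on negatives
    return H1  # identity: no non-linearity
```

As reported above, the identity variant (no non-linearity) gives the best accuracy, while the non-linear variants mainly shorten training.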

A.3.5 Case Study II for Interpretation
Finally, we show another case study from the AM dataset using the same attention-property-score visualization schema as Section 4.4. In this case, our model predicts the score user u* will give to t*, a color-and-clarity compound for vehicle surfaces. The mentioned aspects of u* and the properties of t* are given in Table 11, including three overlapping aspects (quality, look, cleaning) and one unique aspect on each side (size for u* and smell for t*). A summarization table, Table 12, shows the summarized attentions and properties, the inferred impacts, and the corresponding score components of γ ⊙ F_ex([G_u; G_t]). In this case study, we observe the interesting phenomenon, also exemplified in Table 1 by the contrast between R1 and R3, that the aspect look, mentioned by u* and reviewed negatively as a property of t* ("strange yellow color"), only produces an inconsiderable negative effect (-0.002) on the final score prediction. This indicates that the imperfect look (or color) of the item, although mentioned by u* in his/her reviews, receives little attention from u* and thus poses only a tiny negative impact on the predicted rating. The other two overlapping aspects show intuitive correlations between their inferred impacts and the scores. The unique aspects, size and smell, have relatively small influences on the prediction because they are either not attended aspects or not mentioned properties.
It is also notable that some sentences carrying strong emotions may contain few explicit sentiment mentions, e.g., "But for an all in one cleaner and wax I think this outperforms most." This backs the design of APRE, which carefully takes implicit sentiment signals into consideration, and also calls for more advanced aspect-based sentiment modeling beyond the term level. Different proportions of such sentences across datasets may account for the inconsistency in which of the two ablation variants performs better.

Figure 3: Pipeline of APRE, including a user review encoder in the orange dashed box and an item review encoder in the top blue box, each containing an implicit channel (left) and an aspect-based explicit channel (right). The internal details of the item encoder are identical to those of the user encoder and hence omitted.
The attention weight α^{(a)}_{u,r} indicates the significance of each review's contribution to the overall understanding of u's attention to aspect a:
α^{(a)}_{u,r} = exp(tanh(w_ex^T [h^{(a)}_{u,r}; a^{(u)}])) / Σ_{r'∈R_u} exp(tanh(w_ex^T [h^{(a)}_{u,r'}; a^{(u)}])),
where [·; ·] denotes the concatenation of tensors and w_ex ∈ R^{d_f+d_a} is a trainable weight. With the usefulness distribution α^{(a)}_{u,r}, we aggregate the h^{(a)}_{u,r} of r ∈ R_u by weighted average pooling:
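This weighting and pooling can be sketched in a few lines of numpy; `attend_reviews` is a hypothetical name introduced here, and the sketch omits batching and parameter learning.

```python
import numpy as np

def attend_reviews(H, a, w_ex):
    """Aspect-aware attention pooling over a user's reviews.

    H    : (n_reviews, d_f)  per-review aspect representations h^(a)_{u,r}
    a    : (d_a,)            aspect embedding a^(u)
    w_ex : (d_f + d_a,)      trainable attention weight
    Returns the attention weights alpha and the pooled review vector.
    """
    n_reviews = H.shape[0]
    # Concatenate each review vector with the aspect embedding: [h; a]
    concat = np.hstack([H, np.tile(a, (n_reviews, 1))])
    scores = np.tanh(concat @ w_ex)                # tanh(w_ex^T [h; a])
    scores = scores - scores.max()                 # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()  # softmax over reviews
    pooled = alpha @ H                             # weighted average pooling
    return alpha, pooled
```

When all reviews look identical for an aspect, the weights collapse to a uniform distribution, i.e., plain average pooling.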

Figure 4: Sources of sentiment terms from AM.
Figure 5: Initial learning rate (Init. LR) vs. MSE and epoch number (Ep.).

Figure 6: Hyper-parameter searching and sensitivity of (d_a, d_f, n_c, and n_k) and the regularization weight λ of the L2-reg in J(Θ).

From reviews given by user u*. All aspects attended.
battery [To t1] After leaving this attached to my car for two days of non-use I have a dead battery. Never had a dead battery . . ., so I am blaming this device.
install [To t2] This was unbelievably easy to install. I have done . . . . The real key . . . the installation is so easy. [To t3] There were many installation options, but once . . ., they clicked on easily.
look [To t3] It was not perfect and not shiny, but it did look better. [To t4] It takes some elbow grease, but the results are remarkable.
material [To t5] The plastic however is very thin and the cap is pretty cheap. [To t6] Great value. . . . . They are very hard plastic, so they don't mark up panels.
smell [To t7] This has a terrible smell that really lingers awhile. It goes on green. . . .
From reviews received by item t*.
battery [From u1] The reason this won't work on an iPhone 4 or . . . because it uses low power Bluetooth, . . . .
install [From u2] Your mileage and gas mileage and cost of fuel is tabulated for each trip - Installation is pretty simple - but it . . . .
look [From u3] Driving habits, fuel efficiency, and engine health are nice features. The overall design is nice and easy to navigate.
price [From u4] In fact, there are similar products to this available at a much lower price that do work with . . .
sound [From u5] The Link device makes an audible sound when you go over 70 mpg, brake hard, or accelerate too fast. [From u6] Also, the beep the link device makes . . . sounds really cheapy.

Figure 7: Additional hyper-parameter sensitivity and searching of internal feature dimensions (Dims: d_f, d_a, and n_c), CNN kernel size (n_k), regularization weight of L2-reg, and the token embedding adaptation function. EAF is short for embedding adaptation function.

From reviews given by user u*.
quality [To t1] As soon as I poured it into the bucket and started getting ready, I can tell the product was already better quality than my previous washing liquid.
look [To t4] I bought [this item] because I had neglected my paint job for too long. . . . it made my black paint job look dull.
cleaning [To t2] . . . I was able to dry my car in record time and not have any water marks left on the paint. I just slide the towel over any parts with water and it left no trace of water and a clean shine to my car. [To t3] I had completely neglected these areas, except for minor cleaning and protection. Once I applied it, the difference was night and day!
size [To t6] The size was great as well, allowing me to get larger areas in an easier amount of time so that I could wash my car quicker than I have in the past.

From reviews received by item t*.
quality [From u1] Adding too little soap will increase the tendency . . . This thick, high quality soap helps prevent against that. [From u2] . . . Cons: A bit pricey, but quality matters, and this product absolutely has it. Worth every cent for sure!
look [From u3] I was a bit disappointed. It is a strange yellow color and it is thick and I personally did not care for the smell.
cleaning [From u4] As far as cleaning power it does fairly good, . . . The best cleaning of a car is in steps, but for an all in one cleaner and wax I think this outperforms most.
smell [From u5] Just giving some useful feedback about the truth behind the product . . . that it smells good. [From u6] I believe this preserves the wax layer longer . . . This is much thicker than the [some brand] soap, and has a very pleasant smell to it.
Comfortable. Very high quality sound. . . . Mic is good too. There is a switch to mute your mic. . . I wear glasses and these are comfortable with my glasses on. . . .

Table 3: Basics of compared baselines. Models' input is marked by "✓". "U" and "T" denote Users and iTems. D-CNN represents DeepCoNN. AHN-B denotes the variant of AHN with BERT embeddings.

Table 4: High frequency aspects of the corpora.

Table 5: MSE of baselines, our model (Ours for test and Val. for validation), and variants. The row of ∆ calculates the percentage improvements over the best baselines. All reported improvements over the best baselines are statistically significant with p-value < 0.01.

Table 6: Per-epoch run time of APRE on the seven datasets. The run times of AM and MI, denoted by "*", are disproportionate to their sizes since these datasets fit into GPU memory for acceleration.

Table 7: Examples of reviews given by u* and received by t*, with Aspect-Sentiment pair mentions as well as other sentiment evidence on seven example aspects.

Table 9: Statistics of unsupervised AS-pair extraction.

Table 10: Example sentiment terms from each part of the Venn diagram (Figure 4) on the AM dataset. We use P (PMI), N (Neural network), and L (Lexicon) to denote the sentiment term sets produced by the three methods, respectively. The operator \ denotes set minus, e.g., P ∩ L \ N refers to the set of terms that are in both P and L but not in N. All sets contain commonly-used sentimental adjectives that can modify automotive items. This figure explains why all three methods are necessary for term extraction in non-domain-specific use cases: each makes a unique contribution to the sentiment term set, enlarging coverage.
since the output dimension of the BERT encoder is 256. The best performance occurs at 200. The training time spent is stable across different values. The CNN kernel size n_k in Figure 7b varies in [4, 6, 8, 10]. We observe that larger kernel sizes may in turn hurt the performance, as the local features get fused with larger sequential contexts in natural language.
16 https://github.com/Moonet/AHN

Table 11: Examples of reviews from u* and to t*, with Aspect-Sentiment pair mentions as well as other sentiment evidence on five example aspects.

Table 12: Summaries of attentions and properties, inferred impacts, and the learned aspect-level contributions to the score prediction.