Beta Distribution Guided Aspect-aware Graph for Aspect Category Sentiment Analysis with Affective Knowledge

In this paper, we investigate the Aspect Category Sentiment Analysis (ACSA) task from a novel perspective by exploring a Beta Distribution guided aspect-aware graph construction based on external knowledge. That is, rather than laboriously searching the context for the sentiment clues of coarse-grained aspects, we focus on finding the context words that are highly related to the aspects and on determining their importance from a public knowledge base. In this way, the contextual sentiment clues for the aspects can be explicitly tracked in ACSA in the light of these aspect-related words. To be specific, we first regard each aspect as a pivot to derive aspect-aware words that are highly related to the aspect from external affective commonsense knowledge. Then, we employ Beta Distribution to educe the aspect-aware weight of each aspect-aware word, which reflects its importance to the aspect. Afterward, the aspect-aware words serve as substitutes for the coarse-grained aspect to construct graphs that leverage the aspect-related contextual sentiment dependencies in ACSA. Experiments on 6 benchmark datasets show that our approach significantly outperforms the state-of-the-art baseline methods.


Introduction
Aspect category sentiment analysis (ACSA) aims to detect the sentiment polarity for a coarse-grained aspect category from a given sentence. Different from the task of target-dependent or aspect term sentiment analysis, whose target or term explicitly occurs in the sentence, the aspect category in ACSA does not necessarily occur in the sentence. Here, the aspect category (hereinafter also referred to as aspect) generally consists of an entity E and an attribute A (i.e. E#A) or only an entity E. As shown in Figure 1, the sentence "This place is pricey, but the pizza is yummy." mentions two aspects: "RESTAURANT#PRICES" (negative) and "FOOD#QUALITY" (positive).
Many existing research efforts focus on ACSA with deep learning methods that attend to the significant information for the aspect category in sentiment prediction (Wang et al., 2016; Cheng et al., 2017; Liang et al., 2019a,b; Li et al., 2020a; Chen et al., 2020; Li et al., 2020b; Liang et al., 2020a). Despite the promising progress made by existing methods, they are generally preoccupied with how to search the context for the sentiment clues of coarse-grained aspects. However, making sense of the aspect-oriented sentiment words from the context purely based on the implicit aspects is a daunting task. This is mostly because 1) aspect categories generally do not manifest in the context, and 2) multiple aspects and sentiment polarities may be mentioned in the same context. By contrast, we can exploit the aspect-related words that explicitly occur in the sentence to model the contextual sentiment information for the aspect. As the example in Figure 1 shows, there are some aspect-related words (e.g. "place", "pricey", "pizza" and "yummy") in the sentence, allowing us to explicitly leverage the sentiment dependencies with these words for identifying the sentiment polarities of the aspects.
Motivated by this, we investigate the ACSA task from a novel perspective by proposing to construct aspect-aware graph(s) for the context with respect to the corresponding aspect. More concretely, we regard each distinct aspect as a distinct pivot and then search external knowledge for the aspect-related words, called aspect-aware words, which serve as substitutes for the coarse-grained aspect to construct a graph of the context for the specific aspect. That is, external knowledge is deployed as a bridge between the implicit aspect category and the context, so as to actively build connections between highly aspect-related context words and the specific aspect by means of graph construction. In addition, as many previous graph-based methods show (Yao et al., 2019; Qin et al., 2020; Liang et al., 2020b; Qin et al., 2021b,a; Zhang et al., 2021; Liang et al., 2021a), the weights of edges in a graph are important for graph information aggregation. Moreover, based on our empirical study (as shown in Figures 3 and 4), the contributions of aspect-aware words to the aspect are obviously different. For example, the aspect-aware word "place" is more important than "pizza" to the aspect entity "RESTAURANT". The main challenge of the idea therefore becomes how to determine the importance of aspect-aware words for the corresponding aspect, which can be leveraged as the weights of edges in a graph for learning the sentiment clues of the aspect.
In the light of the knowledge base, a word either connects or does not connect to an aspect via various routes, so the successful connection probability (corresponding to the weight of an edge in a graph) can naturally be regarded as a Binomial Distribution. We hence examine the weights of edges by modeling all the probabilities of successful connection based on the prior knowledge (routes and connection information) of the external knowledge by means of Beta Distribution (Gupta and Nadarajah, 2004), which is the conjugate prior distribution of the Binomial Distribution. In this way, all the probabilities of aspect-aware words connecting to the aspect can be investigated, so as to determine the optimum confidence probability (weight) of each aspect-aware word, called the aspect-aware weight. Subsequently, we construct aspect-aware graph(s) for each context with respect to the aspect based on the aspect-aware words paired with their weights.
Based on it, an aspect-aware graph convolutional network (AAGCN) structure is proposed to draw contextual sentiment dependencies to the aspect for ACSA. The main contributions of our work are summarized as follows:
(i) The ACSA task is approached from a novel perspective: learning how to find the aspect-aware words that are highly related to the aspect and educe their importance to the aspect, so as to construct a graph with these words for learning the contextual sentiment features in ACSA.
(ii) A novel scenario of modeling all the importance probabilities of aspect-aware words with Beta Distribution is deployed to educe the aspect-aware weights for constructing the knowledge-enhanced aspect-aware graph.
(iii) An aspect-aware graph convolutional network is proposed to draw contextual sentiment dependencies to the aspect for sentiment detection, and it achieves state-of-the-art performance.

Related Work
Previous studies on the ACSA task largely focus on directly extracting the contextual sentiment for coarse-grained aspect categories. Wang et al. (2016) proposed an attention-based LSTM model for selectively attending to the regions of the context representations. Xue and Li (2018) exploited a gated convolutional neural network to selectively extract aspect-specific sentiment information for sentiment prediction. Xing et al. (2019) explored an aspect-aware LSTM to incorporate aspect information into LSTM cells for ACSA. Among multitask learning methods, Li et al. (2020b) adopted the aspect category detection task to aggregate the sentiment for the aspect from the context. Chen et al. (2020) modeled document-level sentiment preference with cooperative graph attention networks for document-level ACSA. Cai et al. (2020) explored a hierarchical graph convolutional network to model the inner- and inter-relations for aspects in sentiment prediction.
In addition, to enhance the learning ability of the model, a series of studies incorporate external knowledge into the framework (Ma et al., 2018; Tian et al., 2020; Tong et al., 2020; Liang et al., 2021b). Among them, Tian et al. (2020) modeled sentiment information at the word, polarity, and aspect level in a pre-trained sentiment representation for sentiment analysis, based on automatically mined knowledge. Semantic and emotion lexicons have also been exploited as a bridge to enable knowledge transfer across different targets in cross-target stance detection. In targeted aspect-based sentiment analysis, Ma et al. (2018) exploited affective knowledge to extend the classic LSTM cell for simultaneously learning a target-specific attention and a global attention.

Methodology
In this section, we describe our proposed aspect-aware graph convolutional network (AAGCN) in detail. As illustrated in Figure 2, our proposed model consists of three primary components:
1) Aspect-aware words derivation, which generates a distinct series of affective words for the distinct aspect from external knowledge.
2) Aspect-aware graphs construction, which constructs aspect-aware graphs of the context based on aspect-aware words.
3) Aspect-aware sentiment learning, which extracts the aspect-related sentiment dependencies based on aspect-aware graphs and context representations.

Task Definition
Given a sentence s consisting of n words, s = {w_1, w_2, ···, w_n}, and the corresponding aspect a, which may not occur in {w_i | i = 1, 2, ..., n}, the goal of aspect category sentiment analysis is to detect the sentiment polarity (i.e. Positive, Negative, or Neutral) of the aspect from the context. Here, each aspect may consist of an entity E and an attribute A (i.e. E#A) or only an entity E.

Aspect-aware Words Derivation
To construct a contextual sentiment dependency graph for an aspect that does not occur in the sentence, we explore a novel scenario that regards the aspect as the pivot and derives the aspect-aware words by searching the external affective knowledge, within a certain number of hops, for the words that are highly associated with the aspect. To be specific, if words have direct relations with the aspect, then these words are the 1-hop aspect-aware words. Correspondingly, if words have relations with the 1-hop aspect-aware words, then those words are the 2-hop aspect-aware words, and so on. In addition, if an aspect consists of E#A, we seek the aspect-aware words for the entity E and the attribute A separately, since the roles of E and A are generally different in sentiment detection.
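The hop-based derivation described above can be sketched as a breadth-first search over the knowledge base. This is a minimal illustration, not the authors' implementation: the function name `derive_aspect_aware_words`, the adjacency-dictionary representation, and the toy relations are all assumptions made for the example.

```python
from collections import deque


def derive_aspect_aware_words(graph, aspect, max_hops):
    """Return {word: hop} for every word reachable from `aspect` within `max_hops`.

    `graph` is a hypothetical adjacency dict mapping a word to its related words
    in the external knowledge base.
    """
    hops = {aspect: 0}
    queue = deque([aspect])
    while queue:
        word = queue.popleft()
        if hops[word] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(word, ()):
            if neighbor not in hops:  # first visit gives the smallest hop count
                hops[neighbor] = hops[word] + 1
                queue.append(neighbor)
    del hops[aspect]  # the pivot itself is not an aspect-aware word
    return hops


# Toy relations, invented for illustration only.
toy_graph = {
    "food": ["pizza", "meal"],
    "pizza": ["food", "yummy"],
    "meal": ["food"],
    "yummy": ["pizza", "delicious"],
    "delicious": ["yummy"],
}
hops = derive_aspect_aware_words(toy_graph, "food", max_hops=2)
```

With a limit of two hops, "pizza" and "meal" are 1-hop words, "yummy" is a 2-hop word, and "delicious" (three hops away) is excluded. For an E#A aspect, the same search would be run once for E and once for A.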
In this scenario, intuitively, the main challenge is to determine the affective importance of each aspect-aware word with respect to the aspect. Overall, the hop number gives a rough indication of importance. However, as shown in Figure 3 (a) and (b), the yellow dot at 2 hops, which has only a single link, is more important than the green one, which simultaneously connects to many other irrelevant words. Since each word either connects to the aspect within κ hops or does not, there is a potential Beta Distribution for each aspect-aware word that reveals the distribution of its correlation degree to the aspect. Thus, based on the prior knowledge learned from the external knowledge, we employ Beta Distribution, which is generally adopted to model all success probabilities of an experiment, to educe the importance ρ(w_i) of each aspect-aware word from CDF(f(µ_i; α, β)), the cumulative distribution of f(µ_i; α, β). Here µ_i represents the unrelated probability of the aspect-aware word w_i towards the aspect, C^a_i is the neighbor count of w_i in the knowledge and C^s_i is the count of its aspect-aware neighbors. N_κ is the vocabulary size of κ-hop aspect-aware words and N is the vocabulary size of the whole corpus. γ_1 and γ_2 are the coefficients that control the influence of the unrelated neighbors and the hop number. That is, we consider the influence of both the unrelated neighbors and the hop number when deriving the aspect-aware weight, since, as the examples depicted in Figure 4 show, the aspect-aware word "yummy" is more important than "red" with respect to the aspect "food", although its hop number is greater. f(θ; α, β) denotes the Beta Distribution of all importance probabilities θ, which is defined as f(θ; α, β) = θ^{α−1} (1 − θ)^{β−1} / B(α, β), where B(·) is the Beta function for normalization.
Here α and β denote the parameters of the Beta Distribution towards the aspect, which are learned from the prior knowledge in the external knowledge. Based on it, we can derive a decent aspect-aware weight for each aspect-aware word. In addition, we set the aspect-aware weights of the aspect itself and of each irrelevant word to 1 and 0, respectively.
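To make the role of the cumulative distribution concrete, the sketch below computes a weight of the form ρ = 1 − CDF_Beta(µ; α, β). This form is an assumption, chosen by analogy with the Gamma variant ρ(w_i) = 1 − CDF gamma(µ_i; α, β) described in the experiment settings. The formulas for µ_i, α and β are not reproduced here, so they are passed in as plain numbers. The CDF is approximated with Simpson's rule using only the standard library, and the helper names are hypothetical.

```python
import math


def beta_pdf(theta, alpha, beta):
    """Beta density: theta^(alpha-1) * (1-theta)^(beta-1) / B(alpha, beta)."""
    log_b = math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)
    return theta ** (alpha - 1) * (1 - theta) ** (beta - 1) / math.exp(log_b)


def beta_cdf(mu, alpha, beta, steps=2000):
    """Approximate the Beta CDF at `mu` with Simpson's rule.

    Assumes alpha >= 1 and beta >= 1 (so the density is finite on [0, 1])
    and an even number of steps.
    """
    if mu <= 0.0:
        return 0.0
    if mu >= 1.0:
        return 1.0
    h = mu / steps
    total = beta_pdf(0.0, alpha, beta) + beta_pdf(mu, alpha, beta)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * beta_pdf(k * h, alpha, beta)
    return total * h / 3


def aspect_aware_weight(mu, alpha, beta):
    """Assumed form of the weight: a high unrelated probability gives a low weight."""
    return 1.0 - beta_cdf(mu, alpha, beta)


strong = aspect_aware_weight(0.2, 2, 2)  # word with a low unrelated probability
weak = aspect_aware_weight(0.8, 2, 2)    # word with a high unrelated probability
```

Under this assumed form, a word with a smaller unrelated probability µ receives a larger weight, which matches the intuition in the text that words with few irrelevant neighbors matter more.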

Aspect-aware Graph Construction
In this section, we describe the novel solution of constructing contextual dependency graphs with respect to the aspects, given that aspects may not occur in the sentence. Based on the aspect-aware words and their aspect-aware weights, we compute the edge weight A_{i,j} of each word pair of the aspect-aware graph. Inspired by many previous graph-based studies (Huang and Carley, 2019; Liang et al., 2020b), we also employ the dependency tree of the sentence to better capture the syntactical relations. That is, we add 1 to the edge weight A_{i,j} if w_i and w_j are linked in the dependency tree of the sentence. Then we construct an undirected graph to enrich the affective and dependency information: A_{i,j} = A_{j,i}, and we also set a self-loop for each word: A_{i,i} = 1.
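The construction steps can be sketched as follows. The dependency bonus, the symmetry, and the self-loops follow the description above. The base edge weight, taken here as the sum ρ(w_i) + ρ(w_j), is an assumption made for illustration, as are the function name and the toy inputs.

```python
def build_aspect_aware_graph(weights, dependency_edges):
    """Build a symmetric adjacency matrix for one sentence and one aspect.

    weights          -- aspect-aware weight of each token (0 for irrelevant words)
    dependency_edges -- set of (i, j) token index pairs linked in the dependency tree
    """
    n = len(weights)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            edge = weights[i] + weights[j]  # assumed base weight, not from the source
            if (i, j) in dependency_edges or (j, i) in dependency_edges:
                edge += 1.0  # dependency bonus, as described in the text
            adj[i][j] = adj[j][i] = edge  # undirected graph
    for i in range(n):
        adj[i][i] = 1.0  # self-loop for each word
    return adj


# Toy sentence "the pizza is yummy" with invented weights and dependency links.
tokens = ["the", "pizza", "is", "yummy"]
weights = [0.0, 0.5, 0.0, 0.25]
dependencies = {(0, 1), (1, 3), (2, 3)}
adj = build_aspect_aware_graph(weights, dependencies)
```

For an E#A aspect, one such matrix would be built from the entity weights and another from the attribute weights.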

Aspect-aware Sentiment Learning
For each sentence, we first retrieve the embedding of each word in the sentence from the embedding lookup table V ∈ R^{m×N}, where m is the dimension of the embedding. Thus for a sentence with n words, we can get the corresponding embedding matrix X = [x_1, x_2, ···, x_n], where x_i ∈ R^m is the word embedding of w_i, which is fine-tuned during the training process. Afterward, the embedding matrix X is fed as input into the bidirectional LSTM (Bi-LSTM) layers to derive the hidden contextual representations H of the sentence, where h_t ∈ R^{2m} represents the hidden representation at time step t derived by the Bi-LSTM layers. Based on it, we feed the aspect-aware graph(s) of the sentence and the hidden contextual representations H into the aspect-aware GCN to draw contextual sentiment dependencies to the aspect. For the aspect that consists of E#A, we employ a novel interactive GCN block to capture the potential interaction between entity and attribute. Each node in the l-th GCN block is updated according to the hidden representations of its neighborhoods in the adjacency matrices of the entity and attribute graphs. In this update, g^{l−1} is the hidden representation evolved from the preceding GCN block and Ã is a normalized symmetric adjacency matrix, where E_i = Σ_{j=1}^{n} A_{i,j} is the degree of A_i. Here, the original input nodes of the first GCN block are retrieved from the hidden representations learned by the Bi-LSTM layers, i.e. g^0 = H. In addition, for the aspect that only consists of E, the aspect-aware GCN updates with Eq. (8). Then we adopt a retrieval-based attention mechanism to capture the significant contextual aspect-related sentiment clues. Hence, the final representation of the aspect-aware sentiment features is formulated and fed to softmax(·), the softmax function, to obtain the output distribution.
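As a rough illustration of a single-graph GCN block, the sketch below normalizes the adjacency matrix by node degrees and applies one propagation step. The symmetric normalization Ã_{i,j} = A_{i,j} / sqrt(E_i E_j), the ReLU activation, and the helper names are assumptions for this example. The interactive entity/attribute block and the retrieval-based attention are not reproduced.

```python
import math


def normalize_adjacency(adj):
    """Symmetric degree normalization (assumed form): A[i][j] / sqrt(E_i * E_j)."""
    degrees = [sum(row) for row in adj]  # E_i = sum_j A[i][j], positive via self-loops
    n = len(adj)
    return [[adj[i][j] / math.sqrt(degrees[i] * degrees[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(norm_adj, hidden, weight, bias):
    """One GCN block: ReLU(norm_adj @ hidden @ weight + bias)."""
    aggregated = matmul(norm_adj, hidden)   # gather neighbor representations
    projected = matmul(aggregated, weight)  # learned linear projection
    return [[max(0.0, value + b) for value, b in zip(row, bias)] for row in projected]


# Two-node toy graph with self-loops; identity weight matrix for readability.
adj = [[1.0, 1.0], [1.0, 1.0]]
norm_adj = normalize_adjacency(adj)
g0 = [[1.0, 0.0], [0.0, 1.0]]  # stands in for the Bi-LSTM output H
g1 = gcn_layer(norm_adj, g0, weight=[[1.0, 0.0], [0.0, 1.0]], bias=[0.0, -1.0])
```

Stacking two such blocks would correspond to the two GCN blocks used in the experiments, with the output of one block fed as `hidden` to the next.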

Model Training
The objective of our task is to train the classifiers by minimizing the cross-entropy loss between the predicted and ground-truth distributions, together with an L_2 regularization term. Here S is the training size, C is the number of classes, ŷ is the ground-truth distribution of sentiment, λ is the weight of the L_2 regularization term, and Θ denotes all trainable parameters.
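A minimal sketch of such an objective is given below, assuming the loss is the summed cross-entropy over S samples and C classes plus λ times the squared L_2 norm of the parameters. Whether the original objective sums or averages over samples is not stated here, so the summation is an assumption, as is the function name.

```python
import math


def regularized_cross_entropy(predicted, gold, params, lam):
    """Cross-entropy between gold and predicted distributions plus an L2 penalty.

    predicted -- S rows of C predicted class probabilities
    gold      -- S rows of C ground-truth probabilities (one-hot in the usual case)
    params    -- flat list standing in for all trainable parameters (Theta)
    lam       -- weight of the L2 regularization term (lambda)
    """
    cross_entropy = -sum(
        g * math.log(p)
        for pred_row, gold_row in zip(predicted, gold)
        for p, g in zip(pred_row, gold_row)
        if g > 0  # skip zero-probability gold classes to avoid log(0) terms
    )
    l2_penalty = sum(theta * theta for theta in params)
    return cross_entropy + lam * l2_penalty


# One sample, three classes (e.g. positive, negative, neutral); toy parameters.
loss = regularized_cross_entropy(
    predicted=[[0.5, 0.25, 0.25]],
    gold=[[1.0, 0.0, 0.0]],
    params=[1.0, 2.0],
    lam=0.1,
)
```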

Dataset and Experiment Setting
We conduct experiments on 6 benchmark datasets, among them data from (Jiang et al., 2019), to verify the effectiveness of our proposed model. Each sample consists of the sentence, aspect, and the sentiment polarity towards the aspect. The statistics of the datasets are shown in Table 1. Following (Cai et al., 2020), for the datasets without development sets, we randomly select 10% of the training set as the development data to tune the hyper-parameters. For non-BERT models, we use GloVe (Pennington et al., 2014) to initialize each word as a 300-dimensional embedding. The hidden vector dimension is 300. The number of GCN blocks is 2. The coefficients γ_1 and γ_2 are 0.4 and 0.6 and λ is 0.00001, which are the optimal hyper-parameters in the pilot studies. Adam is utilized as the optimizer with a learning rate of 0.001 and a mini-batch size of 16. We apply a dropout of 0.3 after the embedding layer. For BERT-based models, we use the pre-trained uncased BERT-base (Devlin et al., 2019) with 768-dimensional embedding, and the learning rate is 0.00002. SenticNet (Cambria et al., 2020), which contains affective commonsense relations between words, is employed to derive aspect-aware words in this work. We set the max hop number to 5. The reported results are averaged scores of 10 runs to obtain statistically stable results.
Table note: baseline results are retrieved from (Yin et al., 2020), (Li et al., 2020b) (marked †), (Xing et al., 2019) (marked ‡), (Cai et al., 2020), and (Jiang et al., 2019) (marked §).
(2) To show the generalizability of our method, another external knowledge base, ConceptNet (Speer et al., 2017), which contains concept relations between words, is employed to produce aspect-aware words. Two comparison models are derived from it, i.e. AAGCN-c and AAGCN-BERT-c.
(3) To evaluate the significance of the Distribution exploited in our proposed method, we design two variants of our model without Distribution, "AAGCN-one" and "AAGCN-hop", whose aspect-aware weights are respectively computed as ρ(w_i) = 1 and ρ(w_i) = 1/κ_i, where κ_i is the hop number of w_i.
(4) To demonstrate the effectiveness of Beta Distribution for determining aspect-aware weights, we also pair the proposed AAGCN with three other related Distributions, including Binomial Distribution (AAGCN-BD), whose aspect-aware weight is defined as ρ( r! e^{−µ_i}, and Gamma Distribution (AAGCN-GD), whose aspect-aware weight is defined as ρ(w_i) = 1 − CDF_gamma(µ_i; α, β).
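The two distribution-free variants in (3) are simple enough to state directly in code. The sketch below applies them to hop counts of the kind produced by a hop-limited search. The function names and the toy hop counts are illustrative assumptions.

```python
def weight_one(hop):
    """AAGCN-one: every aspect-aware word gets weight 1, regardless of its hop."""
    return 1.0


def weight_hop(hop):
    """AAGCN-hop: the weight decays as the reciprocal of the hop number."""
    return 1.0 / hop


# Invented hop counts for three aspect-aware words of the aspect "food".
toy_hops = {"pizza": 1, "yummy": 2, "tasty": 4}
weights_one = {word: weight_one(k) for word, k in toy_hops.items()}
weights_hop = {word: weight_hop(k) for word, k in toy_hops.items()}
```

Neither variant can rank a 2-hop word such as "yummy" above a 1-hop word, which is the case the Beta-based weight is designed to handle.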
We also set up several variants of our proposed AAGCN to analyze the impact of different components in the ablation study. "w/o ρ+D" denotes constructing a fully connected graph for each sentence, that is, each word pair is connected by an edge. "w/o ρ" denotes removing the aspect-aware words and "w/o D" denotes removing the dependency tree.

Experiment Results
As shown in Table 2, the experimental results on 6 datasets demonstrate that our proposed model performs consistently better than the comparison models for both non-BERT and BERT-based models and for both E#A and E aspects. This verifies the effectiveness of our proposed model in ACSA.
Compared with models that do not employ Distributions to derive aspect-aware weights, the performance is overall improved with any distribution. This indicates that exploring Distributions to model the successful connection probability between words and the corresponding aspect is better suited to deriving valuable aspect-aware weights from external knowledge. In addition, the results produced by different distributions show that our proposed AAGCN, which explores Beta Distribution to determine aspect-aware weights, outperforms the variants based on the related distributions. This implies that deploying Beta Distribution to model all the probabilities of successful connection for aspect-aware words, based on the prior knowledge learned from external knowledge, derives sounder aspect-aware weights and leads to improved ACSA performance.
For different external knowledge scenarios, both AAGCN and AAGCN-c perform overall better than the baselines, which demonstrates the generalizability of our proposed method in deriving aspect-aware words. In addition, compared with models based on ConceptNet, models with SenticNet perform considerably better under both non-BERT and BERT-based conditions. This indicates that SenticNet, which contains affective relations, helps the model leverage sentiment information and achieve better performance in ACSA.

Ablation Study
To investigate the impact that different components of our proposed model have on the performance, we conduct an ablation study and report the results in Table 3. Note that both the fully connected graph and the removal of the aspect-aware words reduce the performance seriously. This verifies the significance and effectiveness of recognizing aspect-aware words from the context for constructing graphs in the ACSA task. Additionally, the model without the dependency tree performs slightly worse, which implies that incorporating syntactical relations into the graph can further improve ACSA performance.

Impact of Hop Numbers
To investigate the impact of different hop numbers when deriving aspect-aware words from external knowledge, we vary them from 1 to 8 and report the results in Figure 5. Note that as the hop number increases from 1 to 5, the performance improves steadily on all datasets, while the curves fluctuate erratically when the hop number is greater than 5. This implies that the significant learning advantage brought by the aspect-aware graph relies on an appropriate amount of aspect-aware words, while excessively extending the hop number for searching aspect-aware words may bring noise. Thus we set the hop number to 5 in our model.

Impact of GCN Blocks
To analyze the impact of the layer number of GCN blocks on the performance of our proposed model, we conduct experiments by varying the layers from 1 to 6 and show the results in Figure 6. Note that 2-layer GCN blocks perform better overall, thus we set the layer number of GCN blocks to 2 in our experiments. Comparatively, a 1-layer GCN block performs unsatisfactorily, which potentially indicates that a 1-layer GCN block is insufficient to leverage precise aspect-related sentiment information from the context. In addition, the performance fluctuates as the layer number of GCN blocks increases and evidently tends to decline when the layer number is greater than 4. This implies that simply increasing the depth of GCN blocks is liable to weaken the learning ability of the model due to the sharp increase in model parameters.
Figure 7: Covering rate of aspect-aware words (a). Examples of aspect-aware words distribution (b); white spaces denote aspect-aware words in the context.

Analysis of Aspect-aware Words
To investigate the appearance of aspect-aware words in the sentence, we report the covering rates of aspect-aware words on different datasets in Figure 7 (a). Note that the covering rate of aspect-aware words in all datasets exceeds 95%. That is, more than 95% of the sentences contain aspect-aware words. This validates the hypothesis that aspect-related words generally serve as sentiment descriptions of the corresponding aspect in the sentence, and supports the soundness and significance of our proposed method in the ACSA task. Further, we randomly select 50 sentences from the REST15 dataset and show the distribution of aspect-aware words in Figure 7 (b). Note that almost all the sentences contain an appropriate amount of aspect-aware words. This indicates that aspect-aware words generally occur as key clues in the sentences. We show some typical aspect-aware words paired with their weights derived for the aspect word "food" in Figure 7 (c). Note that 1) the words that are highly associated with "food" have large weights (the red examples), 2) the common sentiment words have average weights (the green examples), and 3) the irrelevant words have small weights (the blue examples). This qualitatively verifies that our proposed method of deploying Beta Distribution to derive aspect-aware weights is effective in ACSA.
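The covering rate discussed above can be computed as the fraction of sentences that contain at least one aspect-aware word. The sketch below is a minimal illustration with invented sentences and an invented vocabulary. The function name `covering_rate` is hypothetical.

```python
def covering_rate(sentences, aspect_aware_vocab):
    """Fraction of tokenized sentences containing at least one aspect-aware word."""
    covered = sum(
        1 for tokens in sentences
        if any(token in aspect_aware_vocab for token in tokens)
    )
    return covered / len(sentences)


# Toy data, invented for illustration only.
toy_sentences = [
    ["the", "pizza", "is", "yummy"],
    ["this", "place", "is", "pricey"],
    ["we", "left", "early"],
]
toy_vocab = {"pizza", "yummy", "place", "pricey"}
rate = covering_rate(toy_sentences, toy_vocab)
```

Here two of the three toy sentences contain an aspect-aware word, so the rate is two thirds.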

Case Study
To qualitatively demonstrate how contextual aspect-aware words work in the ACSA task, we visualize the aspect-aware weights in Figure 8. Although the aspect (both E and A) of Example (a) does not appear in the sentence, the sentiment clue of the aspect can be easily learned with the help of the aspect-aware words. Examples (b) and (c) are two instances containing multiple aspects: the entity "food" occurs in the sentence of Example (b), while none of the aspects occurs in the sentence of Example (c). Note that the significant contextual words with respect to each distinct aspect can be extracted and distinguished for learning aspect-related sentiment expressions with the help of aspect-aware words.

Conclusion
In this paper, we investigate the Aspect Category Sentiment Analysis (ACSA) task from a novel perspective: learning how to find the aspect-aware words that are highly related to the aspects, and educing their weights with Beta Distribution based on public knowledge. The aspect-aware words paired with their weights are deployed to construct aspect-aware graph(s) of the context for learning the contextual sentiment dependencies in ACSA with a graph convolutional structure. Experimental results on 6 benchmark datasets demonstrate the effectiveness of our proposed method.