DNN-driven Gradual Machine Learning for Aspect-term Sentiment Analysis

Recent work has shown that Aspect-Term Sentiment Analysis (ATSA) can be performed by Gradual Machine Learning (GML), which begins with some automatically labeled easy instances, and then gradually labels more challenging instances by iterative factor graph inference without manual intervention. As a non-i.i.d learning paradigm, GML leverages shared features between labeled and unlabeled instances for knowledge conveyance. However, the existing GML solution extracts sentiment features based on pre-specified lexicons, which are usually inaccurate and incomplete and thus lead to inadequate knowledge conveyance. In this paper, we propose a Deep Neural Network (DNN) driven GML approach for ATSA, which exploits the power of DNN in feature representation for gradual learning. It first uses an unsupervised neural network to cluster the automatically extracted features by their sentiment orientation. Then, it models the clustered features as factors to enable implicit knowledge conveyance for gradual inference in a factor graph. To leverage labeled training data, we also present a hybrid solution that fulfills gradual learning by fusing the influence of supervised DNN predictions and implicit knowledge conveyance in a unified factor graph. Finally, we empirically evaluate the performance of the proposed approach on real benchmark data. Our extensive experiments have shown that the proposed approach consistently achieves the state-of-the-art performance across all the test datasets in both unsupervised and supervised settings.


Introduction
Aspect-Term Sentiment Analysis (ATSA) aims at inferring the sentiment polarity towards a particular aspect in a sentence (Hu and Liu, 2004; Pontiki et al., 2016). ATSA is important for many applications (e.g., e-commerce and social media), where the sentimental opinions in reviews can be leveraged to create value for businesses and customers.

Table 1: The running example.
r_i | s_ij  | text
r 1 | s 1.1 | service was awful mostly because staff were overwhelmed.
    | s 1.2 | The staff should be a bit more friendly.
r 2 | s 2.1 | We ordered lamb which was perfectly cooked and tasted awesome.
    | s 2.2 | The food was well-prepared and presented.

In ATSA, an aspect-term, also called a target, is explicitly mentioned in a review. For instance, consider the running example shown in Table 1: r 1 evaluates the restaurant through two explicit aspects, service and staff. The goal of ATSA is then to detect the respective sentiment polarities expressed towards these two aspects.
Up to now, the state-of-the-art solutions for ATSA have been built upon various DNN models. The earlier solutions were usually equipped with an attention mechanism (Tang et al., 2016b; Wang et al., 2016; Tang et al., 2016a; Ma et al., 2017; Chen et al., 2017; Wang et al., 2018; Tang et al., 2019). They mostly attempted to learn aspect-related semantic representations of an input sentence. Recently, ATSA has experienced a considerable shift towards pre-trained language models (Tang et al., 2019; Karimi et al., 2020). Despite the effectiveness of these approaches, their efficacy heavily relies on large quantities of accurately labeled data, which require intensive human labor.
To alleviate the burden of manual labeling, a solution based on the paradigm of Gradual Machine Learning (GML) has recently been proposed for ATSA (Wang et al., 2021). First proposed for entity resolution in (Hou et al., 2020), GML can enable effective machine labeling without the requirement for manual intervention. Given a classification task, GML begins with some easy instances, which can usually be automatically labeled by the machine with high accuracy, and then gradually reasons about the labels of its more challenging instances by factor graph inference. As a non-i.i.d (Independent and Identically Distributed) learning paradigm, GML leverages shared features between labeled and unlabeled instances for knowledge conveyance. However, the existing GML solution for ATSA relies on pre-specified lexicons to extract sentiment features. Its limitation is twofold: 1) sentiment lexicons may be inaccurate and incomplete; 2) a shared feature must explicitly appear in both instances, so explicit features cannot capture the implicit similarity between instances, which leads to inadequate knowledge conveyance. Consider the running example in Table 1. Unfortunately, the word well-prepared is not included in most of the existing lexicons. It can also be observed that the instances s 2.1 and s 2.2 do not share any explicit feature, while perfectly cooked and well-prepared have very similar meanings and can thus serve as an implicit common feature.
Recently, DNN models have been proven to be very powerful in feature representation for many NLP tasks, where features with the same semantic context are mapped to close points in the latent space (Devlin et al., 2018). For instance, the words "cooked" and "well-prepared" are usually represented by two points close to each other because they are semantically very close. Unfortunately, the existing embedding models are designed to map features semantically, regardless of their sentiment orientation. Therefore, they may map two features with opposite polarities (e.g., "good" and "bad") to two close points in the embedding space, which makes them challenging to apply directly to feature extraction for ATSA.
In this paper, we propose a novel DNN-driven GML approach for ATSA. It essentially exploits DNN to sentimentally map the features of aspect-terms into different polarity indicators, and models them as shared factors in a factor inference graph to enable implicit knowledge conveyance. To this end, we first combine the sentiment lexicon and dependency parser-based relations, which are readily available, to generate aspect-opinion words. Secondly, we use an unsupervised neural network to filter the aspect-irrelevant and unsentimental words from an input sentence. Finally, the resulting weighted sentences, which can be considered to be purely sentimental, are used to learn polarity indicators. The model is trained to reconstruct each weighted sentence as a linear combination of polarity indicators. To leverage labeled training data, we also present a hybrid GML solution that fulfills gradual learning by fusing the influence of supervised DNN predictions and implicit knowledge conveyance in a unified factor graph.
Our main contributions can be summarized as follows: 1. We propose a DNN-driven GML approach for ATSA, which can effectively exploit the power of DNN in feature representation for GML; 2. We present an unsupervised attention-based neural network to cluster the features of aspect-terms by their sentimental orientation; 3. We present a hybrid GML solution for ATSA, which fulfills gradual learning by fusing the influence of supervised DNN predictions and implicit knowledge conveyance in a unified factor graph.
4. We empirically validate the efficacy of the proposed approach on benchmark data. Our extensive experiments have shown that the proposed approach consistently achieves the state-of-the-art performance across all the test datasets in both unsupervised and supervised settings and the improvement margins are considerable.

Related work
Aspect-Term Sentiment Analysis is a sub-task of aspect-based sentiment analysis, which aims to detect the sentiment polarity in response to a particular aspect in a sentence (Hu and Liu, 2004; Pontiki et al., 2016). Traditional machine learning techniques (Kiritchenko et al., 2014; Castellucci et al., 2014) proposed to learn SVM classifiers based on different features such as n-grams, non-contiguous n-grams and lexicon features. In comparison, the DNN-based models equipped with an attention mechanism have been shown to be more effective on ATSA (Tang et al., 2016b; Wang et al., 2016; Tang et al., 2016a). Following this trend, researchers have resorted to more sophisticated attention mechanisms to refine neural ATSA models (Ma et al., 2017; Chen et al., 2017; Wang et al., 2018; Tang et al., 2019). To improve performance, they essentially attempted to explicitly capture the importance of each context word by learning aspect-related representations of an input sentence. SenHint proposed to integrate DNN predictions and linguistic hints in a joint framework. Recently, ATSA has experienced a considerable shift towards pre-trained language models (Tang et al., 2019; Karimi et al., 2020). Unfortunately, the efficacy of these models heavily relies on labeled training data, which may not be readily available in real scenarios. From the unsupervised perspective, earlier solutions (Alvarez-López et al., 2016; Hutto and Gilbert, 2014) proposed to detect the polarities of aspect-terms based on lexicon rules. The authors of (Schouten et al., 2017) proposed a mechanism of spread activation for aspect-based polarity detection. More recently, the authors of (Wang et al., 2021) proposed an unsupervised GML-based solution for ATSA. However, the existing GML solution extracts features based on sentiment lexicons, which may be neither accurate nor complete and thus lead to inadequate knowledge conveyance.
The idea of mapping features into different clusters has been investigated for different purposes. The authors of (Iyyer et al., 2016) proposed to learn a set of descriptors representing how the fictional relationship between two characters changes over time, and (He et al., 2017) proposed to learn a set of aspect representatives from corpora. Unfortunately, neither of them investigated how to cluster implicit features by their polarity orientation.

Task Definition
We formulate the task of aspect-term sentiment analysis as follows:

Definition 3.1 [Aspect-Term Sentiment Analysis]. Let x = (r, s, t) be a target unit, where r denotes a review, s a sentence in the review, and t an aspect-term associated with the sentence. Given a set of target units, X, the goal of ATSA is to infer the sentiment polarity of each target unit in X.

GML Paradigm Overview
Our solution is built upon the non-i.i.d learning paradigm of GML (Hou et al., 2020). As shown in Figure 1, GML consists of the following three steps:

Easy Instance Labeling.
Given a classification task, it is usually very challenging to accurately label all the instances in the task without good-coverage training examples. However, the work becomes much easier if we only need to automatically label some easy instances in the task. In real scenarios, easy instance labeling can be performed based on simple user-specified rules or existing unsupervised learning techniques. GML begins with the observations provided by the labels of easy instances. Therefore, high accuracy of automatic machine labeling on easy instances is critical for its ultimate performance on a given task.
For ATSA, this paper uses the unsupervised algorithm of spread activation (Schouten et al., 2017) to label easy instances. An instance is considered easy if the score of its resulting dominant label meets a pre-specified threshold.
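As a minimal sketch of this selection rule (function and data names are our own; the actual dominant-label scoring comes from the spread activation algorithm of Schouten et al., 2017), easy instances can be filtered by thresholding the confidence of the dominant label:

```python
# Hypothetical sketch of easy-instance selection: keep only the instances
# whose dominant label from an unsupervised labeler (e.g., spread
# activation) reaches the pre-specified easiness threshold.
def select_easy_instances(scores, threshold=0.7):
    """scores maps instance id -> (dominant label, confidence)."""
    return {i: lab for i, (lab, conf) in scores.items() if conf >= threshold}
```

With the easiness threshold of 0.7 used later in the experiments, only confidently labeled instances survive as evidence for gradual inference.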

Feature Extraction and Influence Modeling.
Features serve as the medium for knowledge conveyance. This step extracts the common features shared by labeled and unlabeled instances. To facilitate effective knowledge conveyance, it is desirable that a wide variety of features are extracted to capture as much information as possible. For each extracted feature, this step also needs to model its influence over the labels of its relevant instances. For ATSA, we extract two types of features: sentimental features and relational features. A relational feature, which has been well studied in (Wang et al., 2021), represents an explicit sentimental connection between sentences within the same review. In the running example, due to the absence of any shift word between s 1.1 and s 1.2, their polarities can be supposed to be similar. In this paper, we focus on how to enable implicit knowledge conveyance by leveraging DNN for the automatic extraction of sentimental features.

Gradual Inference.
This step gradually labels the instances with increasing hardness in a task. Since the scenario of gradual learning does not satisfy the i.i.d assumption, it is fulfilled from the perspective of evidential certainty. Gradual learning is conducted over a factor graph, which consists of the labeled and unlabeled instances and their common features, by iterative inference. At each iteration, it chooses to label the unlabeled instance with the highest degree of evidential certainty. The iteration is repeatedly invoked until all the instances in a task are labeled.
Given a factor graph, G, GML defines the probability distribution over its variables V as follows:

\[ P_w(V) = \frac{1}{Z} \prod_{v \in V} \prod_{f \in F_v} \phi_f(v) \prod_{f \in F} \prod_{(v_i, v_j) \in f} \phi_f(v_i, v_j), \tag{1} \]

where F_v denotes the set of sentimental features associated with the variable v, F denotes the set of relational features, φ_f(v) denotes the factor associated with v and f, φ_f(v_i, v_j) denotes the factor associated with the relational feature f, and Z is a partition function, i.e., a normalization constant. To effectively learn the factor weights without access to the true labels of the unlabeled variables, V_I, GML minimizes the negative log marginal likelihood given the observed labels of the labeled variables, Λ, as follows:

\[ \hat{w} = \arg\min_{w} \; -\log \sum_{V_I} P_w(\Lambda, V_I). \tag{2} \]

A scalable approach for gradual inference on ATSA has been presented in (Wang et al., 2021). First, the unlabeled variables are sorted according to their evidential support. Then, the top-m unlabeled variables are considered as the candidates for probability inference. To reduce the invocation frequency of factor graph inference, an efficient algorithm is used to approximate entropy estimation on the m candidates and select the top-k most promising variables for factor graph inference. Finally, the probabilities of the selected k variables are inferred in the subgraphs of G. Since the inference process of the DNN-driven GML is very similar to what was presented in (Wang et al., 2021), its technical details are omitted here due to the space limit.
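The iteration above can be sketched as follows. This is a deliberately minimal illustration with our own function names: the real solution scores only the top-m candidates by evidential support and infers marginals over factor subgraphs, whereas here an oracle callable supplies the marginal for each instance.

```python
import math

def entropy(p):
    """Binary entropy of a marginal probability; lower means more certain."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def gradual_inference(unlabeled, infer_marginal):
    """Toy gradual-inference loop (our own simplification): at each step,
    infer marginals for the remaining instances and commit the label of
    the one with the lowest entropy, i.e., the highest certainty."""
    labels = {}
    remaining = set(unlabeled)
    while remaining:
        probs = {i: infer_marginal(i) for i in remaining}
        best = min(remaining, key=lambda i: entropy(probs[i]))
        labels[best] = 1 if probs[best] >= 0.5 else 0
        remaining.remove(best)
    return labels
```

The key design point is that the most certain instance is committed first, so its label then serves as additional evidence for the harder instances that follow.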

DNN-driven GML
In this section, we first present an unsupervised neural network to extract implicit sentimental features. Then, we describe the unsupervised DNN-driven GML that integrates implicit features into the process of gradual inference. Finally, we describe the hybrid GML solution that fuses the influence of DNN predictions and implicit features for gradual learning.

Implicit Feature Extraction by DNN
The purpose of implicit feature extraction is to learn a set of polarity indicator embeddings I ∈ R^{k×d}, where k is the number of indicators, which can be leveraged to capture similar features shared across instances. Each indicator represents a set of features that often occur in contexts with the same polarity.
Specifically, for each input sentence s and its aspect term t, we first generate a set of aspect-opinion words, denoted by A_s. Then, we use A_s to construct a weighted vector v_s ∈ R^d that can be read as the sentimental representation of the input sentence given the target t. To this end, we propose an attention-based unsupervised neural network that filters the sentence by down-weighting aspect-irrelevant and unsentimental information. The model is trained by reconstructing v_s as a linear combination of indicator embeddings from I. The architecture of the proposed DNN is presented in Figure 2.

The Input
The input to our model is a pair consisting of a sentence s and its aspect-opinion words A_s. We use a dependency-based parse tree to generate aspect-opinion words (modifiers) (Hu and Liu, 2004), and then leverage the adjective words and those detected by the lexicon to construct A_s. Considering the running example in Table 1, the sentiment words of s 1.1 are awful and overwhelmed. Suppose that we have a feature embedding matrix L ∈ R^{c×d}, where c is the vocabulary size and d is the embedding dimension. Each word is then associated with a real-valued embedding v_i ∈ R^d from L representing its feature vector (Mikolov et al., 2013):

\[ s = [v_1; v_2; \ldots; v_n], \tag{3} \]
\[ A_s = [v_{a_1}; v_{a_2}; \ldots; v_{a_m}], \tag{4} \]

where s ∈ R^{n×d} and n is the sentence length, while A_s ∈ R^{m×d} and m is the number of aspect-opinion words.
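The embedding lookup can be sketched as follows; the toy vocabulary, dimensions, and random embeddings are our own assumptions standing in for a trained matrix L.

```python
import numpy as np

# Sketch: look up embedding rows for a sentence and its aspect-opinion
# words from a shared feature embedding matrix L in R^{c x d}.
rng = np.random.default_rng(0)
vocab = {"service": 0, "was": 1, "awful": 2, "staff": 3, "overwhelmed": 4}
c, d = len(vocab), 8
L = rng.normal(size=(c, d))  # stands in for a trained embedding matrix

sentence = ["service", "was", "awful"]
opinion_words = ["awful"]

s = np.stack([L[vocab[w]] for w in sentence])         # s in R^{n x d}
A_s = np.stack([L[vocab[w]] for w in opinion_words])  # A_s in R^{m x d}
```

Both matrices draw their rows from the same L, so a word appearing in the sentence and in A_s gets an identical vector in both.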

Attention-based Sentimental Representation
For each input sentence s, we construct a weighted vector v_s to capture the sentimental information in response to the aspect t. To this end, we apply two attention mechanisms to filter away the irrelevant information. The first one attempts to down-weight non-sentimental words, while the second one is a self-attention that attends to aspect-relevant information (He et al., 2017). Specifically, the first attention layer takes both the sentence s and its opinion words A_s as input. Conceptually, we first compute the global sentiment vector v_a by averaging the word embeddings of A_s:

\[ v_a = \frac{1}{m} \sum_{i=1}^{m} v_{a_i}, \tag{5} \]

and then use it to weight each word embedding v_{s_i} in s as follows:

\[ d_i = v_{s_i} \cdot (U v_a), \tag{6} \]

where the symbol · stands for the element-wise dot product, and U ∈ R^{d×d} is a transformation matrix (learned during training) between the global sentiment vector v_a and the input sentence s. Next, we apply a softmax layer to yield a non-negative weight for each word in s as follows:

\[ \alpha_i = \frac{\exp(d_i)}{\sum_{j=1}^{n} \exp(d_j)}, \tag{7} \]

where the value of α_i can be read as the probability of the i-th word in the sentence s being a sentiment word.
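A minimal sketch of this first attention layer (function name and toy inputs are ours; in the model, U would be learned during training):

```python
import numpy as np

def sentiment_attention(s, A_s, U):
    """First attention layer (sketch): score each word in s against the
    global sentiment vector v_a averaged over the aspect-opinion words,
    then normalize the scores with a softmax to obtain alpha."""
    v_a = A_s.mean(axis=0)                  # global sentiment vector, R^d
    d_scores = s @ (U @ v_a)                # one relevance score per word
    e = np.exp(d_scores - d_scores.max())   # numerically stable softmax
    return e / e.sum()                      # alpha in R^n, sums to 1
```

Words whose embeddings align with the transformed sentiment vector receive larger weights, which is exactly the down-weighting of non-sentimental words described above.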
Although we have computed the sentimental importance of each word in s, not all the sentiment words are contextually related to the aspect. Therefore, we apply another self-attention mechanism that takes only the sentence s as input. To compute each word's probability of being aspect-relevant, namely β_i, we follow the same steps as in the first attention layer. The only difference is that the global sentiment vector v_a in Equation 6 is replaced by the global context vector, which is simply computed by averaging the word embeddings of the input sentence s itself.
Finally, we sum the outputs of the two attention layers, α_s ∈ R^n and β_s ∈ R^n, and use the result to construct the weighted vector v_s as follows:

\[ v_s = \sum_{i=1}^{n} (\alpha_i + \beta_i) \, v_{s_i}, \tag{8} \]

in which the resulting vector v_s can be read as the aspect-relevant sentiment representation of the input sentence s.
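Combining the two attention outputs into v_s can be sketched as follows, assuming alpha and beta have already been produced by the two layers (the simple sum of the two weight vectors follows the description above):

```python
import numpy as np

def weighted_sentence(s, alpha, beta):
    """Sum the sentiment weights (alpha) and aspect-relevance weights
    (beta) per word, then take the weighted sum of the word embeddings
    to obtain the aspect-relevant sentiment representation v_s."""
    return (alpha + beta) @ s    # v_s in R^d
```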

Unsupervised Training
Now that we have obtained the aspect-relevant sentiment representation of an input sentence s, we explain how to learn its polarity indicators using a variant of dictionary learning. Considering the matrix of indicators I as a dictionary, we attempt to approximate v_s as a linear combination of its items. Formally, for each aspect-specific vector v_s, we compute a corresponding vector v_s^k over the k polarity indicators by reducing v_s from d dimensions to k dimensions through a softmax layer as follows:

\[ v_s^k = \mathrm{softmax}(W v_s + b), \tag{9} \]

where W ∈ R^{k×d} denotes a weight matrix and b denotes the bias, both of which are learned during training. Note that v_s^k can be read as the probability that the input sentence s belongs to each indicator. Then, we reconstruct the representation vector by taking a weighted average over the polarity indicators as follows:

\[ v_s^I = I^\top v_s^k. \tag{10} \]

Since the objective is to make v_s^I similar to v_s, we apply the widely used contrastive max-margin objective function (Iyyer et al., 2016; Ahmed et al., 2020). To that end, we randomly sample some sentences from the training data as negative samples, and compute the weighted vector for each sample using Equation 8.
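The reduction and reconstruction steps can be sketched as follows (a toy illustration with our own names; in the model, W, b, and I would all be learned):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # numerically stable softmax
    return e / e.sum()

def reduce_and_reconstruct(v_s, W, b, I):
    """Sketch of the dictionary-learning step: map v_s to probabilities
    over the k polarity indicators, then rebuild a d-dimensional vector
    as a weighted average of the indicator embeddings I (k x d)."""
    v_k = softmax(W @ v_s + b)   # indicator membership probabilities, R^k
    v_I = v_k @ I                # reconstructed sentence vector, R^d
    return v_k, v_I
```

Because v_k is a probability distribution, the reconstruction is a convex combination of indicator embeddings, which is what forces each indicator to summarize a recurring sentiment context.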
Specifically, the objective is a hinge loss that minimizes the inner product between the reconstructed vector v_s^I and the weighted vectors of the negative samples, while simultaneously maximizing its inner product with v_s:

\[ J = \sum_{s \in D} \sum_{i=1}^{q} \max(0,\, 1 - v_s^I v_s + v_s^I n_i), \tag{11} \]

where D denotes the training data and n_1, ..., n_q the weighted vectors of the q negative samples. To discourage the model from learning similar indicators, we add a regularization term to the objective function J that penalizes redundancy in the matrix of polarity indicators (Iyyer et al., 2016):

\[ M = \lVert I_n I_n^\top - Y \rVert, \tag{12} \]

where Y denotes the identity matrix and I_n the row-normalized I. The final training objective L is then represented by the weighted sum of J and M as follows:

\[ L = J + \lambda M, \tag{13} \]

where λ is a hyper-parameter that controls the magnitude of the regularization term.
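The two loss terms can be sketched as follows (a simplified single-sentence version with our own function names):

```python
import numpy as np

def hinge_loss(v_s, v_I, negatives):
    """Contrastive max-margin objective (sketch): push the reconstruction
    v_I towards the true weighted vector v_s and away from the weighted
    vectors of randomly sampled negative sentences."""
    pos = float(v_I @ v_s)
    return sum(max(0.0, 1.0 - pos + float(v_I @ n)) for n in negatives)

def redundancy_penalty(I):
    """Penalize similar polarity indicators: the norm of I_n I_n^T - Y,
    where I_n is the row-normalized indicator matrix and Y the identity."""
    I_n = I / np.linalg.norm(I, axis=1, keepdims=True)
    return float(np.linalg.norm(I_n @ I_n.T - np.eye(I.shape[0])))
```

The penalty is zero exactly when the indicator rows are mutually orthogonal unit vectors, which is what keeps the learned indicators from collapsing onto one another.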

Unsupervised GML Solution
Now that we have learned a set of polarity indicators, we describe how they can serve as implicit features for gradual inference. Given a sentence, we first estimate its aspect-specific vector v_s using Equation 8, and then compute the cosine similarity between v_s and each polarity indicator in I. This results in a list of (indicator index, similarity score) pairs representing how close the aspect-term's features are to each indicator. We sort the scores and use the top-k corresponding indicators as representative features. We scale each similarity score by 10 (truncated to an integer) to augment the number of distinct features while avoiding polarity conflicts between features. As shown in Figure 3, f 5.4 represents an indicator feature with the index of 5 and the similarity scale of 4.
Considering the instance of s 1.1 in Figure 3, its top-5 indicators and their similarity scores are (5, 0.44), (29, 0.2), (15, 0.16), (4, 0.13) and (9, 0.13). Its representative features are therefore F_{s 1.1} = {f 5.4, f 29.2, f 15.1, f 4.1, f 9.1}. In gradual inference, we restrict two instances to share an implicit feature if and only if they are similar to the same indicator with the same score scale. For instance, in Figure 3, s 1.1 and s 1.2 share the indicators 4 and 5 with the same score scales of 1 and 4, respectively.
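The conversion from similarity scores to shared implicit features can be sketched as follows (the feature-name format is illustrative):

```python
def indicator_features(scores, top_k=5):
    """Turn (indicator index, cosine similarity) pairs into discrete
    shared features: keep the top-k indicators and scale each score by
    10, truncated to an integer."""
    top = sorted(scores, key=lambda p: p[1], reverse=True)[:top_k]
    return [f"f_{idx}.{int(sim * 10)}" for idx, sim in top]
```

Applied to the running example's scores, this reproduces the feature set listed above; two instances then share a feature exactly when they produce the same (index, scale) string.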
Given an ATSA task, each aspect-term within the same review is represented by a variable. The evidence variables are assigned constant values, 0 or 1, representing their polarity labels, while the values of the inference variables are inferred based on G. The factor of an implicit feature f_e in Equation 1 is defined by:

\[ \phi_{f_e}(v_i) = \begin{cases} e^{w_{f_e}} & \text{if } v_i = 1, \\ e^{-w_{f_e}} & \text{if } v_i = 0, \end{cases} \tag{14} \]

where v_i denotes a variable having the feature f_e, and w_{f_e} denotes the weight of f_e. Note that the weight w_{f_e} is initialized to zero, but needs to be learned in the process of gradual inference.
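Under our reading of the factor definition (the exact parameterization over v in {0, 1} is an assumption on our part), the implicit-feature factor can be sketched as:

```python
import math

def implicit_factor(v, w):
    """Implicit-feature factor value for a binary variable v (sketch):
    a positive weight w favors the positive label (v = 1), a negative
    weight favors the negative label, and w = 0 is neutral."""
    return math.exp(w) if v == 1 else math.exp(-w)
```

With the weight initialized to zero, the factor value is 1 for either label, so a feature exerts no influence until gradual inference has learned its weight from the evidence variables.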

Hybrid GML Solution
In the hybrid solution, we model the influence of DNN outputs by DNN factors, denoted by f_n, as shown in Figure 3. In this paper, we have implemented the hybrid solution with the state-of-the-art BERT-based DNN of HP-SUM for ATSA (Karimi et al., 2020); however, other DNN models can be fused in the same way. Since supervised learning is usually more accurate than unsupervised learning, we also label easy instances by supervised DNN predictions. In other words, we consider the instances with the most extreme probabilities predicted by HP-SUM as easy ones to kick-start gradual inference. In the factor graph, the DNN factor f_n of the variable corresponding to an aspect-term unit (r, s, t) is defined by:

\[ \phi_{f_n}(v) = \begin{cases} e^{w_{f_n}} & \text{if } v = 1, \\ e^{-w_{f_n}} & \text{if } v = 0, \end{cases} \tag{15} \]

in which w_{f_n} denotes the factor weight, whose value is defined as

\[ w_{f_n} = \ln \frac{p}{1-p}, \tag{16} \]

where p is the probability output of the DNN (i.e., estimated by HP-SUM) of the target t being positive in the sentence s. It can be observed that w_{f_n} > 0 if p > 0.5; otherwise, if p < 0.5, then w_{f_n} < 0.
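The sign behavior described above matches a log-odds weighting; here is a sketch under that assumption (the exact formula is our reading of the definition, and the epsilon guard is our own addition):

```python
import math

def dnn_factor_weight(p, eps=1e-6):
    """Log-odds weight for the DNN factor: positive when the DNN
    prediction p leans positive (p > 0.5), negative when it leans
    negative, and zero at p = 0.5."""
    p = min(max(p, eps), 1.0 - eps)   # guard against p exactly 0 or 1
    return math.log(p / (1.0 - p))
```

Confident DNN predictions thus translate into large-magnitude factor weights, letting the supervised signal dominate where it is certain while the implicit features carry the uncertain cases.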

Empirical Evaluation
We evaluate our solution on six benchmark datasets provided by the SemEval ABSA task across the years 2014, 2015 and 2016 for the Restaurant and Laptop domains (Pontiki et al., 2016). Note that the original datasets carry three-way labels (i.e., positive, negative and neutral). Since this paper focuses on binary polarity classification, we only include the reviews with positive or negative labels in our experiments. Furthermore, we have trained the polarity indicators for the restaurant and laptop domains on unlabeled corpora collected from Citysearch and Amazon, which have also been widely used in previous work (Zhao et al., 2010; Ahmed et al., 2020). For unsupervised training, we initialized word vectors by word2vec. We implemented GML inference using the Numbskull library (https://github.com/HazyResearch/numbskull), a Python NUMBA-based Gibbs sampler. Our GML implementation optimizes the parameters by Adam with the learning rate of 0.001. On all the test datasets, we set the number of polarity indicators k to 50, and the number of negative samples to 20. In the spread activation algorithm for easy instance labeling, the easiness threshold is set to 0.7 for all datasets. For the hybrid GML solution, the easy instances are the top-30% ones with the most extreme probabilities as predicted by the supervised DNN. For each instance, the associated implicit features are the top-5 polarity indicators, with their scores scaled by 10.
Note that among the listed DNN models, the last three (i.e., AEN-BERT, BERT-SPC, and HP-SUM) were built upon the pre-trained BERT.

Main Results
We average performance over three runs and report the detailed evaluation results in Table 2. We have the following observations: (1) the unsupervised DNN-driven GML consistently gives the best accuracy compared to the unsupervised alternatives across all datasets. The performance advantage of the DNN-driven GML over the lexicon-based GML suggests that a carefully-designed implicit feature mechanism can effectively perform better than lexicon-based explicit features for ATSA; (2) the unsupervised DNN-driven GML is even competitive with the traditional supervised attention-based models; (3) the supervised BERT-based approaches indeed achieve better performance than both traditional attention-based DNNs and unsupervised GML. However, their efficacy depends on the fine-tuning phase, which requires access to labeled training data that are not available in the unsupervised setting; (4) the hybrid GML solution consistently achieves the state-of-the-art performance across all datasets. It improves the best performance by almost 2% on two datasets and by 1%-2% on four out of six datasets. In light of the well-recognized challenge of ATSA, these improvements are indeed considerable.

Illustrative Examples. To illustrate the effectiveness of implicit features, we present the features of the running example in Table 3. It can be observed that overwhelmed in s 1.1 as well as well-prepared and presented in s 2.2 are not captured by the lexicon, and the context of friendly in s 1.2 is easy to misread. Even though s 1.1 and s 1.2 do not share any explicit feature, the negative context of friendly is very close to overwhelmed; they thus share the implicit features f 4.1 and f 5.4. Likewise, well-prepared in s 2.2 is very close to perfectly cooked in s 2.1, and they share the implicit features f 7.4 and f 15.3.

Effectiveness of Sentiment Weighting
We illustrate the effectiveness of the designed attention mechanisms in terms of attending to aspect-relevant sentiment information and understanding the context. We retrieve samples from the datasets and visualize their attention weights in Figure 4, in which the deeper the color, the more important a word is. It can be observed that the aspect-opinion words are weighted more heavily than the others, and the model effectively attends to sentiment words that are not in the lexicon (e.g., personable and gracious in (a) and (c), respectively). Since the sentiment words dominate the sentence representation in Equation 8, this indeed encourages the model to sentimentally learn the representations of the polarity indicators.

Conclusion
In this work, we propose a novel DNN-driven GML approach for ATSA that can effectively leverage common implicit features for knowledge conveyance. Our extensive experiments have shown that the proposed approach consistently achieves the state-of-the-art performance in both unsupervised and supervised settings. For future work, it is noteworthy that the DNN-driven GML approach is potentially applicable to other classification tasks; the technical solutions, however, need further investigation.

Funding: The work was supported by the National Key Research and Development Program of China (2018YFB1003400), National Natural Science Foundation of China (61732014, 61672432,