Variational Weakly Supervised Sentiment Analysis with Posterior Regularization

Sentiment analysis is an important task in natural language processing (NLP). Most existing state-of-the-art methods follow the supervised learning paradigm, but human annotations can be scarce, which motivates leveraging weaker forms of supervision. In this paper, we propose a posterior regularization framework for the variational approach to weakly supervised sentiment analysis, which better controls the posterior distribution of the label assignment. The intuition behind the posterior regularization is that if the opinion words extracted from two documents are semantically similar, the posterior distributions of the two documents should also be similar. Our experimental results show that posterior regularization improves the original variational approach to weakly supervised sentiment analysis and makes the performance more stable, with smaller prediction variance.


Introduction
Sentiment analysis is the task of identifying the sentiment polarity expressed in textual data (Liu, 2012). Most state-of-the-art sentiment analysis methods in the literature are supervised and require large amounts of labeled training data. However, human annotations in the real world are scarce. Rather than assuming abundant annotated data for training ever more complex models, there is still a need for weakly supervised methods that require less human annotation.
One way to perform weakly supervised sentiment analysis is to use a predefined lexicon (Turney, 2002; Taboada et al., 2011). A lexicon consists of many opinion words. For each opinion word, its polarity (positive or negative) and strength (the degree to which the opinion word is positive or negative) are annotated by domain experts. Lexicon-based weakly supervised methods perform a dictionary lookup and assign a polarity according to all opinion words extracted from a document. A good lexicon requires both high precision and high coverage, which demands a lot of human effort.
Another way to do weakly supervised sentiment analysis is to use a limited set of keywords (Meng et al., 2018; Zeng et al., 2019). Compared with lexicon-based methods, user-provided keywords require less human effort. Keyword-based methods fall into two directions. First, Meng et al. (2018) leveraged limited keywords to expand more keywords and generate pseudo-labeled data, and then performed self-training on real unlabeled data for model refinement. Possible improvements in this direction include investigating more advanced keyword expansion techniques to generate better pseudo-labeled samples (Miller et al., 2012) and developing more advanced self-training algorithms (Coden et al., 2014).
Second, Variational Weakly Supervised (VWS) sentiment analysis (Zeng et al., 2019) uses target-opinion word pairs as the supervision signal. Its objective is to predict an opinion word given a target word. For example, in the sentence "the room is big," "room" is a target word and "big" is an opinion word. By introducing a latent variable (the sentiment polarity), VWS learns a well-approximated posterior distribution by optimizing the evidence lower bound. The posterior probability here is the probability of a possible polarity (e.g., positive or negative) given a document, which is exactly a sentiment classifier.
A potential issue with VWS is that optimizing its objective function may not guide the latent variable to play the role of sentiment polarity. For example, when half of the reviews mention "big room" and half mention "small room," the latent variable may end up capturing the size of rooms, whereas its expected role is the sentiment polarity toward rooms. Hence, controlling and regularizing the posterior distribution is very important. One indirect way to control the posterior distribution is clever initialization (Ganchev et al., 2010). Originally, VWS aims to predict the sentiment polarity of each aspect in multi-aspect sentiment analysis, so it uses the overall sentiment polarity to pretrain the model and give the posterior distribution a good initialization. Since the overall polarity and the polarity of each aspect are highly correlated, this initialization is likely to be close to the true posterior distribution.
In this paper, we propose to use posterior regularization to regularize the VWS approach to sentiment analysis. There are two types of side information we can leverage to regularize the latent variable. First, the similarity between keywords and the opinion words extracted from a document can guide the model to decide which polarity the document belongs to. Second, the similarity between the opinion words extracted from two documents can guide the model to decide whether the two documents belong to the same polarity. The first type of side information is easy to leverage and is reflected in our pretraining process: when a document is similar to the keywords associated with a specific polarity, we enforce that the posterior probability of that polarity should be large by assigning a pseudo label to the document during pretraining. The second type of side information does not directly suggest which sentiment polarity a document should be assigned to; instead, it imposes pairwise constraints on the model. Our proposed posterior regularization leverages the second type of side information to ensure that when two documents are similar (dissimilar), the posterior distributions of the two documents are encouraged to be similar (dissimilar).
Our contributions are summarized as follows: • We develop a posterior regularization framework for variational weakly supervised sentiment analysis.
• The experimental results show that the proposed regularization can improve the VWS model, make the results more stable, and outperform other weakly supervised baselines.

Methodology
In this section, we first review the variational weakly supervised (VWS) sentiment analysis method in Section 2.1. Then we introduce our posterior regularization in Section 2.2.

VWS Sentiment Analysis
Before formally introducing the VWS framework, we give a concrete example to illustrate how VWS works. Let x be the representation of a document x, e.g., a bag of words or the feature output of a neural network. Let C be a random variable indicating the sentiment polarity of a document. The possible values of C can be positive or negative, or ratings from 1 to 10. Suppose there is a document x from which we extract the opinion word "terrific." The objective is to maximize the probability of the opinion word "terrific." By introducing the latent variable C, the objective is split into two probabilities, corresponding to two classifiers: a sentiment polarity classifier and an opinion word classifier. The input of the sentiment classifier is the document representation x, and it produces a probability distribution over sentiment polarities, i.e., p(C = positive|x) and p(C = negative|x). The input of the opinion word classifier is the extracted opinion words and the estimated sentiment polarity distribution, and it produces the probability of an opinion word given each sentiment polarity, i.e., p("terrific"|C = positive) and p("terrific"|C = negative).

Sentiment Polarity Classifier
The sentiment polarity classifier aims to estimate a distribution q(C|x), where C is a discrete random variable representing the sentiment polarity of a document. Let c denote a possible value of the random variable C, e.g., positive or negative. The sentiment classifier estimates the probability as

q(C = c|x) = exp(w_c^T x) / Σ_{c'} exp(w_{c'}^T x),  (1)

where w_c is a trainable vector associated with a sentiment polarity c, x is a document, and x is the document representation. The representation of a document x can be produced in various ways; in our experiments we use a Convolutional Neural Network (CNN), i.e.,

x = CNN(x).  (2)
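As a minimal sketch of the classifier above (not the authors' implementation), the softmax over per-class scores can be written in pure Python; the 3-dimensional features and class weights below are hypothetical toy values standing in for a CNN output and trained parameters:

```python
import math

def softmax(scores):
    # Subtract the max before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def sentiment_posterior(x, class_weights):
    """q(C = c | x): softmax over the per-class scores w_c^T x.

    x             -- document feature vector (e.g., a CNN output)
    class_weights -- dict mapping a polarity name to its weight vector w_c
    """
    labels = list(class_weights)
    scores = [sum(wi * xi for wi, xi in zip(class_weights[c], x))
              for c in labels]
    return dict(zip(labels, softmax(scores)))

# Toy example: this feature vector aligns with the "positive" weights.
q = sentiment_posterior([1.0, -0.5, 2.0],
                        {"positive": [0.9, 0.1, 0.4],
                         "negative": [-0.8, 0.3, -0.2]})
```

The output is a proper distribution over polarities, which is what the regularizer in Section 2.2 compares across documents.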

Opinion Word Classifier
The opinion word classifier aims to estimate the probability of an opinion word w_o given a possible value of sentiment polarity c:

p(w_o|c) = exp(ϕ(w_o, c)) / Σ_{w'_o ∈ V} exp(ϕ(w'_o, c)),  (3)

where ϕ(·, ·) is a scoring function taking an opinion word w_o and a possible value of sentiment polarity c as inputs, and V is the opinion word vocabulary. The scoring function essentially captures co-occurrence frequency: if an opinion word and a sentiment polarity value co-occur frequently, the score will be high; otherwise, it will be low. Specifically, we define

ϕ(w_o, c) = w_o^T a_c,  (4)

where w_o is the trainable word embedding of the opinion word w_o and a_c is a trainable vector associated with c. The scoring function can take various forms, e.g., a multilayer perceptron (MLP); here we only introduce the simplest case. Given a possible value of sentiment polarity c, VWS aims to maximize the probability of opinion words that frequently occur with c. For example, the opinion word "good" usually occurs with the positive polarity, and the opinion word "terrible" usually occurs with the negative polarity.
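A sketch of the opinion word classifier under the dot-product scoring function; the 2-dimensional embeddings and class vector below are illustrative assumptions (in the model they are trained jointly):

```python
import math

def phi(word_emb, class_vec):
    # Scoring function: phi(w_o, c) = w_o^T a_c.
    return sum(w * a for w, a in zip(word_emb, class_vec))

def opinion_word_distribution(vocab_embs, class_vec):
    """p(w_o | c): softmax of phi over the whole opinion vocabulary."""
    scores = {w: phi(e, class_vec) for w, e in vocab_embs.items()}
    m = max(scores.values())
    exps = {w: math.exp(s - m) for w, s in scores.items()}
    z = sum(exps.values())
    return {w: e / z for w, e in exps.items()}

# Hypothetical embeddings: "good" aligns with the positive class vector,
# "terrible" points the opposite way.
vocab = {"good": [1.0, 0.0], "terrible": [-1.0, 0.2], "ok": [0.1, 0.1]}
a_pos = [1.0, 0.0]
p = opinion_word_distribution(vocab, a_pos)
```

Words that co-occur with a polarity end up with embeddings aligned to that polarity's vector, so their conditional probability is high.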

Training Objective
The objective of VWS is to maximize the log-likelihood of each opinion word w_o. After introducing a latent variable (the sentiment polarity of a document), we can derive a variational lower bound of the log-likelihood that incorporates the two classifiers: the first term corresponds to the sentiment classifier and the second to the opinion word classifier. The variational lower bound of the log-likelihood is

L = Σ_{x ∈ X} Σ_{w_o ∈ P_x} ( Σ_c q(c|x) log p(w_o|c) + H(q(C|x)) ),  (5)

where X is the training set containing all documents, P_x is the set of all opinion words extracted from a document x, H(·) denotes the Shannon entropy, and q(c|x) is short for q(C = c|x). By applying Jensen's inequality, the log-likelihood is lower-bounded by Eq. (5). The equality holds if and only if the KL-divergence between the two distributions q(C|x) and p(C|w_o) equals zero. Maximizing the evidence lower bound is thus equivalent to minimizing this KL-divergence, so VWS can learn a sentiment classifier whose output distribution is close to the true posterior p(C|w_o). We assume that the training set is perfectly balanced, so the prior distribution over sentiment polarity, p(C), is uniform. Hence p(c) is a constant and can be dropped from the bound.
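For readers who want the intermediate step, the bound follows from the standard evidence-lower-bound argument, restated here in the section's notation (the final prior term is constant under the uniform p(C) and is dropped):

```latex
\begin{aligned}
\log p(w_o)
  &= \log \sum_{c} q(c \mid \mathbf{x}) \,
         \frac{p(w_o \mid c)\, p(c)}{q(c \mid \mathbf{x})} \\
  &\ge \sum_{c} q(c \mid \mathbf{x})
         \log \frac{p(w_o \mid c)\, p(c)}{q(c \mid \mathbf{x})}
         \qquad \text{(Jensen's inequality)} \\
  &= \sum_{c} q(c \mid \mathbf{x}) \log p(w_o \mid c)
     + \mathbb{H}\bigl(q(C \mid \mathbf{x})\bigr)
     + \sum_{c} q(c \mid \mathbf{x}) \log p(c).
\end{aligned}
```

Summing over documents x ∈ X and extracted opinion words w_o ∈ P_x, and dropping the constant prior term, gives the lower bound used in this section.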

Approximation
The partition function in Eq. (3) requires a summation over all opinion words in the vocabulary. Since the opinion word vocabulary is large, VWS uses the negative sampling technique (Mikolov et al., 2013) to approximate Eq. (3). Specifically, VWS replaces log p(w_o|c) in the objective with

log σ(ϕ(w_o, c)) + Σ_{w'_o ∈ N} log σ(−ϕ(w'_o, c)),  (6)

where w'_o is a negative sample from the opinion word vocabulary, N is the set of negative samples, and σ(·) is the sigmoid function. To ensure that the approximation term and the entropy term are on the same scale (Marcheggiani and Titov, 2016), a hyper-parameter α is added to the entropy term, and the objective function becomes

L' = Σ_{x ∈ X} Σ_{w_o ∈ P_x} ( Σ_c q(c|x) [ log σ(ϕ(w_o, c)) + Σ_{w'_o ∈ N} log σ(−ϕ(w'_o, c)) ] + α H(q(C|x)) ).  (7)
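A small, self-contained sketch of the negative-sampling approximation (not the released implementation); the numeric scores below are made-up values standing in for ϕ(w_o, c):

```python
import math

def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)), avoiding overflow for large |z|.
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def approx_log_p(pos_score, neg_scores):
    """Negative-sampling estimate of log p(w_o | c):
    log sigma(phi(w_o, c)) + sum over negatives of log sigma(-phi(w'_o, c)).
    """
    return log_sigmoid(pos_score) + sum(log_sigmoid(-s) for s in neg_scores)

# The observed pair gets a high score; sampled negatives get low scores,
# so the approximation rewards separating them.
val = approx_log_p(pos_score=2.5, neg_scores=[-1.0, -0.5, -2.0])
```

The estimate is largest when the observed pair scores high and every sampled negative scores low, which is exactly the co-occurrence behavior the scoring function is meant to capture.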

Posterior Regularization
As pointed out by Ganchev et al. (2010), controlling the posterior distribution is crucial for models that estimate it by maximizing the likelihood of observed data while marginalizing over latent variables. We need side information to regularize the posterior distribution. The side information we leverage is that if the opinion words extracted from two documents are semantically similar, the two documents are probably in the same class, and if the opinion words are semantically opposite, the two documents are probably in different classes. For example, if a document x_i contains the opinion words "great" and "awesome," a document x_j contains "great" and "excellent," and a document x_k contains "awful" and "terrible," then x_i and x_j most likely belong to the same class because their extracted opinion words are semantically similar, while x_i and x_k most likely belong to different classes because their extracted opinion words are semantically opposite. We formulate our posterior regularization as

R = − Σ_{x_i, x_j ∈ X} S(O(x_i), O(x_j)) · d( q(C|x_i), q(C|x_j) ),  (8)

where d(·, ·) is the distance between two posterior distributions, for which we use the Euclidean distance; S(·, ·) is a scoring function measuring the similarity or dissimilarity between two sets of opinion words; and O(x_i) denotes all opinion words extracted from a document x_i. We maximize Eq. (8) as part of the objective function.
When S(O(x_i), O(x_j)) is positive (suggesting similar), the regularization enforces the distance to be small; when S(O(x_i), O(x_j)) is negative (suggesting dissimilar), it enforces the distance to be large; and when S(O(x_i), O(x_j)) is zero, the comparison between opinion words cannot decide whether the two documents are similar or not. Next, we introduce the scoring function S(·, ·). Suppose documents x_i and x_j contain the opinion word sets O(x_i) and O(x_j), respectively. We define an operation cos(O(x_i), O(x_j)) over the two sets, which returns the cosine similarity values of all valid opinion word pairs, where one word must come from O(x_i) and the other from O(x_j). Opinion words are represented by their embeddings, i.e., the embeddings in the opinion word classifier in Eq. (4). If there are k opinion words in each set, cos(O(x_i), O(x_j)) returns k × k cosine similarity values. To decide whether two documents are similar in opinion words, we look at the maximum value, max_cos = max cos(O(x_i), O(x_j)); to decide whether they are dissimilar, we look at the minimum value, min_cos = min cos(O(x_i), O(x_j)). We then define

S(·, ·) = max_cos,  if max_cos > γ_1 and min_cos ≥ γ_2;
S(·, ·) = min_cos,  if max_cos ≤ γ_1 and min_cos < γ_2;
S(·, ·) = δ,        if max_cos > γ_1 and min_cos < γ_2;
S(·, ·) = 0,        otherwise,  (9)

where S(·, ·) is short for S(O(x_i), O(x_j)) due to the space limit. The first condition means the two documents have some semantically similar opinion words (max_cos > γ_1) and no semantically dissimilar ones (min_cos ≥ γ_2); the returned value, max_cos, is positive. The second condition means the two documents have no semantically similar opinion words (max_cos ≤ γ_1) and some semantically dissimilar ones (min_cos < γ_2); the returned value, min_cos, is negative. The third condition means the two documents have both semantically similar opinion words (max_cos > γ_1) and semantically dissimilar ones (min_cos < γ_2).
This condition corresponds to a real-world situation: when customers want to express negative sentiment, they often point out some positive aspects first, then start with a "but," and emphasize the negative aspects. The opinion word sets extracted from such documents contain both positive and negative opinion words, so when we compare two of them, they have both similar and dissimilar opinion words.
In this case, we tend to assume they are in the same class, so if the third condition is satisfied, the function returns a non-negative value δ ∈ [0, 1]. The final condition means the two documents have neither semantically similar opinion words (max_cos ≤ γ_1) nor semantically dissimilar ones (min_cos ≥ γ_2), in which case the function returns 0.
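The piecewise score above can be sketched as follows. The default thresholds γ_1 = 0.7 and γ_2 = −0.1 are the optimal values reported in the sensitivity analysis; the value δ = 0.5 and the 2-dimensional word embeddings are illustrative assumptions:

```python
import math

def cosine(u, v):
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def pair_score(ops_i, ops_j, gamma1=0.7, gamma2=-0.1, delta=0.5):
    """S(O(x_i), O(x_j)) from cross-set cosine similarities of embeddings."""
    sims = [cosine(u, v) for u in ops_i for v in ops_j]
    max_cos, min_cos = max(sims), min(sims)
    if max_cos > gamma1 and min_cos >= gamma2:
        return max_cos    # only similar pairs -> probably the same class
    if max_cos <= gamma1 and min_cos < gamma2:
        return min_cos    # only dissimilar pairs -> probably different classes
    if max_cos > gamma1 and min_cos < gamma2:
        return delta      # mixed opinions -> weakly assume the same class
    return 0.0            # inconclusive

# Hypothetical embeddings: "great"/"awesome" are close, "awful" is opposite.
great, awesome, awful = [1.0, 0.1], [0.9, 0.2], [-1.0, 0.0]
same = pair_score([great, awesome], [great, awesome])
diff = pair_score([great], [awful])
```

A positive score pulls the two posteriors together, a negative score pushes them apart, matching the mechanism described above.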
The mechanism of the regularization is as follows. If the posterior distributions q(C|x_i) and q(C|x_j) are different from each other, i.e., d(q(C|x_i), q(C|x_j)) is large, but the opinion words suggest that the two documents should be in the same class, i.e., S(O(x_i), O(x_j)) is positive, then the regularization encourages d(q(C|x_i), q(C|x_j)) to be small. Conversely, if the posterior distributions are similar, i.e., d(q(C|x_i), q(C|x_j)) is small, but the opinion words suggest that the two documents should be in different classes, i.e., S(O(x_i), O(x_j)) is negative, then the regularization encourages d(q(C|x_i), q(C|x_j)) to be large.
The final objective function with posterior regularization is

L'' = Σ_{x ∈ X} Σ_{w_o ∈ P_x} ( Σ_c q(c|x) [ log σ(ϕ(w_o, c)) + Σ_{w'_o ∈ N} log σ(−ϕ(w'_o, c)) ] + α H(q(C|x)) ) − β Σ_{x_i, x_j ∈ X} S(O(x_i), O(x_j)) · d( q(C|x_i), q(C|x_j) ),  (10)

where β is a hyper-parameter controlling the strength of the regularization, and x_i and x_j are documents in the training set X. The constraints are defined over an |X| × |X| space. In practice, we train our model batch by batch, so we only apply the constraints within a mini-batch. There are at most |X_b| × |X_b| constraints in a mini-batch, where |X_b| is the number of samples in a mini-batch.
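Putting the pieces together, the regularization term over one mini-batch can be sketched as follows; the posterior distributions and the pair-score callable are illustrative inputs (in the model they come from q(C|x) and the embedding-based scoring function):

```python
import math

def euclidean(p, q):
    # Euclidean distance between two polarity distributions.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def batch_pr_term(posteriors, score_fn):
    """Mini-batch regularizer: - sum over pairs of S(i, j) * d(q_i, q_j).

    posteriors -- list of per-document polarity distributions q(C|x)
    score_fn   -- callable giving the pair score S for document indices (i, j)
    """
    r = 0.0
    for i in range(len(posteriors)):
        for j in range(i + 1, len(posteriors)):
            r -= score_fn(i, j) * euclidean(posteriors[i], posteriors[j])
    return r

# Two confident documents whose pair score says "same class" (S = 1) but
# whose posteriors disagree: the term is negative, so maximizing the
# objective pushes the two distributions together.
r = batch_pr_term([[0.9, 0.1], [0.2, 0.8]], lambda i, j: 1.0)
```

In training this quantity would be added to the VWS objective with weight β and maximized jointly with it.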

Experiments
In this section, we evaluate the empirical performance of our method on binary sentiment classification tasks.

Datasets
We use three corpora to evaluate the performance of our proposed method. All corpora have two classes and are perfectly balanced. For all methods, we use a development set for hyper-parameter tuning. Since none of the methods use the ground-truth labels in the training set, we use the training set as the test set.
(1) Yelp Review: We use the Yelp reviews polarity dataset from (Zhang et al., 2015) and take its test set containing 38,000 documents as the corpus for evaluation. For hyper-parameter tuning, we also extract 3,800 documents from the original training set of (Zhang et al., 2015) to serve as a development set.
(2) IMDB Review: We use the IMDB reviews polarity dataset from (Maas et al., 2011) and randomly extract 20,000 reviews from its original test set as the corpus for evaluation. For hyper-parameter tuning, we also extract 2,000 documents from the original training set of (Maas et al., 2011) to serve as a development set.
(3) Amazon Review: We use the Amazon reviews polarity dataset from (Zhang et al., 2015) and randomly extract 20,000 reviews from its original test set as the corpus for evaluation. For hyper-parameter tuning, we also extract 2,000 documents from the original training set of (Zhang et al., 2015) to serve as a development set. Table 1 provides the details of these datasets.

Compared Methods
Lexicon uses an opinion lexicon to assign a sentiment polarity to each document (Read and Carroll, 2009; Pablos et al., 2015). We combine two popular opinion lexicons used by Hu and Liu (2004) and Wilson et al. (2005) to get a larger lexicon. If an extracted opinion word is in the positive (negative) lexicon, it votes for positive (negative). When an opinion word occurs with a negation word such as "no" or "not," its polarity is flipped. The polarity of a document is then determined by majority voting among all extracted opinion words. When the numbers of positive and negative words are equal, the document is randomly assigned a polarity.
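The lexicon baseline's voting rule can be sketched as follows; the tiny lexicon below is a stand-in for the combined Hu and Liu (2004) and Wilson et al. (2005) lexicons used in the paper:

```python
import random

POSITIVE = {"good", "great", "delicious"}   # toy positive lexicon
NEGATIVE = {"bad", "terrible", "awful"}     # toy negative lexicon

def lexicon_vote(opinion_words):
    """Majority vote over extracted (word, negated) pairs; ties break randomly."""
    votes = 0
    for word, negated in opinion_words:
        v = 1 if word in POSITIVE else -1 if word in NEGATIVE else 0
        if negated:       # "not good" votes negative, "not bad" votes positive
            v = -v
        votes += v
    if votes > 0:
        return "positive"
    if votes < 0:
        return "negative"
    return random.choice(["positive", "negative"])

# "good" and "great" vote positive; negated "terrible" also votes positive.
label = lexicon_vote([("good", False), ("terrible", True), ("great", False)])
```

As the paper notes, this baseline involves no learning: its accuracy is bounded by the lexicon's precision and coverage.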
WeSTClass (Meng et al., 2018) first generates pseudo labels for documents that contain user-provided keywords. Keywords are expanded to generate more pseudo samples. It pretrains a CNN/LSTM model using the pseudo samples as the training set and then performs a self-training process. Here, we use CNN because it empirically outperforms LSTM; the CNN architecture is the same as the one described in (Meng et al., 2018). Keyword Pretrain generates pseudo labels for documents that contain user-provided keywords and only pretrains a model on these pseudo-labeled samples, without self-training. VWS (Zeng et al., 2019) is the variational weakly supervised method reviewed in Section 2.1. VWS-PR is the VWS method with our proposed posterior regularization.

Keywords and Opinion Word Extraction
We manually select three keywords for each class. The keywords for the three datasets are shown in Table 2.
For opinion word extraction, we adopt the four rules proposed by VWS (Zeng et al., 2019). All rules rely on a dependency parser (Chen and Manning, 2014): when a target word and an opinion word satisfy a dependency relation, we extract the opinion word. The dependency relations and examples are provided in Table 3. When a pair of words satisfies a rule, some additional restrictions on the head and tail words must also hold. There is no restriction for Rule 1. For Rule 2, the head word should be an adjective and the tail word should be a noun. For Rule 3, the head word should be one of the following four words: "like," "dislike," "love," and "hate." For Rule 4, the head word should be one of the following five words: "seem," "look," "feel," "smell," and "taste."
Table 4 shows that our method VWS-PR outperforms VWS by 4%, 2%, and 1% on the Yelp, IMDB, and Amazon datasets, respectively. Compared with WeSTClass and VWS, our method is also more stable, i.e., it has a smaller standard deviation, which shows that the regularization confines the posterior distribution to a smaller space. The lexicon method performs poorly across the three datasets, mainly because it does not involve any learning process. The keyword pretraining method outperforms the lexicon method, but pseudo labels are not ground truths, so the pseudo training set contains noise. Moreover, user-provided keywords are limited, so the pseudo-labeled training samples are restricted to documents containing certain keywords. For example, reviews with an extreme polarity (expressing only positive or only negative sentiment) are likely to become pseudo samples, whereas most reviews express mixed polarities. This hinders generalization. WeSTClass outperforms the keyword pretraining method on the Yelp and Amazon datasets thanks to keyword expansion and self-training, but on the IMDB dataset WeSTClass is slightly worse than the keyword pretraining method.
A possible reason is that keyword expansion introduces some harmful keywords and the self-training procedure amplifies the resulting errors. VWS outperforms WeSTClass on the IMDB and Amazon datasets and is comparable on the Yelp dataset.

Hyper-parameters Sensitivity Analysis
We first show F1 scores on the three datasets with varied β in Figure 1(a). The optimal β values differ across the three datasets. At the optimal β value, the standard deviation is much smaller than elsewhere; the regularization makes the models more stable. Our method is not very sensitive on IMDB and Amazon, where the changes are within 2%. It is more sensitive on Yelp, but we can still find a range, e.g., 0.1 to 0.5, where the changes are within 2%. When β keeps growing, the performance on Yelp deteriorates considerably.

Table 3: Extraction rules, dependency relations, examples, and extracted opinion words.

Rule  Dependency relation      Example                   Extracted word
1     adjectival modifier      they have delicious food  delicious
2     nominal subject          the room is big           big
3     direct object            i like it                 like
4     open clausal complement  i feel comfortable        comfortable

The main reason is probably that the opinion word vocabulary in Yelp is much larger than in the other datasets, and hence the Yelp vocabulary likely contains more noisy opinion words. When β is large, this noise may harm performance. We then show F1 scores on the three datasets with varied γ_1 in Figure 1(b). The optimal γ_1 value is the same on all three datasets, i.e., 0.7, and the trends are consistent: the F1 score first increases and then decreases. When γ_1 is small, the constraints are easier to satisfy and performance is bad because more noisy constraints are involved; the regularization may enforce two samples to be similar when in reality they are not. When γ_1 is large, performance is also bad because there are fewer constraints enforcing two samples to be similar; the model could have made use of more constraints.

We show F1 scores on the three datasets with varied γ_2 in Figure 1(c). The optimal γ_2 value is the same on all three datasets, i.e., −0.1, and the trends are consistent: the F1 score first increases and then decreases. When γ_2 is small, performance is bad because there are fewer constraints enforcing two samples to be dissimilar; the model could have made use of more constraints. When γ_2 is large, performance is bad because more noisy constraints are included.

Error Analysis
We show some documents incorrectly predicted by VWS-PR on the three datasets in Table 5. For the first document, the customer emphasizes price, but our method cannot extract opinion words from snippets such as "not at that price" and "for half the price." For the second document, the reviewer loves the movie because he/she loves the basketball player, while thinking that the movie itself does not deserve a high score. Our method detects both positive and negative polarities in this document, so it tends to predict negative, because documents with mixed polarities are mostly negative and the regularization enforces this pattern; this document is clearly different from other documents with mixed opinion words. For the last document, our method cannot extract opinion words from snippets such as "no wrist strap" and "without a place to attach a wrist strap." Our method is good at extracting words from subjective expressions such as "nice light," but not from descriptive expressions such as "no wrist strap"; it fails because no other knowledge source indicates that "no wrist strap" is negative.

Implementation Details
For WeSTClass and VWS, we used the code released by (Meng et al., 2018) and (Zeng et al., 2019), respectively, and followed their preprocessing steps and optimal settings. For VWS and VWS-PR, we pretrain a CNN model using pseudo-labeled samples; after that, the embeddings are frozen, while the rest of the parameters remain trainable. For our method, the hyper-parameter settings of the VWS part follow the optimal settings of VWS.

Table 5: Examples of documents incorrectly predicted by VWS-PR.

(Yelp) Document: "It was good, but not at that price. There are so many other good italian places in the area for half the price." Prediction: good. Ground truth: bad. Extracted words: good.

(IMDB) Document: "I love gheorghe muresan, so i automatically loved this movie. Everything else about it was so so. Billy crystal is a good actor, even if he is annoying." Prediction: bad. Ground truth: good. Extracted words: love, good, annoying.

Related Work
In this section, we review the related work on weakly supervised sentiment analysis.
Using a lexicon is a typical way to perform weakly supervised sentiment analysis. One line of work performs simple assignment, i.e., majority voting, based on the sentiment orientation scores of extracted opinion words. Some methods (Missen and Boughanem, 2009; Tsytsarau et al., 2010) used the sentiment orientation scores in existing lexicons directly and aggregated them within a document to determine polarity. Other methods developed their own semantic orientation estimation algorithms. For example, Turney (2002) first identified phrases in a review and then estimated the semantic orientation of each extracted phrase. The semantic orientation of a given phrase is calculated by comparing its similarity to a positive reference word ("excellent") with its similarity to a negative reference word ("poor"); the sentiment polarity is determined by the average semantic orientation of the phrases extracted from the review. Kamps et al. (2004) used the minimum path distance between a phrase and pivot words ("good" and "bad") in WordNet to estimate the semantic orientation of extracted phrases.
Another line of work involves a learning process when using a lexicon. Li et al. (2009) and Zhou et al. (2014) proposed constrained non-negative matrix tri-factorization approaches to sentiment analysis, using a sentiment lexicon as prior knowledge. In these models, a term-document matrix is approximated by three non-negative factors that specify soft membership of terms and documents in one of k classes. The first factor is a matrix representing knowledge in the word space, i.e., each row represents the posterior probability of a word belonging to the k classes. The second factor is a matrix providing a condensed view of the term-document matrix. The third factor is a matrix representing knowledge in the document space, i.e., each row represents the posterior probability of a document belonging to the k classes. Li et al. (2009) applied a regularization encouraging the first factor to be close to prior knowledge. This regularization is different from ours because it requires prior knowledge such as a predefined lexicon, which demands a lot of human effort. Zhou et al. (2014) applied a regularization based on the intuition that if two documents are sufficiently close to each other, they tend to share the same sentiment polarity. The intuition of this regularization is similar to ours, but when comparing document similarity, they use textual similarity (e.g., cosine similarity of bags of words) rather than similarity over opinion words. Moreover, because their regularization is applied within a matrix factorization framework, it is not straightforward to fit into neural network based models.
Using keywords is another way to perform weakly supervised sentiment analysis. Meng et al. (2018) leveraged keywords to generate pseudo-labeled samples for model pretraining, and then performed self-training on unlabeled data for model refinement. Possible improvements in this direction include investigating more advanced keyword expansion techniques to generate better pseudo-labeled samples and developing more advanced self-training algorithms. LOTClass (Meng et al., 2020), a work parallel to ours, fine-tuned a masked language model to generate relevant words that can replace label names such as "good" and "bad," and performed self-training on unlabeled data for model refinement. Fine-tuning in LOTClass can be viewed as an advanced keyword expansion process using language models. VWS (Zeng et al., 2019) used target-opinion word pairs as the supervision signal. Its objective function is to predict an opinion word given a target word. By introducing a latent variable (the sentiment polarity), VWS learns a well-approximated posterior distribution via optimizing the evidence lower bound. The posterior probability here is the probability of a possible polarity (e.g., positive or negative) given the text representation.

Conclusion
We propose a posterior regularization framework for VWS sentiment analysis to better control the posterior distribution. The intuition behind the posterior regularization is that if the opinion words extracted from two documents are semantically similar (dissimilar), the posterior distributions of the two documents should be similar (dissimilar). Our experiments show that the posterior regularization improves VWS and makes the performance more stable.