Is “hot pizza” Positive or Negative? Mining Target-aware Sentiment Lexicons

Modelling a word’s polarity in different contexts is a key task in sentiment analysis. Previous works mainly focus on domain dependencies and assume that a word’s sentiment is invariant within a specific domain. In this paper, we relax this assumption by binding a word’s sentiment to its collocation words instead of domain labels. This finer view of sentiment contexts is particularly useful for identifying commonsense sentiments expressed by neutral words such as “big” and “long”. Given a target (e.g., an aspect), we propose an effective “perturb-and-see” method to extract the sentiment words modifying it from large-scale datasets. The reliability of the obtained target-aware sentiment lexicons is extensively evaluated both manually and automatically. We also show that a simple application of the lexicon achieves highly competitive performance on the unsupervised opinion relation extraction task.


Introduction
Sentiments of words can be subtle. We often use the same word to express different emotions in different contexts. "Hot", for example, suggests a negative sentiment when commenting on computer hardware and a positive sentiment when commenting on a pizza, even though the word alone has no general orientation. In these situations, it is the composition of a word, its context, and commonsense knowledge that carries an opinion. Automatically detecting such context-dependent sentiments would both strengthen our understanding of implicit opinions in language and improve existing sentiment analysis models, which is the main topic of this work.
To handle shifts of word sentiment, prior works studied how to adapt existing sentiment lexicons to new domains (Hamilton et al., 2016; Xing et al., 2019). By modeling differences and similarities of text topics, they can detect new sentiments of words as the domain changes. The basic assumption of those domain-level sentiment lexicons is that a word keeps a consistent sentiment within a domain. This assumption, however, might be too strong for fine-grained analyses of text sentiments: words (especially neutral words such as "long" and "fast") could exhibit different orientations even in the same domain (Figure 1). To collect more detailed information about a sentiment, another branch of works (aspect-based sentiment analysis (Pontiki et al., 2014; Zhou et al., 2020a,b) and opinion relation extraction (Sun et al., 2017)) attempts to answer "who expresses what opinion on which target" for opinion-bearing texts. Existing solutions rely heavily on manual annotations and linguistic rules, which are either hard to scale up or hard to make complete.

Figure 1: Visualization of real-world commonsense sentiment of "hot" and "long" extracted by our framework. Red and blue indicate the targets in restaurant and electronic domains, respectively.
In this work, we study the task of extracting target-aware sentiment lexicons. An entry of such a lexicon is a pair of a sentiment word and a target word whose collocation expresses a sentiment. It improves existing domain-dependent lexicons by describing opinions more concretely and accurately. Departing from approaches adopted in existing aspect-based analyses, we aim to build context-aware lexicons while minimizing the requirement of annotations (e.g., using only document-level sentiment labels) and the errors from handcrafted patterns. Our method starts from a target word (e.g., an aspect in product reviews) and extracts sentiment words from its local context. The main strategy is to perturb context words and see how the sentiment of the target word changes: words with high influence on the target's sentiment have a high probability of forming a collocation with the target. We accomplish this by observing the behaviour of a well-trained document-level sentiment classifier when we change the contexts of the target word. Two types of perturbations are examined: discrete perturbation, which only requires a black-box classifier, and continuous perturbation, which requires network gradients. We collect evidence for each candidate pair on large datasets to ensure the reliability of the final lexicon. Finally, the polarity of a lexicon entry can also be obtained by querying the sentiment classifier.
On two online product review domains (electronic and restaurant), we evaluate the extracted target-aware lexicons both manually and automatically. Quantitative and qualitative results show that the lexicons reasonably reflect common sentiment usage in each domain. As an application, we apply the lexicons to the task of unsupervised opinion relation extraction. The model performs significantly better than the baseline extractor, and is even competitive with a recent supervised model on restaurant reviews. We summarize our main contributions as follows:
• We propose to extend general-purpose opinion lexicons with target constraints, which provides a finer view of word-level sentiments.
• We develop a scalable approach to automatically mine target-aware sentiment lexicons from texts without extensive annotations or elaborate linguistic rules.
• Besides manual evaluations, we propose an automatic way to evaluate the extracted lexicon with downstream tasks.
• We achieve significant improvements on the unsupervised opinion relation extraction task with the help of the new lexicons.

Definitions and the Task
Let $d$ be a document with sentences $s_1, s_2, \ldots, s_{|d|}$ and $y \in Y$ be the sentiment label of $d$. Given a set $D$ of such labelled documents and a target word $t$ (e.g., screen, pizza), our task is to extract target-aware opinion words of $t$ using only document-level sentiment labels. Precisely, we aim to output a set of triples $(t, o, p)$, where $o$ is an opinion word commonly used to comment on target $t$ and $p \in \mathbb{R}^{|Y|}$ is the distribution of its sentiment orientation.

Figure 2: The process of distant supervision.
We develop the lexicon extractor in three steps. First, we build an approximate target-level sentiment classifier (Section 3) using document-level sentiment labels. Second, for each sentence $s$ containing target $t$, we calculate how important a word $w \in s$ is in helping the classifier correctly predict the polarity of $s$ (Sections 4.1 and 4.2). We aggregate the scores of $w$ over all its occurrences to get its confidence of being an opinion word of $t$. Finally, we derive the polarity of $w$ by querying the classifier with template sentences (Section 5).

Approximating Target-oriented Opinion
To identify target-aware opinion words, our key approach is to inspect how the opinion of a target changes when its context words change. Hence, it is crucial to know the polarity of a target in documents. However, annotations in $D$ are document-level: for a document, its sentiment label expresses the overall sentiment for all targets in the document, rather than for a specific one. For example, the restaurant review in Figure 2 talks about 5 targets, each of which is commented on by different opinion words with different polarities. In one of our datasets, 93% of documents contain multiple sentences (6 on average), and more than 82% contain multiple targets (7 on average). Therefore, directly using document-level sentiment labels could be inappropriate for target-level analyses. On the other hand, it is quite expensive to annotate target-level sentiments, and existing datasets are far from enough for a robust commonsense opinion extractor.
To deal with this problem, we borrow the idea of distant supervision (Mintz et al., 2009): if a document is labelled as positive, at least one sentence (target) in it is positive. By seeing a large number of positive documents, a classifier may be able to generalize patterns of their positive sentences, and thus may help find sentence-level (target-level) opinions. Here we simply build a document-level sentiment classifier and apply it to sentences to get pseudo target-level sentiment labels (for simplicity, we assume one sentence contains one target). Advanced distant supervision models could also be applied, but we find this simple method performs quite well in our experiments.
To build the sentiment classifier, we fine-tune BERT (Devlin et al., 2019) on $D$ to encode domain-specific semantics and augment it with a sentiment prediction task to encode sentiment information. For a document $d$, we feed its word sequence into BERT and obtain a vector representation $\mathbf{d} = \mathrm{BERT}(d)$, then we apply a softmax operator on $\mathbf{d}$ to get the probability of its sentiment,

$$P(y \mid d) = \mathrm{softmax}(W_c \mathbf{d} + b_c),$$

where $W_c$, $b_c$ are new parameters for the sentiment classification task. The loss function is the cross-entropy between the predicted probability and the true label,

$$\mathcal{L} = -\sum_{(d, y) \in D} \log P(y \mid d).$$

For each sentence $s$ containing $t$, we apply the above classifier to predict the pseudo sentiment label $y_p$ of $s$. In the following sections, we rely on the set $S_t = \{(s, y_p) \mid t \in s\}$ to extract target-aware opinion words of $t$.
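As a rough illustration of how the set S_t might be assembled, the sketch below splits documents into sentences, keeps those mentioning the target, and labels each with the classifier's most probable class. The `classify` callable, the naive period-based sentence splitter, and the keyword-based `toy_classify` are assumptions made for illustration only; in the paper the classifier is the fine-tuned BERT model.

```python
from typing import Callable, Dict, List, Tuple

Distribution = Dict[str, float]


def build_pseudo_labelled_set(
    documents: List[str],
    target: str,
    classify: Callable[[str], Distribution],
) -> List[Tuple[str, str]]:
    """Return S_t: (sentence, pseudo label) pairs for sentences containing the target."""
    pairs = []
    for doc in documents:
        # Naive split on periods; a real pipeline would use a proper sentence splitter.
        for sentence in (s.strip() for s in doc.split(".")):
            if sentence and target in sentence.lower().split():
                probs = classify(sentence)
                pseudo_label = max(probs, key=probs.get)  # most probable class
                pairs.append((sentence, pseudo_label))
    return pairs


def toy_classify(sentence: str) -> Distribution:
    """Hypothetical stand-in for the document-level classifier applied to one sentence."""
    negative = 0.9 if "cold" in sentence.lower() else 0.2
    return {"negative": negative, "positive": 1.0 - negative}


docs = ["The pizza was cold. The staff were friendly.", "Great pizza and fast service."]
S_t = build_pseudo_labelled_set(docs, "pizza", toy_classify)
```

Only sentences that mention the target are kept, which mirrors the simplifying assumption that one sentence contains one target.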

Importance Scores
We propose two score functions for measuring a context word $w$'s influence on the target-oriented sentiment: one is discrete perturbation, which only requires outputs of the sentiment classifier, and the other is continuous perturbation, which needs network gradients. They are also called model-free and model-based methods, respectively. Both of them are simple and easy to compute given the trained model, and thus suitable for large-scale collective analyses.

Figure 3: The probability that the sentence "The CPU is very hot." is super negative changes from 0.940 to 0.002 when the word "hot" is deleted.

Discrete Perturbation
A well-trained sentiment classifier should correctly capture correlations between sentence words and sentence polarities. Intuitively, an opinion word (of the target) would have high influence on the sentiment distribution P (y p |s). For example, in Figure 3, "hot" is more informative than "The" for predicting the sentence's negative label.
In order to see whether a word $w$ affects $P(y_p \mid s)$, we perturb the sentence $s$ by removing $w$ from it (denoted by $s_{-w}$) and examine the output difference,

$$\sigma_f(w, s) = P(y_p \mid s) - P(y_p \mid s_{-w}).$$

The larger $\sigma_f(w, s)$ is, the more $P(y_p \mid s_{-w})$ changes, and the more important $w$ is for getting the right sentiment label. We use $\sigma_f(w, s)$ as an indicator of target-aware opinion words and aggregate it over $D$. Let $S_t^w \subseteq S_t$ be the set of sentences in which $t$ and $w$ co-occur. We average $\sigma_f(w, s)$ over $S_t^w$ to get the model-free importance score $\sigma_f(w)$,

$$\sigma_f(w) = \frac{P(w \mid t)}{|S_t^w|} \sum_{(s, y_p) \in S_t^w} \sigma_f(w, s).$$

In order to reduce the effect of noise and rare language usage, we take co-occurrence statistics into account: a target-aware opinion word should co-occur with the target often. Therefore, the average score is empirically scaled with the co-occurrence probability $P(w \mid t) = |S_t^w| / |S_t|$.

The score $\sigma_f(w)$ is model-free in the sense that we do not need to know the details of the sentiment classifier and only query the difference of its outputs when the input sentence is perturbed. Hence, though we use the BERT-based classifier here, we can use any other off-the-shelf sentiment classifier (e.g., pre-trained models with different training objectives, multi-task learned classifiers, etc.) to further enrich (or constrain) the score.
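The discrete perturbation score can be sketched with nothing but a black-box probability function. The code below is a minimal illustration under stated assumptions: `toy_prob` is a made-up classifier, the per-sentence score is taken as the signed drop in the pseudo label's probability, and the co-occurrence probability is assumed to be the fraction of the target's sentences that contain the word.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Sentence = List[str]
ProbFn = Callable[[Sentence, str], float]  # returns P(label | tokens)


def deletion_score(tokens: Sentence, index: int, label: str, prob: ProbFn) -> float:
    """sigma_f(w, s): drop in P(label | s) when the word at `index` is removed."""
    perturbed = tokens[:index] + tokens[index + 1:]
    return prob(tokens, label) - prob(perturbed, label)


def model_free_scores(
    S_t: List[Tuple[Sentence, str]], target: str, prob: ProbFn
) -> Dict[str, float]:
    """sigma_f(w): mean deletion score over S_t^w, scaled by |S_t^w| / |S_t| (assumed P(w|t))."""
    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for tokens, label in S_t:
        seen = set()
        for i, word in enumerate(tokens):
            if word == target or word in seen:
                continue  # skip the target itself; score each word once per sentence
            seen.add(word)
            sums[word] += deletion_score(tokens, i, label, prob)
            counts[word] += 1
    total = len(S_t)
    return {w: (counts[w] / total) * (sums[w] / counts[w]) for w in sums}


def toy_prob(tokens: Sentence, label: str) -> float:
    """Hypothetical black-box classifier: the word 'hot' makes a sentence negative."""
    p_negative = 0.9 if "hot" in tokens else 0.1
    return p_negative if label == "negative" else 1.0 - p_negative


S_t = [
    (["the", "cpu", "is", "very", "hot"], "negative"),
    (["the", "cpu", "is", "fast"], "positive"),
]
scores = model_free_scores(S_t, "cpu", toy_prob)
ranked = sorted(scores, key=scores.get, reverse=True)
```

With this toy classifier only "hot" changes the prediction, so it is ranked first, while function words such as "the" receive a score of zero.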

Continuous Perturbation
Besides the discrete perturbation setting, we could also utilize the full classification model to identify target-aware opinion words. In this continuous perturbation setting, we ask the same question of how the sentiment prediction will change when we perturb sentence words. However, instead of perturbing them discretely (i.e., removing a word), we can perform continuous perturbations on word vectors (Goodfellow et al., 2015).
Let $L(y_p, s, \mathbf{w}) = -\log P(y_p \mid s)$ be the loss on sentence $s$, where $\mathbf{w}$ is the word vector of $w$. If we slightly perturb $\mathbf{w}$ to $\mathbf{w}'$ with $\|\mathbf{w}' - \mathbf{w}\| \le \epsilon$, we can bound the absolute change of the loss function using the first-order approximation of $L(y_p, s, \mathbf{w})$,

$$|L(y_p, s, \mathbf{w}') - L(y_p, s, \mathbf{w})| \approx |\nabla_{\mathbf{w}} L(y_p, s, \mathbf{w})^\top (\mathbf{w}' - \mathbf{w})| \le \epsilon \, \|\nabla_{\mathbf{w}} L(y_p, s, \mathbf{w})\|.$$

The magnitude of the gradient norm $\|\nabla_{\mathbf{w}} L(y_p, s, \mathbf{w})\|$ could be a sign of how sensitive the sentiment label is with respect to $w$: to get the right prediction, we prefer not to perturb words with large gradient norms. Therefore, a large gradient norm may also indicate an opinion word of the target. Define

$$\sigma_b(w, s) = \|\nabla_{\mathbf{w}} L(y_p, s, \mathbf{w})\|.$$

As with the model-free score $\sigma_f(w)$, we collect all $\sigma_b(w, s)$ in $S_t^w$ and scale their average with the co-occurrence probability. The model-based score of $w$ is defined as

$$\sigma_b(w) = \frac{P(w \mid t)}{|S_t^w|} \sum_{(s, y_p) \in S_t^w} \sigma_b(w, s).$$

Finally, the computation of both discrete perturbation and continuous perturbation can be done efficiently using automatic differentiation tools. The discrete perturbation setting requires a forward pass of the network, while the continuous perturbation setting needs an additional backward pass. We also note that the "perturb-and-see" strategy behind both scores characterizes the relation between opinion words and the target only through the sentiment label, which is an indirect signal. As a consequence, though the scores could recognize that "big" implies a negative opinion on "battery", they could also identify "not" in "the battery is not big" as an important word for the positive opinion. In practice, we could filter out such cases by rules, but how to explicitly handle semantic composition in importance scores is important future work.
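To make the gradient-norm score concrete, the following sketch computes it for a deliberately tiny classifier. Everything about the model is an assumption for illustration: two-dimensional word vectors, one direction vector per class, and logits defined as a sum of squared projections. Central finite differences stand in for the backward pass that an automatic differentiation library would provide for a real network.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def loss(vectors: List[Vector], label: int, class_vectors: List[Vector]) -> float:
    """-log P(label | s) for a toy classifier with logit_c = sum_i (u_c . w_i)^2."""
    logits = [sum(dot(u, w) ** 2 for w in vectors) for u in class_vectors]
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[label]


def gradient_norm(
    vectors: List[Vector], index: int, label: int,
    class_vectors: List[Vector], h: float = 1e-5,
) -> float:
    """sigma_b(w, s): L2 norm of dL/dw for the word at `index`, via central differences."""
    grad = []
    for k in range(len(vectors[index])):
        plus = [list(v) for v in vectors]
        minus = [list(v) for v in vectors]
        plus[index][k] += h
        minus[index][k] -= h
        diff = loss(plus, label, class_vectors) - loss(minus, label, class_vectors)
        grad.append(diff / (2 * h))
    return math.sqrt(sum(g * g for g in grad))


class_vectors = [[1.0, 0.0], [0.0, 1.0]]  # class 0 = negative, class 1 = positive
words = ["the", "cpu", "is", "hot"]
vectors = [[0.0, 0.0], [0.1, 0.1], [0.0, 0.0], [1.0, 0.0]]  # made-up embeddings
norms = {w: gradient_norm(vectors, i, 0, class_vectors) for i, w in enumerate(words)}
```

In this toy setting "hot" has the largest gradient norm, the target "cpu" a small one, and the zero-vector function words none, which is the ordering the score is meant to surface.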

Polarity Inference
Given the importance scores of words with respect to $t$, we can rank them accordingly and take the top-k words as $t$'s opinion lexicon. As the final step, we determine the polarity of an opinion word $o$. We accomplish this by building template sentences that carry the semantics of "what opinion on which target". These templates are used to probe the sentiment classifier's knowledge of the polarity of $(t, o)$.
Formally, define $T$ to be a set of templates. Each template $\tau \in T$ takes an opinion word and a target as input and outputs a natural language sentence $\tau(t, o)$. Here, we use the following two templates,
• $\tau(t, o)$ = "The t is o." (e.g., $\tau$(battery, big) = "The battery is big.").
By feeding $\tau(t, o)$ into the sentiment classifier, we obtain $P(y_p \mid \tau(t, o))$, and the polarity distribution $p$ of $(t, o)$ is averaged over all templates,

$$p = \frac{1}{|T|} \sum_{\tau \in T} P(y_p \mid \tau(t, o)).$$
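Template probing amounts to filling each template, querying the classifier, and averaging the returned distributions. The sketch below shows this under stated assumptions: only the first template comes from the text, the second template string is a hypothetical placeholder, and `toy_classify` is a made-up classifier standing in for the fine-tuned model.

```python
from typing import Callable, Dict, List

Distribution = Dict[str, float]


def infer_polarity(
    target: str,
    opinion: str,
    templates: List[str],
    classify: Callable[[str], Distribution],
) -> Distribution:
    """Average the classifier's output distribution over all filled templates."""
    distributions = [classify(tpl.format(t=target, o=opinion)) for tpl in templates]
    labels = distributions[0].keys()
    return {y: sum(d[y] for d in distributions) / len(distributions) for y in labels}


TEMPLATES = [
    "The {t} is {o}.",          # template given in the text
    "I think the {t} is {o}.",  # hypothetical second template, for illustration only
]


def toy_classify(sentence: str) -> Distribution:
    """Made-up classifier: 'pizza' sentences lean positive, everything else negative."""
    positive = 0.8 if "pizza" in sentence else 0.2
    if sentence.startswith("I think"):
        positive -= 0.1
    return {"positive": positive, "negative": 1.0 - positive}


p_pizza = infer_polarity("pizza", "hot", TEMPLATES, toy_classify)
p_cpu = infer_polarity("CPU", "hot", TEMPLATES, toy_classify)
```

Averaging over several templates reduces the dependence of the polarity estimate on the wording of any single probe sentence.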

Experimental Results and Analyses
We evaluate the merit of our target-aware sentiment lexicon in this section. We first introduce the experimental setup in Section 6.1. Then, we design detailed experiments to answer the following key questions.
Q1 Can we trust our target-aware sentiment lexicon? To evaluate the quality of the extracted lexicon, we test its performance with both manual evaluation (Section 6.2) and an automatic downstream task (Section 6.3).
Q2 Useful or not? As an application, we apply our lexicon to the unsupervised opinion extraction task in Section 6.4.
Q3 Do we really understand our model? In Section 6.5, to gain insight into the commonsense sentiment mined from the texts, we visualize several real-world examples.

Table 1: The results of human evaluation on $L$ and $L_c$ over electronic and restaurant. DP and CP mean discrete perturbation and continuous perturbation, respectively.

Experimental Setup
We conduct experiments to validate the effectiveness of our approach on two widely different domains, electronic and restaurant, taken from the Amazon dataset and Yelp Challenge 2015. For convenience, we obtain the target set from SemEval'14, SemEval'15, and SemEval'16. The extracted target-aware sentiment lexicon ($L$) can be divided into a target-aware general sentiment lexicon ($L_g$) and a commonsense sentiment lexicon ($L_c$). $L_g$ contains the opinion words in $L$ that appear in a general lexicon, and $L_c$ contains the opinion words in $L$ that do not. Here, we use the general lexicon from Hu and Liu (2004) to filter out the general sentiment words and obtain the commonsense lexicon. This general lexicon contains around 6800 positive and negative opinion words or sentiment words for the English language.
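The split of L into L_g and L_c is a simple membership test against the general lexicon. A minimal sketch follows; the three-word `GENERAL_WORDS` set and the polarity values are made-up stand-ins, not entries from the actual resources.

```python
from typing import List, Set, Tuple

Entry = Tuple[str, str, Tuple[float, float]]  # (target, opinion word, polarity distribution)


def split_lexicon(lexicon: List[Entry], general_words: Set[str]) -> Tuple[List[Entry], List[Entry]]:
    """Return (L_g, L_c): entries whose opinion word is / is not in the general lexicon."""
    general = [e for e in lexicon if e[1] in general_words]
    commonsense = [e for e in lexicon if e[1] not in general_words]
    return general, commonsense


GENERAL_WORDS = {"great", "terrible", "delicious"}  # tiny stand-in for a general lexicon
L = [
    ("pizza", "delicious", (0.05, 0.95)),  # polarity values are illustrative only
    ("pizza", "hot", (0.10, 0.90)),
    ("cpu", "hot", (0.90, 0.10)),
]
L_g, L_c = split_lexicon(L, GENERAL_WORDS)
```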
We adopt BERT base as the basis for all experiments. Adam (Kingma and Ba, 2015) is adopted as the optimizer with learning rate 5e-5 for fine-tuning and sentiment classification.

Human Evaluation
To evaluate the quality of the target-aware sentiment lexicon, we test it through human evaluation. For quantitative evaluation, we sample 50 targets with their top-20 opinion words in each domain to investigate the performance of $L$ and $L_c$. After filtering repeated pairs, we obtain 3122 and 2877 $(t, o)$ pairs for electronic and restaurant, respectively. We ask ten annotators to label them such that each pair is labelled three times, and we obtain the final label through voting. We calculate Krippendorff's alpha coefficient (Krippendorff, 2011) to measure the inter-annotator agreement of the manual annotation. The value is 0.850 and 0.702 for restaurant and electronic, respectively, which indicates high agreement on the labelled data. Table 1 reports the results of the human evaluation. The pointwise mutual information (PMI) measure (Hamilton et al., 2016; Church and Hanks, 1990), applied to each target $t$ with respect to each word $w$, is adopted as the baseline. We adopt precision at top-k (k = 5, 10, and 20) to measure the performance of the methods on both $L$ and $L_c$. From this table, we observe the following. First, both our discrete perturbation and continuous perturbation algorithms perform much better than PMI. Additionally, in the restaurant domain, our model obtains more than 90% precision for $L$. These results indicate that the methods are effective at capturing target-aware sentiment words and commonsense sentiment words. Second, the discrete perturbation method often has higher precision than the continuous perturbation method, but their combination (Discrete+Continuous Perturbation) obtains the best results in most cases. This suggests that the discrete perturbation and continuous perturbation settings may focus on different types of opinion words.
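The two quantities used in this evaluation, the voted label per pair and precision at top-k, can be sketched in a few lines. The annotations and the ranked list below are made-up examples, and the function names are illustrative.

```python
from collections import Counter
from typing import Dict, List


def majority_vote(labels: List[bool]) -> bool:
    """Return the label chosen by most annotators (three binary votes cannot tie)."""
    return Counter(labels).most_common(1)[0][0]


def precision_at_k(ranked_words: List[str], is_correct: Dict[str, bool], k: int) -> float:
    """Fraction of the top-k ranked words that were judged to be valid opinion words."""
    top = ranked_words[:k]
    return sum(1 for w in top if is_correct[w]) / len(top)


# Made-up annotations for candidate opinion words of the target "pizza".
annotations = {
    "hot": [True, True, False],
    "fresh": [True, True, True],
    "the": [False, False, False],
    "very": [False, True, False],
}
gold = {word: majority_vote(votes) for word, votes in annotations.items()}
ranked = ["hot", "fresh", "the", "very"]
p_at_2 = precision_at_k(ranked, gold, 2)
p_at_4 = precision_at_k(ranked, gold, 4)
```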

Downstream Tasks
Besides human evaluation, we also automatically evaluate our commonsense sentiment lexicon $L_c$ with downstream tasks. Here we examine document-level sentiment analysis. In particular, for each domain (electronic and restaurant), we sample 3500 documents which do not contain any general sentiment lexicon words but have obvious opinion orientations ("Original"). Then we perform sentiment classification on the dataset with $L_c$ using two strategies.
Strategy 1 For each sample in "Original", we remove the opinion words which appear in our $L_c$, and test the performance of sentiment classification using a well-trained sentiment classifier (Section 3). Note that we only use the top-100 opinion words so that fewer than five words are deleted. To show the effectiveness of our lexicon, we compare our model with removing words randomly at the same rate (Table 2). We find that removing the words in $L_c$ performs significantly worse than both the original and random removal. This indicates that our method can capture the commonsense opinion words effectively.
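The ablation in Strategy 1 and its random control can be sketched as two small functions. The token list and the two-word lexicon are made-up, and matching the control by the number of removed words is one assumed way of keeping the removal rate the same.

```python
import random
from typing import List, Set


def remove_lexicon_words(tokens: List[str], lexicon_words: Set[str]) -> List[str]:
    """Drop every token that appears in the commonsense lexicon."""
    return [w for w in tokens if w not in lexicon_words]


def remove_random_words(tokens: List[str], n: int, rng: random.Random) -> List[str]:
    """Control condition: drop n randomly chosen token positions."""
    drop = set(rng.sample(range(len(tokens)), min(n, len(tokens))))
    return [w for i, w in enumerate(tokens) if i not in drop]


tokens = "the battery life is long and the screen is big".split()
L_c_words = {"long", "big"}  # made-up commonsense opinion words

ablated = remove_lexicon_words(tokens, L_c_words)
n_removed = len(tokens) - len(ablated)
control = remove_random_words(tokens, n_removed, random.Random(0))
```

Both perturbed versions are then fed to the same classifier, so any gap in accuracy can be attributed to which words were removed rather than how many.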
Strategy 2 We apply our commonsense lexicon as extra knowledge to enhance a sentiment classification model. Here, we study the standard BiLSTM-based classifier: a BiLSTM is used to encode sentences, and the last hidden vector of a sentence is adopted for classification. To inject our extracted lexicon $(t, o, p)$, we concatenate $p$ to the input of the BiLSTM if $t$ and $o$ occur. We sample 1000 and 500 instances from the previous 3500 samples as the training and test sets. To validate the effectiveness of each model component, we also show ablation test results. Table 3 shows the results. We have the following observations.
• Our commonsense lexicon $L_c$ can significantly improve the performance of sentiment classification. $L_c$ + BiLSTM outperforms the basic BiLSTM, while the model with PMI is even worse than BiLSTM. We also find that the results of the discrete perturbation and continuous perturbation methods are similar, and both of them can improve the results of sentiment classification.
• $L_c$ + BiLSTM performs better than the corresponding model without distant supervision, which indicates that our distant supervision can capture the target information effectively. To further verify the effectiveness of distant supervision, we also randomly select 200 samples from the set $S_t$ and evaluate them with three annotators by voting. The accuracy is 80.5% and 82% for 5-class classification over the electronic and restaurant domains. Additionally, 71% and 65% of the samples have polarities different from their document-level labels, and the accuracy on these samples is 80.99% and 81.54% in the electronic and restaurant domains, respectively. These results indicate that our distant supervision can learn the target-oriented sentiment effectively.
• Compared with $L_c$ + BiLSTM -p (which takes whether a word is an opinion word as a feature), $L_c$ + BiLSTM obtains better results. This suggests that polarity inference is a reasonable way to infer the polarities of $(t, o)$ pairs.
Additionally, to investigate the influence of the number of samples, we plot the results with different sample sizes in Figure 4. We find that the fewer the samples, the larger the improvement brought by our commonsense lexicon.
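The lexicon injection used in Strategy 2 can be read as a feature concatenation step before the BiLSTM. The sketch below is one possible reading, not a description of the actual implementation: it appends p to the embedding of an opinion word whose target also occurs in the sentence and, as an assumption, appends zeros to every other token so that all inputs keep the same dimension. The embeddings and polarity values are made-up.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def augment_with_polarity(
    tokens: List[str],
    embeddings: List[Vector],
    lexicon: Dict[Tuple[str, str], Vector],
    num_classes: int,
) -> List[Vector]:
    """Concatenate p to each opinion word's embedding when its target is in the sentence."""
    present = set(tokens)
    augmented = []
    for word, vec in zip(tokens, embeddings):
        extra = [0.0] * num_classes  # assumed padding for words without a lexicon entry
        for (target, opinion), polarity in lexicon.items():
            if word == opinion and target in present:
                extra = list(polarity)
                break
        augmented.append(list(vec) + extra)
    return augmented


tokens = ["the", "battery", "is", "big"]
embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]  # made-up 2-d embeddings
lexicon = {
    ("battery", "big"): [0.9, 0.1],  # [negative, positive], illustrative values
    ("pizza", "big"): [0.2, 0.8],
}
inputs = augment_with_polarity(tokens, embeddings, lexicon, num_classes=2)
```

The entry for ("pizza", "big") is ignored here because "pizza" does not occur in the sentence, which is the target-aware behaviour the lexicon is meant to provide.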

Application (Unsupervised Opinion Extraction)
To answer Q2, we apply our lexicon to unsupervised opinion relation extraction. We test our lexicon on two datasets (https://github.com/NJUNLP/TOWE), electronic and restaurant, released by Fan et al. (2019), who labelled the opinion words towards the given target.
To investigate the performance of the target-aware sentiment lexicon $L$, we perform unsupervised opinion extraction on the whole dataset. Table 4 reports the experimental results. We compare our method with two methods: 1) a rule-based method (Hu and Liu, 2004), which uses distance and POS tags to determine the opinion words; 2) a supervised LSTM proposed by Liu et al. (2015). We use the results reported in Fan et al. (2019) here. From this table, we observe the following. First, our $L$ performs significantly better than the rule-based method even without using any rules or human annotations. Second, our unsupervised method is comparable with the supervised method (e.g., LSTM) on Restaurant. Additionally, we explore the influence of top-k in Figure 5 (a). We find that top-100 is recommended for $L$ in our experiments.
To verify that our method can extract commonsense opinion words accurately, we also evaluate $L_c$ on the samples without general words. From Figure 5 (b), we find that $L_c$ achieves 40% F1 on the restaurant domain. Considering that we do not include any general sentiment words, we think the result is quite promising.
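The text does not spell out the extraction rule, so the sketch below shows one simple reading: for a given target, extract every word of the sentence that appears among the top-k entries of that target's lexicon, then score the prediction with precision, recall, and F1. The lexicon, sentence, and gold labels are made-up.

```python
from typing import Dict, List, Tuple


def extract_opinions(tokens: List[str], target: str, lexicon: Dict[str, List[str]], k: int) -> List[str]:
    """Return sentence words found in the top-k ranked opinion words of the target."""
    top_k = set(lexicon.get(target, [])[:k])
    return [w for w in tokens if w in top_k]


def precision_recall_f1(predicted: List[str], gold: List[str]) -> Tuple[float, float, float]:
    pred, ref = set(predicted), set(gold)
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


lexicon = {"pizza": ["hot", "fresh", "cold", "not"]}  # ranked by importance score (made-up)
tokens = "the pizza was hot and really fresh but not cheap".split()
predicted = extract_opinions(tokens, "pizza", lexicon, k=3)
gold = ["hot", "fresh", "cheap"]
precision, recall, f1 = precision_recall_f1(predicted, gold)
```

The cut-off k trades precision against recall: a small k excludes low-ranked words such as "not" in this example, while a larger k would admit them.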

Case Studies
To gain insight into the commonsense sentiment mined from texts, we show several real-world examples from the electronic and restaurant domains in this section. We present some interesting discoveries from an in-depth analysis as follows.
We first explore the sentiment polarity of different targets with the same opinion word. As shown in Figure 1, we draw the targets with respect to the opinion words "hot" and "long". We obtain the following findings. First, our model can detect the commonsense sentiment in the corpus effectively. For example, our model finds that "hot" is a commonly used collocation for "pizza", "CPU", and "battery", and that it expresses a positive sentiment for "pizza" but a negative sentiment for "CPU" and "battery". Second, domain-dependent sentiment words and their orientations are insufficient, and both the target and the opinion words are essential. For example, "long" has a positive polarity for "battery life" and a negative polarity for "charge", even though both "battery life" and "charge" are in the electronic domain.
The opinion words most related to the given target (top-10) in $L$ and $L_c$ are shown in Table 5. From this table, we obtain the following discoveries. First, our method captures not only the general opinion words but also the commonsense opinion words. Second, as mentioned in Section 4.2, though the scores could recognize that "fast" expresses a positive opinion on "response", they also identify words that are important for sentiment but are not opinion words, such as "no", "not" and "never". In practice, we could filter out such cases by rules, but how to explicitly handle semantic composition in importance scores is important future work.
From Table 5, we observe that $L$ and $L_c$ for different targets are quite different. To investigate whether the commonly used opinion words differ across targets, we compute a score $\mathrm{div}$ over the lexicons, where $T$ is the set of targets in our dataset, $t_k$ is the $k$-th target in $T$ and $L_{t_k}$ denotes the sentiment lexicon of $t_k$. The value of $\mathrm{div}$ is 0.65 and 0.90 (0.89 and 0.96) for $L$ and $L_c$ over restaurant (electronic).
All these indicate that the commonsense lexicon $L_c$ is more diverse than the general lexicon $L_g$ over different targets. In addition, the commonly used general opinion words and commonsense sentiment words are different for different targets.

Table 5: We list top-10 opinion words of several targets for two domains: electronic and restaurant. The markers + and * represent positive and negative sentiment, respectively.

Related Work
Domain adaptation has been studied for a long time in the field of sentiment analysis (Choi and Cardie, 2009; Cambria et al., 2018; Zhou et al., 2020c). We mainly summarize the related work on lexicon domain adaptation, which aims to build a domain-specific sentiment lexicon (Ofek et al., 2016; Vo and Zhang, 2016; Hamilton et al., 2016). Hamilton et al. (2016) inferred the orientation of words from general opinion words by building a graph for each domain. Xing et al. (2019) judged word polarity via a document-level sentiment classifier. However, their approach is time-consuming because the model has to be retrained for each word after its polarity is randomly changed. Moreover, these existing methods mainly focus on the domain level, while the sentiment polarities of some words depend on their opinion targets (Liu and Zhang, 2012). It is essential to predict sentiment at the target level by integrating both target and opinion words. The work most related to ours is Zhao et al. (2012), who focused on inferring the polarity of a binary tuple of a polarity word and a target via a search engine, while target-aware opinion word extraction is not fully explored.
To take the target into account, prior work proposed constructing a target-specific sentiment lexicon. However, both NLP preprocessing pipelines (e.g., parsing, POS tagging) and linguistic rules are integrated into that algorithm, and available resources such as general sentiment lexicons and thesauri are also used, so it is not easy to apply to different domains. Different from them, we first extract the target-aware commonsense opinion words via pre-trained models, which have learned rich commonsense knowledge hidden in human languages. Then, we predict the sentiment polarity of a target and opinion word pair through a probing strategy. We focus on building context-aware lexicons by minimizing the requirement of annotations and handcrafted external resources, mining aspect-aware commonsense sentiment from texts without extensive annotations or elaborate linguistic rules.
Pre-trained models (e.g., ELMo (Peters et al., 2018), GPT (Radford et al., 2019), BERT (Devlin et al., 2019)) have achieved great success in NLP recently. By exploring a large number of open domain texts, pre-trained models are able to encode rich semantic information hidden in human languages and thus provide new powerful tools for knowledge mining and extraction (Davison et al., 2019;Petroni et al., 2019). Since the commonsense opinions are closely related to human commonsense and background knowledge, we adopt pre-trained language models to mine the commonsense sentiment from texts automatically.
Gradient-based methods (Goodfellow et al., 2015) have been widely applied in computer vision and NLP (Zeiler and Fergus, 2014; Liang et al., 2018). The gradient-based approach is also used to understand the decisions of text classification models at the token level (Li et al., 2016; Alikaniotis et al., 2016). In addition, Rei et al. (2018) adopted a gradient-based approach to detect the important tokens in a sentence via the sentence-level label. In this paper, we design a continuous perturbation algorithm to discover the target-aware opinion words using the gradient.

Conclusion
In this paper, we propose a framework for automatic target-aware sentiment mining from texts without manual target-level annotations or linguistic rules. We evaluate the proposed framework on two large-scale online review domains, restaurant and electronic, with both manual checking and automatic downstream tasks. We also achieve significant improvements by applying the opinion lexicon to the task of unsupervised opinion relation extraction. To gain insight into the commonsense sentiment mined from the texts, we visualize several real-world examples and analyze them in depth. The extensive experimental results demonstrate the effectiveness of the framework in building a target-aware sentiment lexicon.