Product Feature Mining: Semantic Clues versus Syntactic Constituents

Product feature mining is a key subtask in fine-grained opinion mining. Previous works often use syntax constituents in this task. However, syntax-based methods can only use discrete contextual information, which may suffer from data sparsity. This paper proposes a novel product feature mining method which leverages lexical and contextual semantic clues. Lexical semantic clue verifies whether a candidate term is related to the target product, and contextual semantic clue serves as a soft pattern miner to find candidates, which exploits semantics of each word in context so as to alleviate the data sparsity problem. We build a semantic similarity graph to encode lexical semantic clue, and employ a convolutional neural model to capture contextual semantic clue. Then Label Propagation is applied to combine both semantic clues. Experimental results show that our semantics-based method significantly outperforms conventional syntax-based approaches, which not only mines product features more accurately, but also extracts more infrequent product features.


Introduction
In recent years, opinion mining has helped customers make informed purchase decisions. However, with the rapid growth of e-commerce, customers are no longer satisfied with the overall opinion ratings provided by traditional sentiment analysis systems. The detailed functions or attributes of products, which are called product features, receive more attention. Nevertheless, a product may have thousands of features, which makes it impractical for a customer to investigate them all. Therefore, mining product features automatically from online reviews has been shown to be a key step for opinion summarization (Hu and Liu, 2004; Qiu et al., 2009) and fine-grained sentiment analysis (Jiang et al., 2011; Li et al., 2012).
Previous works often mine product features via syntactic constituent matching (Popescu and Etzioni, 2005; Qiu et al., 2009; Zhang et al., 2010). The basic idea is that reviewers tend to comment on product features in similar syntactic structures. Therefore, it is natural to mine product features by using syntactic patterns. For example, in Figure 1, the upper box shows a dependency tree produced by Stanford Parser (de Marneffe et al., 2006), and the lower box shows a common syntactic pattern from Zhang et al. (2010), where <feature/NN> is a wildcard to be fit in reviews and NN denotes the required POS tag of the wildcard. Usually, the product name mp3 is specified, and when screen matches the wildcard, it is likely to be a product feature of mp3.
Figure 1: An example of syntax-based product feature mining procedure. The word screen matches the wildcard <feature/NN>. Therefore, screen is likely to be a product feature of mp3.
Generally, such syntactic patterns extract product features well, but they still have some limitations. For example, the product-have-feature pattern may fail to find the fm tuner in a very similar case in Example 1(a), where the product is mentioned by using player instead of mp3. Similarly, it may also fail on Example 1(b), just with have replaced by support. In essence, a syntactic pattern is a kind of one-hot representation of the context, which can only use partial and discrete features, such as some key words (e.g., have) or shallow information (e.g., POS tags). Therefore, such a representation often suffers from the data sparsity problem (Turian et al., 2010).
One possible solution for this problem is using a more general pattern such as NP-VB-feature, where NP represents a noun or noun phrase and VB stands for any verb. However, this pattern becomes so general that it may find many irrelevant cases such as the one in Example 1(c), which is not talking about the product. Consequently, it is very difficult for a pattern designer to balance precision against generalization.
To solve the problems stated above, we argue that deeper semantics of contexts should be exploited. For example, we can try to automatically discover that the verb have indicates a part-whole relation (Zhang et al., 2010) and support indicates a product-function relation, so that both sth. have and sth. support suggest that terms following them are product features, where sth. can be replaced by any term that refers to the target product (e.g., mp3, player, etc.). This is called contextual semantic clue. Nevertheless, using contexts alone is not sufficient. As in Example 1(d), the word flaws follows mp3 have, but it is not a product feature. Thus, a noise term may be extracted even with high contextual support. Therefore, we should also verify whether a candidate is really related to the target product. We call this lexical semantic clue.
This paper proposes a novel bootstrapping approach for product feature mining, which leverages both semantic clues discussed above. Firstly, some reliable product feature seeds are automatically extracted. Then, based on the assumption that terms that are more semantically similar to the seeds are more likely to be product features, a graph which measures semantic similarities between terms is built to capture lexical semantic clue. At the same time, a semi-supervised convolutional neural model (Collobert et al., 2011) is employed to encode contextual semantic clue. Finally, the two kinds of semantic clues are combined by a Label Propagation algorithm.
In the proposed method, words are represented by continuous vectors, which capture latent semantic factors of the words (Turian et al., 2010). The vectors can be trained in an unsupervised manner on large-scale corpora, and words with similar semantics will have similar vectors. This makes our method less sensitive to lexical variation, so that the data sparsity problem can be alleviated. The contributions of this paper include:
• It uses semantics of words to encode contextual clues, which exploits deeper level information than syntactic constituents. As a result, it mines product features more accurately than syntax-based methods.
• It exploits semantic similarity between words to capture lexical clues, which is shown to be more effective than co-occurrence relation between words and syntactic patterns. In addition, experiments show that the semantic similarity has the advantage of mining infrequent product features, which is crucial for this task. For example, one may say "This hotel has low water pressure", where low water pressure is seldom mentioned, but may matter greatly to some customers.
• We compare the proposed semantics-based approach with three state-of-the-art syntax-based methods. Experiments show that our method achieves significantly better results.
The rest of this paper is organized as follows. Section 2 introduces related work. Section 3 describes the proposed method in detail. Section 4 gives the experimental results. Lastly, we conclude this paper in Section 5.

Related Work
Hu and Liu (2004) did pioneering work on the product feature mining task. However, the association rules they used may potentially introduce many noise terms. Based on the observation that product features are often commented on by similar syntactic structures, it is natural to use patterns to capture common syntactic constituents around product features. Popescu and Etzioni (2005) designed some syntactic patterns to search for product feature candidates and then used Pointwise Mutual Information (PMI) to remove noise terms. Qiu et al. (2009) proposed eight heuristic syntactic rules to jointly extract product features and sentiment lexicons, where a bootstrapping algorithm named Double Propagation was applied to expand a given seed set. Zhang et al. (2010) improved the work of Qiu et al. (2009) by adding more feasible syntactic patterns, and the HITS algorithm (Kleinberg, 1999) was employed to rank candidates. Moghaddam and Ester (2010) extracted product features by automatic opinion pattern mining. Zhuang et al. (2006) used various syntactic templates from an annotated movie corpus and applied them to supervised movie feature extraction. Wu et al. (2009) proposed phrase-level dependency parsing for mining aspects and features of products.
As discussed in the first section, syntactic patterns often suffer from data sparsity. Furthermore, most pattern-based methods rely on term frequency, which limits their ability to find infrequent but important product features. A recent study (Xu et al., 2013) extracted infrequent product features by a semi-supervised classifier, which used word-syntactic pattern co-occurrence statistics as features for the classifier. However, this kind of feature is still sparse for infrequent candidates. Our method adopts a semantic word representation model, which can train dense features in an unsupervised manner on a very large corpus. Thus, the data sparsity problem can be alleviated.

The Proposed Method
We propose a semantics-based bootstrapping method for product feature mining. Firstly, some product feature seeds are automatically extracted. Then, a semantic similarity graph is created to capture lexical semantic clue, and a Convolutional Neural Network (CNN) (Collobert et al., 2011) is trained in each bootstrapping iteration to encode contextual semantic clue. Finally we use Label Propagation to find some reliable new seeds for the training of the next bootstrapping iteration.

Automatic Seed Generation
The seed set consists of positive labeled examples (i.e., product features) and negative labeled examples (i.e., noise terms). Intuitively, popular product features are frequently mentioned in reviews, so they can be extracted by simply mining frequently occurring nouns (Hu and Liu, 2004). However, this strategy will also find many noise terms (e.g., commonly used nouns like thing, one, etc.). To produce high quality seeds, we employ a Domain Relevance Measure (DRM) (Jiang and Tan, 2010), which combines term frequency with a domain-specific measuring metric called Likelihood Ratio Test (LRT) (Dunning, 1993). Let $\lambda(t)$ denote the LRT score of a product feature candidate $t$, where $k_1$ and $k_2$ are the frequencies of $t$ in the review corpus $R$ and a background corpus $B$, $n_1$ and $n_2$ are the total numbers of terms in $R$ and $B$, $p = (k_1 + k_2)/(n_1 + n_2)$, $p_1 = k_1/n_1$ and $p_2 = k_2/n_2$. Then a modified DRM is proposed. All nouns in $R$ are ranked by $DRM(t)$ in descending order, and the top $N$ nouns are taken as the positive example set $V_s^+$. On the other hand, Xu et al. (2013) show that a set of general nouns seldom appear to be product features. Therefore, we employ their General Noun Corpus to create the negative example set $V_s^-$, where the $N$ most frequent terms are selected. Besides, it is guaranteed that $V_s^+ \cap V_s^- = \emptyset$: conflicting terms are taken as negative examples.
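The LRT formula itself is not reproduced in this text. As an illustration only, the following sketch computes the standard log-likelihood ratio statistic of Dunning (1993) from the quantities defined above, using binomial likelihoods. The factor of 2, the sign convention, and the function names are assumptions; the paper's exact normalization and its modified DRM are not recoverable here.

```python
import math


def _log_binom_likelihood(p: float, k: int, n: int) -> float:
    """Return log(p^k * (1 - p)^(n - k)), treating 0 * log(0) as 0."""
    def xlogy(x: float, y: float) -> float:
        return 0.0 if x == 0 else x * math.log(y)

    return xlogy(k, p) + xlogy(n - k, 1.0 - p)


def lrt_score(k1: int, n1: int, k2: int, n2: int) -> float:
    """Log-likelihood ratio for term counts k1 (of n1) in R and k2 (of n2) in B.

    Assumption: the usual 2 * (log L(alternative) - log L(null)) form, which is
    zero when the term is equally frequent in both corpora.
    """
    p = (k1 + k2) / (n1 + n2)   # pooled rate under the null hypothesis
    p1 = k1 / n1                # rate in the review corpus R
    p2 = k2 / n2                # rate in the background corpus B
    return 2.0 * (
        _log_binom_likelihood(p1, k1, n1)
        + _log_binom_likelihood(p2, k2, n2)
        - _log_binom_likelihood(p, k1, n1)
        - _log_binom_likelihood(p, k2, n2)
    )
```

A term that is much more frequent in $R$ than in $B$ receives a large score, while a term with the same relative frequency in both corpora scores close to zero.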

Capturing Lexical Semantic Clue in a Semantic Similarity Graph
To capture lexical semantic clue, each word is first converted into a word embedding, a continuous vector in which each dimension's value corresponds to a semantic or grammatical interpretation (Turian et al., 2010). Because learning large-scale word embeddings is very time-consuming (Collobert et al., 2011), we employ a faster method, the Skip-gram model (Mikolov et al., 2013).

Learning Word Embedding for Semantic Representation
Given a sequence of training words $W = \{w_1, w_2, ..., w_m\}$, the goal of the Skip-gram model is to learn a continuous vector space $EB = \{e_1, e_2, ..., e_m\}$, where $e_i$ is the word embedding of $w_i$. The training objective is to maximize the average log probability of using word $w_t$ to predict a surrounding word $w_{t+j}$, where $c$ is the size of the training window. Basically, $p(w_{t+j} \mid w_t; e_t)$ is defined by a softmax over the vocabulary, where $e'_i$ is an additional training vector associated with $e_i$. This basic formulation is impractical because its computational cost is proportional to $m$. A hierarchical softmax approximation can be applied to reduce the computational cost to $\log_2(m)$; see Morin and Bengio (2005) for details.
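The two Skip-gram equations are missing from this text. The sketch below follows the standard formulation of Mikolov et al. (2013), in which $p(w_{t+j} \mid w_t; e_t)$ is a full softmax of the dot products $e'^{\top}_{w} e_t$ and the objective averages the log probabilities over a window of size $c$. The dictionaries `emb` and `out_emb` and both function names are hypothetical, and the full softmax is exactly the expensive variant that hierarchical softmax is meant to replace.

```python
import math


def skipgram_prob(center, context, emb, out_emb):
    """p(context | center) as a full softmax over the vocabulary of out_emb."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    scores = {w: dot(out_emb[w], emb[center]) for w in out_emb}
    top = max(scores.values())  # subtract the max for numerical stability
    z = sum(math.exp(s - top) for s in scores.values())
    return math.exp(scores[context] - top) / z


def avg_log_prob(words, emb, out_emb, c):
    """Average over positions t of sum_{-c <= j <= c, j != 0} log p(w_{t+j} | w_t)."""
    total = 0.0
    for t, center in enumerate(words):
        for j in range(-c, c + 1):
            if j != 0 and 0 <= t + j < len(words):
                total += math.log(skipgram_prob(center, words[t + j], emb, out_emb))
    return total / len(words)
```

Each call to `skipgram_prob` touches every vocabulary entry, which is why the cost of this basic form grows linearly with the vocabulary size.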
To alleviate the data sparsity problem, $EB$ is first trained on a very large corpus $C$ (see footnote 3), and then fine-tuned on the target review corpus $R$. In particular, for phrasal product features, the statistics-based method of Zhu et al. (2009) is used to detect noun phrases in $R$. Then, an Unfolding Recursive Autoencoder (Socher et al., 2011) is trained on $C$ to obtain embedding vectors for noun phrases. In this way, semantics of infrequent terms in $R$ can be well captured. Finally, the phrase-based Skip-gram model of Mikolov et al. (2013) is applied on $R$.

Building the Semantic Similarity Graph
Lexical semantic clue is captured by measuring semantic similarity between terms. The underlying motivation is that if we have known some product feature seeds, then terms that are more semantically similar to these seeds are more likely to be product features. For example, if screen is known to be a product feature of mp3, and lcd is of high semantic similarity with screen, we can infer that lcd is also a product feature. Analogously, terms that are semantically similar to negative labeled seeds are not product features.
Word embedding naturally meets the demand above: words that are more semantically similar to each other are located closer in the embedding space (Collobert et al., 2011). Therefore, we can use the cosine similarity between two embedding vectors as the semantic similarity measure. Thus, our method does not rely on term frequency to rank candidates. This could potentially improve the ability of mining infrequent product features.
Footnote 3: Wikipedia (http://www.wikipedia.org) is used in practice.
Formally, we create a semantic similarity graph $G = (V, E, W)$, where $V = V_s \cup V_c$ is the vertex set, which contains the labeled seed set $V_s$ and the unlabeled candidate set $V_c$; $E$ is the edge set, which connects every vertex pair $(u, v)$ with $u, v \in V$; and $W = \{w_{uv} = \cos(EB_u, EB_v)\}$ is a function which associates a weight with each edge.
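A minimal sketch of this graph construction, assuming embeddings are given as a dictionary from term to a non-zero vector (the function names and the edge-dictionary layout are illustrative choices, not taken from the paper):

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def build_similarity_graph(embeddings):
    """Return {(u, v): w_uv} for every unordered pair of terms (fully connected)."""
    return {
        (u, v): cosine(embeddings[u], embeddings[v])
        for u, v in combinations(sorted(embeddings), 2)
    }
```

With toy vectors in which lcd lies close to screen, the edge between them gets a weight near 1, so label information from the seed screen reaches lcd more strongly than it reaches unrelated terms.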

Encoding Contextual Semantic Clue Using Convolutional Neural Network
The CNN is trained on each occurrence of seeds that is found in review texts. Then for a candidate term t, the CNN classifies all of its occurrences. Since seed terms tend to have high frequency in review texts, only a few seeds will be enough to provide plenty of occurrences for the training.

The Architecture of the Convolutional Neural Network
The architecture of the Convolutional Neural Network is shown in Figure 2. For a product feature candidate $t$ in sentence $s$, every consecutive subsequence $q_i$ of $s$ of length $l$ that contains $t$ is fed to the CNN. For example, as in Figure 2, if $t$ = {screen} and $l = 3$, there are three inputs: $q_1$ = [the, ipod, screen], $q_2$ = [ipod, screen, is], $q_3$ = [screen, is, impressive]. In particular, $t$ is replaced by the token "*PF*" to remove its lexical influence. To get the output score, $q_i$ is first converted into a concatenated vector $x_i = [e_1; e_2; ...; e_l]$, where $e_j$ is the word embedding of the $j$-th word. In this way, the CNN serves as a soft pattern miner: since words that have similar semantics have similar low-dimensional embedding vectors, the CNN is less sensitive to lexical variation. The network is computed layer by layer, where $y^{(i)}$ is the output score of the $i$-th layer and $b^{(i)}$ is the bias of the $i$-th layer; $W^{(1)} \in \mathbb{R}^{h \times (nl)}$ and $W^{(3)} \in \mathbb{R}^{2 \times h}$ are parameter matrices, where $n$ is the dimension of word embedding and $h$ is the number of nodes in the hidden layer.
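The window extraction described above can be sketched as follows. This is an illustration under stated assumptions: the candidate is a single token at a known position, no padding is added at sentence boundaries (so a sentence shorter than the window yields no input), and `candidate_windows` is a hypothetical name.

```python
def candidate_windows(tokens, index, l, mask="*PF*"):
    """Return every length-l window of `tokens` that covers position `index`.

    The candidate token is replaced by `mask` so that the classifier sees only
    its context, not the word itself.
    """
    masked = list(tokens)
    masked[index] = mask
    first = max(0, index - l + 1)          # leftmost start that still covers index
    last = min(index, len(tokens) - l)     # rightmost start that fits in the sentence
    return [masked[start:start + l] for start in range(first, last + 1)]
```

For the sentence [the, ipod, screen, is, impressive] with the candidate at position 2 and $l = 3$, this yields the three windows $q_1$, $q_2$, $q_3$ of the example, with screen replaced by "*PF*".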
In conventional neural models, the candidate term $t$ is placed in the center of the window. However, from Example 2, when $l = 5$, we can see that the best windows should be the bracketed texts (intuitively, the windows should contain mp3, which is strong evidence for finding the product feature), where $t$ = {screen} is at the boundary. Therefore, we use Equ. 6 to formulate a max-convolutional layer, which aims to enable the CNN to find more evidence in contexts than conventional neural models.
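The layer equations are not reproduced in this text, so the following forward pass is only a plausible reading of the description: a shared first layer applied to every window, an element-wise maximum over windows (the max-convolutional layer), and a linear output layer producing two scores. The tanh activation is an assumption, as are the function names; only the matrix shapes $h \times (nl)$ and $2 \times h$ come from the text.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def forward(windows, W1, b1, W3, b3):
    """Score one candidate from its window vectors x_i (each of length n * l).

    W1 has shape h x (n * l), W3 has shape 2 x h.
    """
    # Layer 1: the same weights are applied to every window (assumed tanh).
    hidden = [
        [math.tanh(a + b) for a, b in zip(matvec(W1, x), b1)]
        for x in windows
    ]
    # Layer 2: element-wise max over windows keeps the strongest evidence.
    pooled = [max(column) for column in zip(*hidden)]
    # Layer 3: linear map to the two output scores.
    return [a + b for a, b in zip(matvec(W3, pooled), b3)]
```

Because of the max over windows, the output does not depend on the order of the windows, and a single informative window (for instance one containing mp3) can dominate the decision wherever the candidate sits inside it.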

Training
Let $\theta = \{EB, W^{(\cdot)}, b^{(\cdot)}\}$ denote all the trainable parameters. The softmax function is used to convert the output score of the CNN to a probability, where $X$ is the input set for term $t$, and $C = \{0, 1\}$ is the label set representing product feature and non-product feature, respectively. To train the CNN, we first use $V_s$ to collect each occurrence of the seeds in $R$ to form a training set $T_s$. Then, the training criterion is to minimize cross-entropy over $T_s$, where $\delta_i$ is the binomial target label distribution for one entry. The backpropagation algorithm with mini-batch stochastic gradient descent is used to solve this optimization problem. In addition, some useful tricks can be applied during training. The weight matrices $W^{(\cdot)}$ are initialized by normalized initialization (Glorot and Bengio, 2010). $W^{(1)}$ is pre-trained by an autoencoder (Hinton, 1989) to capture semantic compositionality. To speed up learning, a momentum method is applied.
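A small sketch of the softmax and cross-entropy criterion named above. Averaging the loss over the training entries and representing each target as a single label index are assumptions made for the illustration; the paper's exact objective is not shown in this text.

```python
import math


def softmax(scores):
    """Convert raw output scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def cross_entropy(outputs, labels):
    """Mean negative log-probability of the target label for each entry.

    outputs: list of 2-dimensional score vectors from the network.
    labels: list of target label indices (0 or 1) into those vectors.
    """
    losses = [-math.log(softmax(o)[y]) for o, y in zip(outputs, labels)]
    return sum(losses) / len(losses)
```

The loss for an uninformative output such as [0, 0] is log 2, and it decreases as the score of the target label grows relative to the other score, which is what gradient descent on this criterion drives toward.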

Combining Lexical and Contextual Semantic Clues by Label Propagation
We propose a Label Propagation algorithm to combine both semantic clues in a unified process. Each term $t \in V$ is assumed to have a label distribution $L_t = (p_t^+, p_t^-)$, where $p_t^+$ denotes the probability of the candidate being a product feature and, conversely, $p_t^- = 1 - p_t^+$. The classification results of the CNN, which encode contextual semantic clue, serve as the prior knowledge, where $(r_t^+, r_t^-)$ is estimated from the CNN's decisions: $count^+(t)$ is the number of occurrences of term $t$ that are classified as positive by the CNN, and $count^-(t)$ represents the negative count. Label Propagation is applied to propagate the prior knowledge distribution $I$ to the product feature distribution $L$ via the semantic similarity graph $G$, so that a product feature candidate is determined by exploring its semantic relations to all of the seeds and other candidates globally. We propose an adapted version of the random-walk view of the Adsorption algorithm (Baluja et al., 2008), which applies an update formula until $L$ converges, where $M$ is the semantic transition matrix built from $G$; $D = Diag[\log tf(t)]$ is a diagonal matrix of log frequencies, which is designed to assign higher "confidence" scores to more frequent seeds; and $\alpha$ is a balancing parameter. In particular, when $\alpha = 0$, we can set the prior knowledge $I$ without $V_c$ to $L^0$ so that only lexical semantic clue is used; otherwise, if $\alpha = 1$, only contextual semantic clue is used.
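Neither the prior estimate nor the update formula survives in this text. The sketch below assumes the natural forms suggested by the description: $r_t^+ = count^+(t) / (count^+(t) + count^-(t))$, and an update $L^{i+1} = (1 - \alpha) M^{\top} L^{i} + \alpha D I$ starting from $L^0 = I$. This matches the stated roles of $\alpha$ (0 uses only the graph, 1 uses only the CNN prior), but the exact formula, the initialization, and all names here are assumptions.

```python
def cnn_prior(pos_count, neg_count):
    """Prior (r+, r-) from how many occurrences the CNN labeled positive/negative."""
    total = pos_count + neg_count
    if total == 0:
        return (0.5, 0.5)  # assumption: uninformative prior for unseen terms
    r_pos = pos_count / total
    return (r_pos, 1.0 - r_pos)


def propagate(M, D, I, alpha, eps=1e-7, max_iter=1000):
    """Iterate L <- (1 - alpha) * M^T L + alpha * D * I until the change is below eps.

    M: n x n transition matrix (list of rows), D: list of per-term confidences
    (the diagonal of D), I: list of (r+, r-) priors.
    """
    n = len(I)
    L = [tuple(row) for row in I]
    for _ in range(max_iter):
        new_L = []
        for t in range(n):
            row = []
            for k in range(2):
                walk = sum(M[u][t] * L[u][k] for u in range(n))  # (M^T L)[t][k]
                row.append((1 - alpha) * walk + alpha * D[t] * I[t][k])
            new_L.append(tuple(row))
        delta = max(abs(a - b) for ra, rb in zip(new_L, L) for a, b in zip(ra, rb))
        L = new_L
        if delta < eps:
            break
    return L
```

On a two-node toy graph with one positive and one negative seed and $\alpha = 0.5$, the iteration settles at positive scores of about 2/3 and 1/3, showing how each node's label blends its own prior with what flows in from its neighbor.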

The Bootstrapping Framework
We summarize the bootstrapping framework of the proposed method in Algorithm 1. During bootstrapping, the CNN is enhanced by Label Propagation which finds more labeled examples for training, and then the performance of Label Propagation is also improved because the CNN outputs a more accurate prior distribution. After running for several iterations, the algorithm gets enough seeds, and a final Label Propagation is conducted to produce the results.
Algorithm 1: Bootstrapping using semantic clues
Input: The review corpus R, a large corpus C
Output: The mined product feature list P
Initialization: Train word embedding set EB first on C, and then on R
Step 1: Generate product feature seeds V_s (Section 3.1)
Step 2: Build semantic similarity graph G (Section 3.2)
while iter < MAX_ITER do
    Step 3: Use V_s to collect occurrence set T_s from R for training
    Step 4: Train a CNN N on T_s (Section 3.3)
        Apply mini-batch SGD on Equ. 9;
    Step 5: Run Label Propagation (Section 3.4)
        Classify candidates using N to set up I;
    Step 6: Expand product feature seeds
        Move top T terms from V_c to V_s;
    iter++
end
Step 7: Run Label Propagation for a final result L_f
    Rank terms by L_f^+ to get P, where L_f^+ > L_f^-;
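Step 6 can be sketched as below. The even split of the $T$ new seeds into positive and negative halves follows the later statement that 40 seeds (20 positive and 20 negative) are expanded per iteration; the tie-breaking, the data layout, and the name `expand_seeds` are assumptions.

```python
def expand_seeds(candidates, labels, T):
    """Move the T most confident candidates into the seed set.

    candidates: list of unlabeled terms (V_c).
    labels: {term: (p_pos, p_neg)} from Label Propagation.
    Returns (new_positive_seeds, new_negative_seeds, remaining_candidates).
    """
    half = T // 2
    by_pos = sorted(candidates, key=lambda t: labels[t][0], reverse=True)
    new_pos = by_pos[:half]
    rest = [t for t in candidates if t not in new_pos]
    by_neg = sorted(rest, key=lambda t: labels[t][1], reverse=True)
    new_neg = by_neg[:half]
    remaining = [t for t in rest if t not in new_neg]
    return new_pos, new_neg, remaining
```

The returned seeds would be added to $V_s$ before the CNN is retrained in the next iteration, and the remaining candidates stay in $V_c$.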

Datasets and Evaluation Metrics
Datasets: We select two real-world datasets to evaluate the proposed method. The first one is a benchmark dataset from Wang et al. (2011), which contains English review sets on two domains (MP3 and Hotel). The second dataset is provided by the Chinese Opinion Analysis Evaluation 2008 (COAE 2008), from which two review sets (Camera and Car) are selected. Xu et al. (2013) manually annotated product features on these four domains, so we directly employ their annotation as the gold standard. The detailed information can be found in their original paper.

Experimental Settings
For English corpora, the pre-processing is the same as that in Qiu et al. (2009), and for Chinese corpora, the Stanford Word Segmenter (Chang et al., 2008) is used to perform word segmentation. We select three state-of-the-art syntax-based methods to compare with our method:
DP uses a bootstrapping algorithm named Double Propagation (Qiu et al., 2009), which is a conventional syntax-based method.
DP-HITS is an enhanced version of DP proposed by Zhang et al. (2010), which ranks product feature candidates by a score (Equ. 13) in which importance(t) is estimated by the HITS algorithm (Kleinberg, 1999).
SGW is the Sentiment Graph Walking algorithm proposed by Xu et al. (2013), which first extracts syntactic patterns and then uses random walking to rank candidates. Afterwards, word-syntactic pattern co-occurrence statistics are used as features for a semi-supervised classifier, TSVM (Joachims, 1999), to further refine the results. This two-stage method is denoted as SGW-TSVM.
LEX only uses lexical semantic clue. Label Propagation is applied alone in a self-training manner. The dimension of word embedding is $n = 100$, the convergence threshold is $\varepsilon = 10^{-7}$, and the number of expanded seeds is $T = 40$. The size of the seed set $N$ is 40. To output product features, it ranks candidates in descending order by the positive score $L_f^+(t)$.
CONT only uses contextual semantic clue, so it only contains the CNN. The window size $l$ is 5. The CNN is trained with a mini-batch size of 50. The hidden layer size is $h = 250$. Finally, importance(t) in Equ. 13 is replaced with $r_t^+$ in Equ. 11 to rank candidates.

Table 1: Experimental results of product feature mining (columns: MP3, Hotel, Camera, Car, Avg.). The precision or recall of CONT is the average performance over five runs with different random initialization of parameters of the CNN. Avg. stands for the average score.

The Semantics-based Methods vs. State-of-the-art Syntax-based Methods
The experimental results are shown in Table 1, from which we have the following observations: (i) Our method achieves the best performance among all of the compared methods. We also split the dataset equally into five subsets and perform a one-tailed t-test (p ≤ 0.05), which shows that the proposed semantics-based method (LEX&CONT) significantly outperforms the three strong syntax-based competitors (DP, DP-HITS and SGW-TSVM).
(ii) LEX&CONT which leverages both lexical and contextual semantic clues outperforms approaches that only use one kind of semantic clue (LEX and CONT), showing that the combination of the semantic clues is helpful.
(iii) Our methods which use only one kind of semantic clue (LEX and CONT) outperform syntax-based methods (DP, DP-HITS and SGW). Comparing DP-HITS with LEX and CONT, the difference between them is that DP-HITS uses a syntax-pattern-based algorithm to estimate importance(t) in Equ. 13, while our methods use lexical or contextual semantic clue instead. We believe the reason that LEX or CONT is better is that syntactic patterns only use discrete and local information.
In contrast, CONT exploits latent semantics of each word in context, and LEX takes advantage of word embedding, which is induced from global word co-occurrence statistics. Furthermore, comparing SGW and LEX, both methods are based on the random surfer model, but LEX gets better results than SGW. Therefore, the word-word semantic similarity relation used in LEX is more reliable than the word-syntactic pattern relation used in SGW.
(iv) LEX&CONT achieves the highest recall among all of the evaluated methods. Since DP and DP-HITS rely on frequency for ranking product features, infrequent candidates are ranked low in their extracted list. As for SGW-TSVM, the features used for the TSVM suffer from the data sparsity problem for infrequent terms. In contrast, LEX&CONT does not depend on term frequency in the review corpus. Further discussion of this observation is given in the next section.

The Results on Extracting Infrequent Product Features
We conservatively regard the 30% of product features with the highest frequencies in $R$ as frequent features, so the remaining terms in the gold standard are infrequent features. In the product feature mining task, frequent features are relatively easy to find. Table 2 shows the recall of all four approaches for mining frequent product features. We can see that the performance is very close among different methods. Therefore, the recall mainly depends on mining the infrequent features. Figure 3 gives the recall of infrequent product features, where LEX&CONT achieves the best performance. So our method is less influenced by term frequency. Furthermore, LEX gets better recall than CONT and all syntax-based methods, which indicates that lexical semantic clue does help to mine more infrequent features, as expected.
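The frequent/infrequent split and the recall computed on each part can be illustrated as follows. How the 30% cut is rounded and how ties in frequency are broken are not specified in the text, so rounding down and the function names are assumptions.

```python
def split_by_frequency(gold_freq, top_fraction=0.3):
    """Split gold-standard features into (frequent, infrequent) sets.

    gold_freq: {feature: frequency in the review corpus R}.
    """
    ranked = sorted(gold_freq, key=gold_freq.get, reverse=True)
    cut = int(len(ranked) * top_fraction)  # assumption: round down
    return set(ranked[:cut]), set(ranked[cut:])


def recall(extracted, gold):
    """Fraction of the gold features that appear among the extracted terms."""
    return len(set(extracted) & gold) / len(gold)
```

Recall on the infrequent set is then `recall(extracted, infrequent)`, which ignores how many frequent features a method found.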

Lexical Semantic Clue vs. Contextual Semantic Clue
This section studies the effects of lexical semantic clue and contextual semantic clue during seed expansion (Step 6 in Algorithm 1), which is controlled by $\alpha$. When $\alpha = 1$, we get CONT; and if $\alpha$ is set to 0, we get LEX. To take into account the correctly expanded terms for both positive and negative seeds, we use Accuracy as the evaluation metric, $Accuracy = \frac{\#TP + \#TN}{\#\text{Extracted Seeds}}$, where $TP$ denotes the true positive seeds and $TN$ denotes the true negative seeds. Figure 4 shows the performance of seed expansion during bootstrapping, in which the accuracy is computed on 40 seeds (20 being positive and 20 being negative) expanded in each iteration. We can see that the accuracies of CONT and LEX&CONT remain at a high level, which shows that they can find reliable new product feature seeds. However, the performance of LEX oscillates sharply and is very low at some points, which indicates that using lexical semantic clue alone is infeasible. On the other hand, comparing CONT with LEX in Table 1, we can see that LEX generally performs better than CONT. Although LEX is not as accurate as CONT during seed expansion, its final performance surpasses CONT. Consequently, we can conclude that CONT is more suitable for seed expansion, and LEX is more robust for producing the final result.
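The Accuracy metric can be computed as in this short sketch, where a new positive seed counts as correct if it is a gold product feature and a new negative seed counts as correct if it is not (the function and argument names are illustrative):

```python
def seed_expansion_accuracy(new_pos, new_neg, gold_features):
    """(#TP + #TN) / #extracted seeds for one bootstrapping iteration."""
    tp = sum(1 for t in new_pos if t in gold_features)
    tn = sum(1 for t in new_neg if t not in gold_features)
    return (tp + tn) / (len(new_pos) + len(new_neg))
```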
To combine advantages of the two kinds of semantic clues, we set α = 0.7 in Step 5 of Algorithm 1, so that contextual semantic clue plays a key role to find new seeds accurately. For Step 7, we set α = 0.3. Thus, lexical semantic clue is emphasized for producing the final results.

The Effect of Convolutional Layer
Two non-convolutional variations of the proposed method are compared with the convolutional method in CONT. FW-5 uses a traditional neural network with a fixed window size of 5 to replace the CNN in CONT, and the candidate term to be classified is placed in the center of the window. Similarly, FW-9 uses a fixed window size of 9. Note that CONT uses a 5-term dynamic window containing the candidate term, so the number of context words it exploits is equivalent to that of FW-9. Table 3 shows the experimental results. We can see that the performance of FW-5 is much worse than CONT. The reason is that FW-5 only exploits half as much context as CONT, which is not sufficient. Meanwhile, although FW-9 exploits the same range of context as CONT, it gets lower precision. This is because FW-9 has approximately twice as many parameters in the parameter matrix $W^{(1)}$ as that in Equ. 5 of CONT, which makes it more difficult to train with the same amount of data. Also, many sentences in the review corpora are shorter than 9 words. Therefore, the convolutional approach in CONT is the most effective among these settings.
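The parameter comparison can be checked with a short calculation. It uses $n = 100$ and $h = 250$ from the experimental settings and assumes that FW-5 and FW-9 keep the same embedding dimension and hidden layer size as CONT, which the text does not state explicitly.

```python
def first_layer_params(n, l, h):
    """Number of weights in W(1), whose shape is h x (n * l)."""
    return h * n * l


cont_params = first_layer_params(n=100, l=5, h=250)  # 5-term window
fw9_params = first_layer_params(n=100, l=9, h=250)   # fixed 9-term window
ratio = fw9_params / cont_params
```

Under these assumptions $W^{(1)}$ has 125,000 weights for a 5-term window and 225,000 for a 9-term window, a ratio of 1.8, consistent with "approximately twice as many".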

Parameter Study
We investigate two key parameters of the proposed method: the initial number of seeds $N$, and the size of the window $l$ used by the CNN. Figure 5 shows the performance under different $N$, where the F-Measure saturates once $N$ reaches 40. Hence, very few seeds are needed to start our algorithm. Figure 6 shows the F-Measure under different window sizes $l$. We can see that the performance improves little when $l$ is larger than 5. Therefore, $l = 5$ is a proper window size for these datasets.

Conclusion and Future Work
This paper proposes a product feature mining method by leveraging contextual and lexical semantic clues. A semantic similarity graph is built to capture lexical semantic clue, and a convolutional neural network is used to encode contextual semantic clue. Then, a Label Propagation algorithm is applied to combine both semantic clues. Experimental results demonstrate the effectiveness of the proposed method, which not only mines product features more accurately than conventional syntax-based methods, but also extracts more infrequent product features.
In future work, we plan to extend the proposed method to jointly mine product features along with customers' opinions on them. The learnt semantic representations of words may also be utilized to predict fine-grained sentiment distributions over product features.