A Rule-Based Approach to Aspect Extraction from Product Reviews

Sentiment analysis is a rapidly growing research ﬁeld that has attracted both academia and industry because of the challenging research problems it poses and the potential beneﬁts it can provide in many real life applications. Aspect-based opinion mining, in particular, is one of the fundamental challenges within this research ﬁeld. In this work, we aim to solve the problem of aspect extraction from product reviews by proposing a novel rule-based approach that exploits common-sense knowledge and sentence dependency trees to detect both explicit and implicit aspects. Two popular review datasets were used for evaluating the system against state-of-the-art aspect extraction techniques, obtaining higher detection accuracy for both datasets.


Introduction
In opinion mining, different levels of granularity analysis have been proposed, each one having its own advantages and disadvantages. Aspect-based opinion mining (Hu and Liu, 2004;Ding et al., 2008) focuses on the extraction of aspects (or product features) from opinionated text and on the inference of polarity values associated with these. For example, a sentence like "I love the touchscreen of my phone but the battery life is so short" contains two aspects or opinion targets, namely touchscreen and battery life. In this case, applying a sentence level polarity detection technique would mistakenly result in a polarity value close to neutral, since the two opinions expressed by the users are opposite. Hence, aspect extraction is necessary to first deconstruct sentences into product features and then assign a separate polarity value to each of these features.
There are two types of aspects defined in aspect-based opinion mining: explicit and implicit. Explicit aspects are concepts that explicitly denote targets in the opinionated sentence. For instance, in the above example, touchscreen and battery life are explicit aspects as they are explicitly mentioned in the sentence. On the other hand, an aspect can also be expressed indirectly through an implicit aspect clue (IAC), e.g., in the sentence "This camera is sleek and very affordable", which implicitly provides a positive opinion about the aspects appearance and price of the entity camera.
Explicit aspect extraction has been widely researched and there exists several approaches for this task. Still, limited work has been done in extracting implicit aspects. This task is very difficult yet very important because the phenomenon of implicit aspects is present in nearly every opinionated document. For example, the following document extracted from the corpus (Hu and Liu, 2004) uses only implicit aspects: This is the best phone one could have. It has all the features one would need in a cellphone: It is lightweight, sleek and attractive. I found it very user-friendly and easy to manipulate; very convenient to scroll in menu etc.
Here, the word "lightweight" refers to the weight of the phone; the words "sleek" and "attractive" to its appearance; the compound "user-friendly" to its interface; the phrase "easy to manipulate" to its functionality; finally, the phrase "to scroll in menu" can be interpreted as a reference to the interface of the phone or its menu. Even though the aspects appearance, weight and interface do not appear in the sentence, the context contains clues that permit us to infer them. Namely, the words "sleek," "lightweight," and "user-friendly" that do occur in the context suggest these aspects.
In contrast to the task of identification of explicit aspects, the general scheme for identification of implicit aspects, a task called implicit aspect extraction, typically involves two steps: 1. Identify IACs (e.g., "sleek") in the opinionated document.
In this paper, we propose a novel approach to detect explicit aspects and IACs from opinionated documents. We also map IACs to their respective aspect categories. IACs are either single words, such as "sleek," or multi-word expressions, such as "easy to manipulate" as in the above example. Each IAC can be represented by a different part-of-speech (POS): in the example "This MP3 player is really expensive," the IAC "expensive" suggesting the aspect price is an adjective; in "This camera looks great," the IAC "look" suggesting appearance is a verb; in "I hate this phone. It only lasted less than six months!", the IAC "lasted" suggesting durability of the phone is a verb. In the following examples, IACs are nouns or noun phrases: "Even if I had paid full price I would have considered this phone a good deal," "Not to mention the sleekness of this phone", "The player keeps giving random errors", "This phone is a piece of crap." In different contexts, the same implicit aspect can be implied by different IACs, as shown below for the implicit aspect price: • This mp3 player is very affordable.
• This mp3 player also costs a lot less than the ipod.
• This mp3 player is quite cheap.
• I bought this mp3 for almost nothing!
• This mp3 player has been fairly innovative and reasonably priced.
A common approach for IAC identification is to assume that sentiments or polarity words are good candidates for IACs: for example, in "This MP3 player is really expensive," the word "expensive", which bears negative polarity, is also the IAC for the aspect price. However, this is not always true. For example, in "This camera looks great," the word "looks" implies the appearance of the phone, while polarity is conveyed through the word "great." In "I hate this phone. It only lasted less than six months!", the word "lasted" is the IAC for durability of the phone, while polarity is indicated by "hate." Furthermore, the second sentence of this example could appear without the first one: "This phone only lasted less than six months" and still constitute a negative opinion of the phone's durability, but not expressed by any specific word.
This phenomenon is known in opinion mining as desirable fact: communicating fact that by commonsense are good or bad, which indirectly implies polarity. For example, the objective fact "The camera can hold lots of pictures" does not contain any sentiment or polarity word yet gives a positive opinion about the camera's memory capacity (IAC "hold"), because it is desirable for a camera to hold many pictures.
In this paper, we present a rule-based approach that exploits common-sense knowledge and sentence dependency trees to detect both implicit and explicit aspects. In particular, the approach draws lessons from recent developments in common-sense reasoning (Cambria et al., 2011;Cambria et al., 2014a) and concept-level sentiment analysis (Xia et al., 2013;Poria et al., 2014) to first obtain the dependency structure of each sentence and, hence, exploit external knowledge to extract aspects and infer the polarity associated with them. The paper is organized as follows: Section 2 presents the literature in aspect extraction; Section 3 explains the features used for the labeler; Section 4 discusses novelty of the proposed methodology; Section 5 describes in detail the aspect extraction approach and results of the experimental evaluation; finally, Section 6 concludes the paper.

Related Work
Aspect extraction from opinionated text was first studied by Hu and Liu (Hu and Liu, 2004), who also introduced the distinction between explicit and implicit aspects. However, the authors only dealt with explicit aspects by adopting a set of rules based on statistical observations. Hu and Liu's method was improved by Popescu and Etzioni (Popescu and Etzioni, 2005) and by Blair-Goldensonh (Blair-Goldensohn et al., 2008). Popescu and Etzioni assumed the product class to be known as priori. Their algorithm detects whether a noun or noun phrase is a product feature or not by computing PMI between the noun phrase and the product class. Scaffidi et al. (Scaffidi et al., 2007) presented a method that uses a language model to identify product features. They assumed that product features are more frequent in product reviews than in general natural language text. However, their method seems to be very inaccurate in terms of precision as the retrieved aspects extracted by their method were very noisy.
Aspect extraction can be seen as a general information extraction problem, for which techniques based on sequential labeling are generally used. The most popular methods in this context, in particular, are Hidden Markov Models (HMM) and Conditional Random Fields (CRF) (Lafferty et al., 2001). Jin and Ho (Jin and Ho, 2009) used a lexicalized HMM for joint extraction of opinions along with their explicit aspects. Niklas and Gurevych (Niklas and Gurevych, 2010) used CRF to extract explicit aspects in a custom corpus with data of different domains. Li et al. (Li et al., 2010), Choi and Cardie (Choi and Cardie, 2010) and Huang et al. (Huang et al., 2012) also used CRF for extraction of explicit aspects.
As to the implicit aspects, the OPINE extraction system developed by Popescu and Etzioni (Popescu and Etzioni, 2005) was the first that leveraged on the extraction of this type of aspects to improve polarity classification. However, their system is not described in detail and is not publicly available. To the best of our knowledge, all existing methods for implicit aspect extraction are based on the use, in one or another way, of what we term IAC. Su (Su et al., 2008) proposed a clustering method to map IACs (which were assumed to be sentiment words) to their corresponding explicit aspects. The method exploits the mutual reinforcement relationship between an explicit aspect and a sentiment word forming a cooccurring pair in a sentence. Hai (Zhen et al., 2011) proposed a two-phase co-occurrence association rule mining approach to match implicit aspects (which were also assumed to be sentiment words) with explicit aspects. Finally, Zeng and Li (Zeng and Li, 2013) proposed a rule-based method to extract explicit aspects and mapped implicit features by using a set of sentiment words and by clustering explicit feature-word pairs.

Corpus for aspect extraction
In order to evaluate the explicit aspect extraction algorithm, we use the corpus provided by (Hu and Liu, 2004) and the Semeval 2014 dataset 1 ( Table 1). As for the implicit aspect extraction algorithm and lexicon, we use the corpus developed by Cruz-Garcia et al. (Cruz-Garcia et al., 2014), who manually labeled each IAC and their corresponding aspects in a well-known corpus for opinion mining (Hu and Liu, 2004). The corpus is publicly available for research purposes. 2

Pre-Processing
Pre-processing is a key step for aspect parsing. The pre-processing module of the proposed framework consists of two major steps: firstly, the sentence dependency tree is obtained through Stanford Dependency Parser 3 ; secondly, dependency structure elements are processed by means of Stanford Lemmatizer for each sentence. It is important to build the dependency tree before lemmatization as swapping the two steps results in several imprecisions caused by the lower grammatical accuracy of lemmatized sentences.

Implicit aspect lexicon
We use the implicit aspect corpus developed by Cruz-Garcia et al. (Cruz-Garcia et al., 2014), where IACs are indicated and manually labeled by their corresponding aspect categories. For our task, we extracted the sentences having implicit aspects and then extracted IACs for each of them, along with their corresponding labeled categories. For example, in "The car is expensive" the IAC is expensive and it is labeled by the category price. Below is the list of the aspect categories extracted from the corpus: • functionality For each IAC under every aspect category, synonyms and antonyms were obtained from WordNet (Fellbaum, 1998) and stored under the same aspect category. For example, expensive and its antonym inexpensive both have the same category price. Semantics extracted from SenticNet (Cambria et al., 2014b) have also been exploited to enlarge the set of conceptually related IACs. Thus, a lexicon of 1,128 IACs categorized into the above categories was built.

Opinion Lexicon
We use SenticNet 3 as a concept-level opinion lexicon. The common-sense knowledge base contains 30,000 multi-word expressions labeled by their polarity scores. The proposed aspect parser is based on two general rules: • Rules for the sentences having subject verb.
• Rules for the sentences which do not have subject verb.
A dependency relation is a binary relation characterized by the following features: • The type of the relation that specifies the nature of the (syntactic) link between the two elements in the relation.
• The head of the relation: this is the element that is the pivot of the relation. Core syntactic and semantics properties (e.g., agreement) are inherited from the head.
• The dependent is the element that depends on the head and which usually inherits some of its characteristics (e.g., number, gender in the case of agreement).
Most of the times, the active token is considered in a relation if it acts as the head of the relation, although there are exceptions. Once the active token has been identified as the trigger for a rule, there are several ways to compute its contribution, depending on how the dependency relation and the properties of the tokens match with the rules. The preferred way is not to consider the contribution of the token alone, but in combination with the other elements in the dependency relation. First of all, Stanford parser is used to obtain the dependency parse structure of each sentence. Then, hand-crafted dependency rules are employed on the parse trees to extract aspects.

Subject Noun Rule
Trigger: when the active token is found to be the syntactic subject of a token. Behavior: if an active token h is in a subject noun relationship with a word t then: 1. if t has any adverbial or adjective modifier and the modifier exists in SenticNet, then t is extracted as an aspect.
2. if the sentence does not have auxiliary verb, i.e., is, was, would, should, could, then: • if the verb t is modified by an adjective or an adverb or it is in adverbial clause modifier relation with another token, then both h and t are extracted as aspects. In (1), battery is in a subject relation with lasts and lasts is modified by the adjective modifier little, hence both the aspects last and battery are extracted.
• if t has any direct object relation with a token n and the POS of the token is Noun and n is not in SenticNet, then n is extracted as an aspect. In (2), like is in direct object relation with lens so the aspect lens is extracted.
(2) I like the lens of this camera.
• if t has any direct object relation with a token n and the POS of the token n is Noun and n exists in SenticNet, then the token n extracted as aspect term. In the dependency parse tree of the sentence, if another token n 1 is connected to n using any dependency relation and the POS of n 1 is Noun, then n 1 is extracted as an aspect. In (3), like is in direct object relation with beauty which is connected to screen via a preposition relation. So the aspects screen and beauty are extracted.
(3) I like the beauty of the screen.
• if t is in open clausal complement relation with a token t 1 , then the aspect t-t 1 is extracted if t-t 1 exists in the opinion lexicon. If t 1 is connected with a token t 2 whose POS is Noun, then t 2 is extracted as an aspect. In (4), like and comment is in clausal complement relation and comment is connected to camera using a preposition relation. Here, the POS of camera is Noun and, hence, camera is extracted as an aspect.
(4) I would like to comment on the camera of this phone.
3. A copula is the relation between the complement of a copular verb and the copular verb. If the token t is in copula relation with a copular verb and the copular verb exists in the implicit aspect lexicon, then t is extract as aspect term. In (5), expensive is extracted as an aspect.
(5) The car is expensive.
4. If the token t is in copula relation with a copular verb and the POS of h is Noun, then h is extracted as an explicit aspect. In (6), camera is extracted as an aspect.
(6) The camera is nice.
5. If the token t is in copula relation with a copular verb and the copular verb is connected to a token t 1 using any dependency relation and t 1 is a verb, then both t 1 and t are extracted as implicit aspect terms, as long as they exist in the implicit aspect lexicon. In (7), lightweight is in copula relation with is and lightweight is connected to the word carry by open clausal complement relation. Here, both lightweight and carry are extracted as aspects.
( 7) The phone is very lightweight to carry.

Sentences which do not have subject noun relation in their parse tree
For sentences that do not have noun subject relation in their parse trees, aspects are extracted using the following rules: 1. if an adjective or adverb h is in infinitival or open clausal complement relation with a token t and h exists in the implicit aspect lexicon, then h is extracted as an aspect. In (8), big is extracted as an aspect as it is connected to hold using a clausal complement relation.
2. if a token h is connected to a noun t using a prepositional relation, then both h and t are extracted as aspects. In (9) sleekness is extracted as an aspect.
(9) Love the sleekness of the player.
3. if a token h is in a direct object relation with a token t, t is extracted as aspect. In (10), mention is in a direct object relation with price, hence price is extracted as an aspect.
(10) Not to mention the price of the phone.

Additional Rules
• For each aspect term extracted above, if an aspect term h is in co-ordination or conjunct relation with another token t, then t is also extracted as an aspect. In (11), amazing is firstly extracted as an aspect term. As amazing is in conjunct relation with easy, then use is also extracted as an aspect.
(11) The camera is amazing and easy to use.
• A noun compound modifier of an NP is any noun that serves to modify the head noun. If t is extracted as an aspect and t has noun compound modifier h, then the aspect h-t is extracted and t is removed from the aspect list. In (12), as chicken and casserole are in noun compound modifier relation, only chicken casserole is extracted as an aspect.
We ordered the chicken casserole, but what we got were a few small pieces of chicken, all dark meat and on the bone.

Novelty of the proposed work
First of all, the proposed method is fully unsupervised and depends on the accuracy of the dependency parser and the opinion lexicon, rather then a training corpus and supervised learning accuracy. Only (Qiu et al., 2011) follow an unsupervised learning approach but the proposed method uses an enhanced set of rules and opinion lexicon. The proposed method also outperforms (Qiu et al., 2011) on the same dataset they used. Implicit aspects extracted through the proposed method differ from implicit aspect expressions defined by Liu (Liu, 2012) as "aspect expressions that are not nouns or noun phrases" in that implicit aspects extracted by the proposed algorithm semantically refer to the values of the pre-defined aspects, irrespective of their own surface POS. Below are listed some examples where the implicit aspect terms are either noun or noun phrases.
In (13), the IAC deal is extracted.
(13) Even if I had paid full price I would have considered this phone a good deal.
(14) Not to mention the sleekness of this phone.
In (15), the IAC errors is extracted by the algorithm.
(15) The player keeps giving random errors.
In (16), piece of crap is a noun phrase and is extracted as an IAC by the proposed algorithm.
(16) This phone is a piece of crap.
A demo of the developed aspect parser is freely available at http://sentic.net/demo.      (Hu and Liu, 2004) Experimental evaluation was carried out on the dataset derived from (Hu and Liu, 2004). As discussed in Section 3, the proposed method is able to extract both explicit and implicit aspects. To the best of our knowledge, there is no state-of-the-art benchmark to evaluate implicit aspect extraction. We compare the proposed framework with those in Hu and Liu (Hu and Liu, 2004), Qiu et al. (Qiu et al., 2011), and Popescu and Etzioni (Popescu and Etzioni, 2005) (which only carried out explicit aspect extraction). Table 2, Table 3, Table 4, Table 5 and Table 6 show that the proposed framework outperforms all existing methods in terms of both precision and recall.

Conclusion
We have illustrated a method for extracting both explicit and implicit aspects from opinionated text. The proposed framework only leverages on common-sense knowledge and on the dependency structure of sentences and, hence, is unsupervised. As future work, we aim to discover more rules for aspect extraction. Another key future effort is to combine existing rules for complex aspect extraction. To obtain the aspect categories of IACs, we have developed an aspect knowledge base using WordNet and SenticNet. We will focus on extending the scalability of such knowledge base and on making it as much noise-free as possible.

Experiment on Semeval 2014 dataset
We also carried out experiments on Semeval 2014 aspect based sentiment analysis data obtained from http://alt.qcri.org/semeval2014/task4/index.php?id=data-and-tools. Results are shown in Table 7. We cannot perform a comparative evaluation of such experimental results as there is no state-of-art approach yet which used this dataset for the same kind of experiment. Overall, results show high accuracy.