Reasoning Implicit Sentiment with Chain-of-Thought Prompting

While sentiment analysis systems try to determine the sentiment polarities of given targets based on the key opinion expressions in input texts, in implicit sentiment analysis (ISA) the opinion cues come in an implicit and obscure manner. Detecting implicit sentiment thus requires common-sense and multi-hop reasoning to infer the latent intent of the opinion. Inspired by the recent chain-of-thought (CoT) idea, in this work we introduce a Three-hop Reasoning (THOR) CoT framework to mimic the human-like reasoning process for ISA. We design a three-step prompting principle for THOR to step-by-step induce the implicit aspect, the opinion, and finally the sentiment polarity. Our THOR+Flan-T5 (11B) pushes the state of the art (SoTA) by over 6% F1 in the supervised setup. More strikingly, THOR+GPT3 (175B) boosts the SoTA by over 50% F1 in the zero-shot setting.


Introduction
Sentiment analysis (SA) aims to detect the sentiment polarity towards a given target based on the input text. SA can be classified into explicit SA (ESA) and implicit SA (ISA); the former is the current mainstream task, in which the emotional expressions explicitly occur in texts (Pontiki et al., 2014). Different from ESA, ISA is much more challenging, because in ISA the inputs contain only factual descriptions with no explicit opinion expression directly given (Russo et al., 2015). For example, given the text 'Try the tandoori salmon!', which has no salient cue word, almost all existing sentiment classifiers predict a neutral polarity towards 'the tandoori salmon'. Humans, however, can easily determine the sentiment accurately, because we always grasp the real intent or opinion behind the texts. Thus, without truly understanding how the sentiment is aroused, traditional SA methods are ineffective for ISA.

[Figure 1: Detecting the explicit and implicit sentiment polarities towards targets. Explicit opinion expression helps direct inference, while detecting implicit sentiment requires common-sense and multi-hop reasoning. For the implicit case, a human reasons: tandoori salmon is a dish made with salmon; by saying this, the speaker is recommending the tandoori salmon, mostly because he or she believes its taste is good and worth trying; thus the polarity towards 'the tandoori salmon' is positive.]
In fact, it is critical to first discover the hidden opinion contexts to achieve accurate ISA. For the explicit case#1 in Fig. 1, it is effortless to capture the overall sentiment picture (e.g., 'environment' is the aspect, 'great' is the opinion), and thus precisely infer the positive polarity towards the given target hotel. Inspired by such fine-grained sentiment spirit (Xue and Li, 2018), we consider mining the implicit aspect and opinion states. For the implicit case#2 in Fig. 1, if a model can first infer the key sentiment components, e.g., the latent aspect 'taste' and the latent opinion 'good and worth trying', the inference of the final polarity is greatly eased. To reach this goal, the capabilities of common-sense reasoning (i.e., inferring what 'tandoori salmon' is) and multi-hop reasoning (i.e., inferring the aspect and then the opinion) are indispensable.
Fortunately, the recent great triumph of pre-trained large language models (LLMs) offers a promising solution. On the one hand, LLMs have been found to carry very rich world knowledge, showing extraordinary ability in common-sense understanding (Paranjape et al., 2021). On the other hand, the latest chain-of-thought (CoT) idea has revealed the great potential of LLMs' multi-hop reasoning, where an LLM with some prompts can do chain-style reasoning impressively. Built on top of all these successes, in this work we implement a Three-hop Reasoning CoT framework (namely THOR) for ISA. Based on an LLM, we design three prompts for three steps of reasoning, each of which respectively infers 1) the fine-grained aspect of the given target, 2) the underlying opinion towards the aspect, and 3) the final polarity. With such easy-to-hard incremental reasoning, the hidden contexts of the overall sentiment picture are elicited step by step, making the prediction of the final polarity easier and effectively alleviating the difficulty of the task.

[Figure 2: Three-hop reasoning with CoT prompting on the example "The new mobile phone can be just put in my pocket." A vanilla prompt predicts neutral, since the sentence "simply states that the new mobile phone can be put in the speaker's pocket and does not express any positive or negative sentiment towards the phone". THOR instead first asks which specific aspect of the new mobile phone is possibly mentioned (the size or portability), then infers the implicit opinion towards that aspect (the portability is good, because the speaker can easily carry the phone by placing it in their pocket and finds it convenient and easy to use), and finally infers the polarity (positive), appending each answer to the context of the next step.]
To ensure the correctness of each reasoning step, we consider a self-consistency mechanism for CoT inspired by Wang et al. (2022b), which selects the candidate answers (at each step) with high voting consistency of the inferred aspect and opinion. For the supervised fine-tuning setup, we further propose a reasoning revising method: we use the intermediate reasoning answers as model inputs to predict the final labels, where the supervision from gold labels teaches the LLM to generate more correct reasoning. In the supervised fine-tuning setup, our Flan-T5 based THOR improves the current best-performing baseline by more than 6% F1, and such margins are further magnified in the zero-shot setup. Most strikingly, our GPT3-based THOR with 175B parameters boosts the baseline with an impressive 51.10% increase in F1.
To sum up, this work contributes a multi-hop reasoning solution for implicit sentiment detection, which helps to achieve impressive improvement over the traditional non-reasoning methods. To our knowledge, this is the first attempt to successfully extend the CoT idea to the sentiment analysis community. Our method is simple yet effective, and can be broadly applied to other similar NLP problems without much effort.

Three-hop Reasoning Framework
The task of SA (either ESA or ISA) is defined as follows: given a sentence X with a target term t ⊂ X, a model determines the sentiment polarity y towards t, i.e., positive, neutral or negative. We solve the task using an off-the-shelf LLM with prompting. For the standard prompt-based method, we can construct the following prompt template as the LLM's input: Given the sentence X, what is the sentiment polarity towards t?
The LLM returns the answer ŷ = arg max p(y|X, t).
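For illustration, the standard (single-hop) prompt can be assembled as in the following minimal sketch; the function name is ours, not from the paper:

```python
def standard_prompt(x: str, t: str) -> str:
    """Assemble the standard single-hop prompt asking directly for the polarity."""
    return f'Given the sentence "{x}", what is the sentiment polarity towards {t}?'

# Example:
p = standard_prompt("The new mobile phone can be just put in my pocket.",
                    "the new mobile phone")
```

The LLM's answer to this single prompt is taken as ŷ, with no intermediate reasoning.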

Chain-of-Thought Prompting
Now we consider the CoT-style prompting method for multi-step reasoning. Instead of directly asking the LLM for the final result y, in our THOR (cf. Fig. 2) we prompt the LLM to infer the latent aspect and opinion information before answering the final y. We define the intermediate aspect term a and the latent opinion expression o, and construct the three-hop prompts as follows.
Step 1. We first ask the LLM which aspect a is mentioned, with the following template: C_1: [Given the sentence X], which specific aspect of t is possibly mentioned?
C_1 is the first-hop prompt context. This step can be formulated as A = arg max p(a|X, t), where A is the output text that explicitly mentions the aspect a.
Step 2. Now, based on X, t and a, we ask the LLM to answer in detail what the underlying opinion o towards the mentioned aspect a would be: C_2: [C_1, A]. Based on the common sense, what is the implicit opinion towards the mentioned aspect of t, and why?
C_2 is the second-hop prompt context, which concatenates C_1 and A. This step can be written as O = arg max p(o|X, t, a), where O is the answer text containing the possible opinion expression o.
Step 3. With the complete sentiment skeleton (X, t, a and o) as context, we finally ask the LLM to infer the polarity towards t: C_3: [C_2, O]. Based on such opinion, what is the sentiment polarity towards t?
C_3 is the third-hop prompt context. We note this step as ŷ = arg max p(y|X, t, a, o).
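The three hops above can be sketched as follows. Here `llm` is a hypothetical text-completion callable (any LLM API would do; it is not part of the paper's released code), and the prompt wording mirrors the templates C_1 to C_3:

```python
def three_hop(llm, x: str, t: str) -> str:
    """Run THOR's three-hop prompt chain; `llm` maps a prompt string to an answer."""
    # Hop 1: infer the fine-grained aspect a (answer text A).
    c1 = f'Given the sentence "{x}", which specific aspect of {t} is possibly mentioned?'
    a = llm(c1)

    # Hop 2: append A to the context and infer the implicit opinion o (answer text O).
    c2 = (f"{c1} {a} Based on the common sense, what is the implicit opinion "
          f"towards the mentioned aspect of {t}, and why?")
    o = llm(c2)

    # Hop 3: append O and ask for the final polarity y.
    c3 = f"{c2} {o} Based on such opinion, what is the sentiment polarity towards {t}?"
    return llm(c3)
```

Each hop's answer is appended to the growing context, so the final question is asked against the full sentiment skeleton (X, t, a, o).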

Enhancing Reasoning via Self-consistency
We further leverage the self-consistency mechanism (Wang et al., 2022b) to consolidate the reasoning correctness. Specifically, at each of the three reasoning steps, we let the LLM decoder generate multiple answers, each of which is likely to give varied predictions of the aspect a, the opinion o and the polarity y. At each step, the answers with high voting consistency on the inferred a, o or y are kept, and we select the one with the highest confidence as the context for the next step.
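A minimal sketch of the voting step, assuming the decoder has already sampled several candidate answers for one hop (the extraction of a, o or y from each answer text is elided here):

```python
from collections import Counter

def vote(candidates: list[str]) -> str:
    """Keep the candidate with the highest voting consistency (majority vote)."""
    counts = Counter(candidates)
    return counts.most_common(1)[0][0]

# E.g., three sampled step-1 answers that mostly agree on the aspect:
best = vote(["portability", "portability", "screen size"])  # -> "portability"
```

In practice the vote is taken over the inferred sentiment elements rather than raw strings, but the majority principle is the same.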

Reasoning Revising with Supervision
We can also fine-tune our THOR when a training set is available, i.e., the supervised fine-tuning setup, for which we devise a reasoning revising method. Technically, at each step we construct a prompt by concatenating 1) the initial context, 2) this step's reasoning answer text and 3) the final question, and feed it into the LLM to predict the sentiment label instead of proceeding to the next reasoning step. For example, at the end of step 1, we can assemble the prompt: [C_1, A, 'what is the sentiment polarity towards t?'].
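As a sketch, the revising prompt at the end of a step can be assembled like this (variable names are illustrative only):

```python
def revising_prompt(context: str, answer: str, t: str) -> str:
    """Concatenate the step context, its reasoning answer, and the final question."""
    return f"{context} {answer} what is the sentiment polarity towards {t}?"

# E.g., at the end of step 1, with context C_1 and answer text A:
p = revising_prompt('Given the sentence "...", which specific aspect of t is possibly mentioned?',
                    "The aspect mentioned is the portability.", "t")
```

The LLM is fine-tuned to predict the gold polarity label from this prompt, so gradients flow back through the intermediate answer.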
In the supervision of gold labels, the LLM will be taught to generate more correct intermediate reasoning that helps the final prediction.

Results on Zero-shot Reasoning

In Table 2 we compare the zero-shot performances. We find that the improvement of both the prompt-based and the CoT-based methods over the current SoTA baseline increases dramatically, and overall the CoT-based methods with our THOR show a much more significant improvement on ISA. For example, our Flan-T5-11B THOR system gives over 30% F1 average improvement over the best-performing baseline (BERT_Asp+SCAPT) on the two datasets. Most strikingly, when THOR is equipped with a super-large LLM, i.e., GPT3-175B, we observe an impressive improvement, close to the level of Flan-T5-11B THOR in the supervised setting in Table 1.

Influence of Different Model Sizes of LLMs
In Tables 1 and 2 we have witnessed the power of using (very) large LLMs. In Fig. 3 we study the influence of different LLM scales. We see that with increasing model scale, the efficacy of our multi-hop reasoning prompting is exponentially amplified. This coincides with existing findings on CoT prompting methods, i.e., the larger the LM, the more significant the improvement from CoT, because when the LLM is sufficiently large, the capabilities of common-sense and multi-hop reasoning are greatly developed and strengthened.
Improving ChatGPT with THOR The recent release of ChatGPT has brought revolutionary advancement to the NLP and AI community. Here we compare the improvement from our THOR on GPT3 (175B) and ChatGPT, respectively. In Fig. 4 we can see that THOR improves both of them on ISA very considerably.

[Figure 5: Error analysis.]
Failure Analysis In Fig. 5 we show the error rates of failure cases when using THOR, where we summarize three error types. The Flan-T5-11B LLM gives a 48.27% error rate in the zero-shot setup, which goes down to 12.79% when fine-tuned with supervision. Unsupervised GPT3 (175B) gives a similarly low error rate as supervised Flan-T5, while the latter fails much more frequently due to incapability of reasoning. In contrast to supervised Flan-T5, the majority of failures of unsupervised GPT3 come from problematic data annotation. Since supervised Flan-T5 is fine-tuned with supervision of 'false' labels, it may actually learn spurious correlations yet achieve higher testing accuracy.

Related Work
Sentiment analysis has long been a hot research topic in the NLP community (Pang and Lee, 2007; Dong et al., 2014; Fei et al., 2020b; Wu et al., 2021; Fei et al., 2022d). While explicit SA models can make predictions based on the opinion expressions effortlessly (Fei et al., 2022a), implicit SA is much trickier due to its hidden opinion characteristics (Wang et al., 2022a), and ISA is often more ubiquitous in realistic scenarios. Although efforts have been made on ISA (Wang et al., 2022a), existing work is still limited to the traditional paradigm of inference. As aforementioned, ISA should be addressed via reasoning, i.e., common-sense and multi-hop reasoning. This work thus follows such intuition, targeting ISA with a multi-hop reasoning mechanism.

As a key branch of SA, fine-grained SA has been well explored (Wang et al., 2017; Peng et al., 2020). The idea of fine-grained SA is to break SA down into several key sentiment elements, including the target, aspect, opinion and sentiment polarity, all of which together form a complete sentiment picture in detail (Shi et al., 2022; Fei et al., 2021a, 2022b, 2023). This work draws on the same spirit of fine-grained SA: we believe the reasoning of implicit sentiment should be an incremental process, inferring the sentiment elements step by step and finally understanding the sentiment polarity in an easy-to-hard manner.

Language model pre-training has received increasing research attention for enhancing the utility of downstream applications (Raffel et al., 2020; Fei et al., 2020a, 2021b, 2022c). Most recently, large language models (LLMs) have shown great potential toward human-level intelligence, e.g., ChatGPT (Ouyang et al., 2022). LLMs have been extensively demonstrated to exhibit extraordinary abilities in common-sense understanding (Paranjape et al., 2021) and multi-hop reasoning.
This work implements implicit sentiment reasoning built upon LLMs, based on the recently proposed chain-of-thought (CoT) idea. CoT prompting is a gradient-free technique that induces large LMs to produce intermediate reasoning steps leading to the final answer. Prior work formally studies CoT prompting in language models, eliciting LMs to generate coherent series of intermediate reasoning steps that lead to the final answer to the original question.

Conclusion
In this paper, we present a Three-hop Reasoning prompting framework to achieve a chain-of-thought reasoning process for implicit sentiment analysis. Based on an existing LLM, we design three prompts for three steps of reasoning, each of which respectively infers the fine-grained aspect, the underlying opinion and the final polarity. On the ISA datasets, different LLMs equipped with our THOR show impressive performances over the existing best-performing baselines in both the supervised and zero-shot setups. We show that the larger the LLM, the more significant the improvement brought by our THOR method.

Case Study

• Case-I Input text: I just need to walk downstairs to get to the metro station as it is below the hotel I'm living in.
The gold sentiment label is positive towards the metro station.
In Figs. 6, 7 and 8, we show that our THOR successfully induces ChatGPT to finally give a correct decision on the sentiment polarity, where the other two methods fail.
• Case-II Input text: Lunch came with pickels and slaw, no extra charge.
The gold sentiment label is positive towards Lunch. Figs. 9, 10 and 11 show the prompts and the LLM's responses, respectively. Our THOR induces ChatGPT to reach a correct decision on the sentiment polarity, while the other two methods still fail.