If you’ve got it, flaunt it: Making the most of fine-grained sentiment annotations

Fine-grained sentiment analysis attempts to extract sentiment holders, targets, and polar expressions and to resolve the relationships between them, but progress has been hampered by the difficulty of annotation. Targeted sentiment analysis, on the other hand, is a narrower task, focusing on extracting sentiment targets and classifying their polarity. In this paper, we explore whether incorporating holder and expression information can improve target extraction and classification, and perform experiments on eight English datasets. We conclude that jointly predicting target and polarity BIO labels improves target extraction, and that augmenting the input text with gold expressions generally improves targeted polarity classification. This highlights the potential importance of annotating expressions for fine-grained sentiment datasets. At the same time, our results show that the performance of current models for predicting polar expressions is poor, limiting the benefit of this information in practice.


Introduction
Sentiment analysis comes in many flavors, arguably the most complete of which is what is often called fine-grained sentiment analysis (Wiebe et al., 2005; Liu, 2015). This approach models the sentiment task as minimally extracting all opinion holders, targets, and expressions in a text and resolving the relationships between them. This complex task is further complicated by interactions between these elements, strong domain effects, and the subjective nature of sentiment. Take the annotated sentence in Figure 1 as an example. Knowing that the target "UMUC" is modified by the expression "5 stars" and not "don't believe" is important for correctly classifying the polarity. Additionally, the fact that this is a belief held by "some others", as opposed to the author of the sentence, can help us determine the overall polarity expressed in the sentence.
Compared to document- or sentence-level sentiment analysis, where distant labelling schemes can be used to obtain annotated data, fine-grained sentiment annotation does not occur naturally. The annotation is demanding and has low inter-annotator agreement (Wiebe et al., 2005; Wang et al., 2017), leading to relatively small datasets, which means that current machine learning models are often hampered by a lack of training data. This raises the question: is it worth it to annotate full fine-grained sentiment?
Targeted sentiment (Mitchell et al., 2013;Zhang et al., 2015) is a reduction of the fine-grained sentiment task which concentrates on extracting sentiment targets and classifying their polarity, effectively ignoring sentiment holders and expressions. The benefit of this setup is that it is faster to annotate and simpler to model. But would targeted sentiment models benefit from knowing the sentiment holders and expressions?
In this work, we attempt to determine whether holder and expression information is useful for extracting and then classifying sentiment targets. Specifically, we ask the following research questions:

RQ1: Given the time and difficulty required to annotate opinion holders, expressions, and polarity, is this information useful for extracting sentiment targets?

(a) Does augmenting the input text with holder and expression tags improve target extraction? (b) Does simultaneously predicting holders and expressions improve target extraction? (c) Do target extraction models benefit from predicting the polarity of targets and/or expressions?
RQ2: Can holder and expression information improve polarity classification on extracted targets?
(a) Does augmenting the input text with holders and expressions improve polarity classification? (b) Do potential benefits of augmenting the input depend on how we model the target, i.e., using the [CLS] embeddings, mean pooling the target embeddings, etc.? (c) Can sentiment lexicons provide enough information on expressions to give improvements?
We conduct a series of experiments on eight English sentiment datasets (three with full fine-grained sentiment and five targeted) with state-of-the-art models based on fine-tuned BERT. We show that (1) it is possible to improve target extraction by also trying to predict the polarity, and that (2) classification models benefit from having access to information about sentiment expressions. We also (3) release the code to reproduce the experiments, as well as the scripts to download, preprocess, and collect the datasets into a compatible JSON format, in the hope that this enables future research on the same data.

Related work
Fine-grained approaches to sentiment analysis attempt to discover opinions in text, where each opinion is a tuple of (opinion holder, opinion target, opinion expression, polarity, intensity). Annotating datasets at this granularity requires creating in-depth annotation guidelines and training annotators, and generally leads to lower inter-annotator scores than other sentiment tasks, e.g., document- or sentence-level classification, as deciding on the spans of multiple elements and their relationships is undeniably harder than choosing a single label for a full text. Targeted sentiment, on the other hand, generally concentrates only on target extraction and polarity classification. This simplified annotation can be performed by non-experts or crowd-sourced, making it easier to collect larger datasets for machine learning.

Datasets
The Multi-Perspective Question Answering dataset (MPQA) (Wiebe et al., 2005) is the first dataset to annotate opinion holders, targets, and expressions, together with their relationships. The newswire data leads to complex opinions and a generally difficult task for sentiment models. Normally, the full opinion extraction task is modelled as the extraction of the individual elements (holders, targets, and expressions) and the subsequent resolution of the relationships between them.
The Darmstadt Review Corpora (Toprak et al., 2010) contain annotated opinions for consumer reviews of universities and services. The authors annotate holders, targets, expressions, polarity, modifiers, and intensity. They achieve between 0.5 and 0.8 agreement using the agr method (Wiebe et al., 2005), with higher disagreement on holders, expressions, and what they call "polar targets": targets that have a polarity but no annotated sentiment expression.
The Open Domain Targeted dataset (Mitchell et al., 2013) makes use of crowdsourcing to annotate named entities (NEs) from scraped tweets in English and Spanish (Etter et al., 2013) with their polarities. The authors use majority voting to assign the final labels for the NEs, discarding tweets without sentiment consensus on all NEs. The 2014 SemEval shared task on aspect-based sentiment analysis (Pontiki et al., 2014) includes labelled data from restaurant and laptop reviews for two subtasks: 1) target extraction, which they call "aspect term extraction", and 2) classification of polarity with respect to targets ("aspect term polarity").
As most targeted datasets contain only a single target per sentence, or multiple targets with the same polarity, sentence-level classifiers are strong baselines. To mitigate this, Jiang et al. (2019) create a Challenge dataset that has both multiple targets and multiple polarities in each sentence. Similarly, Wang et al. (2017) point out that most targeted sentiment methods perform poorly with multiple targets and propose TDParse, a corpus of UK election tweets with multiple targets per tweet.

Modelling
Katiyar and Cardie (2016) explore jointly extracting holders, targets, and expressions with LSTMs. They find that adding sentence-level and relation-level dependencies (IS-FROM or IS-ABOUT) improves extraction, but that the LSTM models lag behind CRFs with rich features.
Regarding modelling the interaction between elements, there have been several attempts to jointly learn to extract and classify targets, using factor graphs (Klinger and Cimiano, 2013), multi-task learning (He et al., 2019), or sequence tagging with collapsed tagsets representing both tasks. In general, the benefits are small, suggesting that there is only a weak relationship between target extraction and polarity classification.

Data
One of the difficulties of working with fine-grained sentiment analysis is that there are only a few datasets (even in English) and they come in incompatible, competing data formats, e.g., BRAT or various flavors of XML. With the goal of creating a simple unified format for working on fine-grained sentiment tasks, we take the eight datasets mentioned in Section 2 (MPQA (Wiebe et al., 2005), Darmstadt Services and Universities (Toprak et al., 2010), TDParse (Wang et al., 2017), SemEval Restaurant and Laptop (Pontiki et al., 2014), Open Domain Targeted Sentiment (Mitchell et al., 2013), and the Challenge dataset from Jiang et al. (2019)) and convert them to a standard JSON format. The datasets are sentence- and word-tokenized using NLTK (Loper and Bird, 2002), except for MPQA, DS. Service, and DS. Uni, which already contain sentence and token spans. All polarity annotations are mapped to positive, negative, neutral, and conflict. Each sentence thus contains a sentence id, the tokenized text, and a possibly empty set of opinions, each of which contains a holder, target, expression, polarity, and intensity. We allow for empty holders and expressions in order to generalize to the targeted corpora. Finally, for the corpora that do not come with a suggested train/dev/test split, we use 10 percent of the training data as development data and another 10 percent as test data. For training and testing models, however, we convert the datasets to CoNLL format.

Table 1 presents an overview of the different datasets and highlights important differences between them. The fully fine-grained sentiment datasets (MPQA, DS. Services, and DS. Uni) tend to be larger but have fewer targets annotated, due to a larger number of sentences with no targets. However, the MPQA dataset contains much longer targets than the other datasets: an average of 6 tokens, but a maximum of 56. It also contains more opinion holders and expressions, and these also tend to be longer, all of which marks MPQA as an outlier among the datasets. The distribution of polarity is also highly dependent on the dataset, with DS. Services being the most skewed and SemEval Laptop the least skewed. Finally, the Challenge dataset is by far the largest, with over 11,000 training targets. Additionally, Table 6 in Appendix A shows the percentage of unique targets per dataset, as well as the percentage of targets shared between the training set and the dev and test sets. Again, MPQA has the largest number of unique targets and the least overlap.
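A minimal sketch of what one entry in the unified JSON format described above might look like. The field names and span notation here are illustrative assumptions; the released scripts define the exact schema:

```python
import json

# Hypothetical example of one sentence in the unified JSON format: a sentence
# id, the text, and a possibly empty list of opinions, each with holder,
# target, expression, polarity, and intensity. Spans are written here as
# [surface_strings, "begin:end" character offsets]; the released schema may
# differ in detail.
sentence = {
    "sent_id": "reviews-042",
    "text": "Some others give UMUC 5 stars, but I don't believe it.",
    "opinions": [
        {
            "Source": [["Some others"], ["0:11"]],       # opinion holder (may be empty)
            "Target": [["UMUC"], ["17:21"]],             # sentiment target
            "Polar_expression": [["5 stars"], ["22:29"]],
            "Polarity": "positive",
            "Intensity": "average",
        }
    ],
}

# The format round-trips through standard JSON tooling.
restored = json.loads(json.dumps(sentence))
assert restored["opinions"][0]["Polarity"] == "positive"
```

Allowing `Source` and `Polar_expression` to be empty lists is what lets the targeted corpora, which annotate only targets, share the same schema.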

Experimental Setup
We split the task of targeted sentiment analysis into the extraction of sentiment targets and the subsequent polarity classification of extracted targets, given their context. Figure 2 shows the two tasks and the eight models used in the experiments. As a base model, we fine-tune a pre-trained BERT model for both tasks.

For target extraction, we use the contextualized BERT embeddings as input to a softmax layer and predict the sequence of tags. We compare three prediction strategies:

1. TARG.: The model predicts the labels y ∈ {B, I, O} for the targets only.
2. PRED.: We additionally predict the labels for holders and expressions, i.e., y ∈ {B-holder, I-holder, B-target, I-target, B-expression, I-expression, O}.
3. +POL.: Finally, we add the polarity (positive, negative, neutral) to the annotation-specific BIO tag, which leads to an inventory of 19 labels for the full fine-grained setup and 7 for the targeted setup.
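The label inventories for these tagging schemes can be generated programmatically. The sketch below is illustrative: the exact label strings, and whether the conflict polarity is included in the collapsed tagset, are assumptions not confirmed by the text:

```python
def make_labels(elements, polarities=None):
    """Build a BIO label inventory, optionally collapsed with polarity.

    elements:   annotation types to tag, e.g. ["target"] or
                ["holder", "target", "expression"].
    polarities: if given, polarity is collapsed into the BIO tag for
                polarity-bearing elements (targets and expressions);
                holders never carry polarity.
    """
    labels = ["O"]
    for elem in elements:
        for prefix in ("B", "I"):
            if polarities and elem in ("target", "expression"):
                labels += [f"{prefix}-{elem}-{pol}" for pol in polarities]
            else:
                labels.append(f"{prefix}-{elem}")
    return labels

targ = make_labels(["target"])                          # TARG.: B, I, O
pred = make_labels(["holder", "target", "expression"])  # PRED.: 7 labels
targ_pol = make_labels(["target"],                      # +POL. (targeted): 7 labels
                       ["positive", "negative", "neutral"])
```

For the targeted setup, collapsing three polarities into the target BIO tags gives exactly the 7 labels mentioned above (B/I for each polarity, plus O).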
For polarity classification, we take as a baseline a classification architecture that makes use of the two-sentence training procedure for BERT, prepending the target before the sentence separation token and then adding the full sentence after it. We compare five strategies for producing the input to the softmax layer that predicts the sentiment of the target:

1. [CLS]: uses the embedding of the [CLS] token.
2. FIRST: uses the embedding of the first token of the target.
3. MAX: max pools the BERT embeddings for the tokens in the target.
4. MEAN: mean pools the BERT embeddings for the tokens in the target.
5. MAXMM: takes the max, min, and mean pooled representations and passes their concatenation to the softmax layer, which has been shown to perform well for sentiment tasks (Tang et al., 2014). However, this triples the size of the input representation to the softmax layer.
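The five target-representation strategies can be sketched with plain NumPy (the function name and shapes are illustrative; in the actual models these operate on BERT's contextualized embeddings):

```python
import numpy as np

def target_representation(hidden, target_idx, strategy):
    """Pool contextualized token embeddings into a single target vector.

    hidden:     (seq_len, dim) array of token embeddings; index 0 is [CLS].
    target_idx: indices of the target's tokens within the sequence.
    """
    tgt = hidden[target_idx]                      # (n_target_tokens, dim)
    if strategy == "cls":
        return hidden[0]                          # [CLS] embedding
    if strategy == "first":
        return tgt[0]                             # first target token
    if strategy == "max":
        return tgt.max(axis=0)
    if strategy == "mean":
        return tgt.mean(axis=0)
    if strategy == "maxmm":
        # Concatenating max, min, and mean triples the input dimension.
        return np.concatenate(
            [tgt.max(axis=0), tgt.min(axis=0), tgt.mean(axis=0)])
    raise ValueError(f"unknown strategy: {strategy}")

hidden = np.random.randn(10, 768)                 # toy "BERT" output
vec = target_representation(hidden, [3, 4], "maxmm")
assert vec.shape == (3 * 768,)                    # tripled input size
```

The tripled dimensionality of MAXMM only affects the final softmax layer, so the extra cost is negligible compared to the BERT encoder itself.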
The TARG. and [CLS] models correspond to the models used in previous work and serve as baselines. The extraction and classification models are fine-tuned for 50 epochs using Adam with an initial learning rate of 3e−5 and a linear warmup of 0.1; all other hyperparameters are left at default BERT settings (further details in Appendix B). The best model on the development set is used for testing. Combined with the four input manipulations (Table 2), this leads to eleven extraction experiments (TARG. and PRED. are identical on the original data, which only has annotated targets, so for simplicity we only show the results for TARG.) and twenty classification experiments per dataset. In order to control for the effect of random initialization, we run each experiment 5 times with different random seeds and report the mean and standard deviation.

Training with gold annotations
Given that we are interested in knowing whether it is beneficial to include information about additional annotations (holders, expressions, polarity), we perform experiments where we systematically include them. We do so by adding special tags, e.g., <E>, into the input text surrounding the annotated spans, as shown in Table 2. The models then have access to this information both during training and at test time, albeit in an indirect way. For the first set of experiments, we perform controlled experiments under ideal conditions, i.e., with gold annotations available at test time. This allows us to isolate the effects of incorporating the additional annotations, without worrying about noisy predictions.
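The input augmentation can be sketched as inserting tag tokens around annotated spans. The tag strings and half-open token-span format below are illustrative assumptions:

```python
def augment(tokens, spans, open_tag="<E>", close_tag="</E>"):
    """Insert special tags around annotated token spans.

    spans: (start, end) half-open token-index intervals, assumed
    non-overlapping, e.g. gold expression spans.
    """
    out = []
    for i, tok in enumerate(tokens):
        if any(i == start for start, _ in spans):
            out.append(open_tag)
        out.append(tok)
        if any(i == end - 1 for _, end in spans):
            out.append(close_tag)
    return out

tokens = "Some others give UMUC 5 stars".split()
# Expression span covers "5 stars" (token indices 4-6, half-open).
augmented = augment(tokens, [(4, 6)])
assert augmented == ["Some", "others", "give", "UMUC", "<E>", "5", "stars", "</E>"]
```

Because the tags are ordinary tokens in the input, the model can learn to exploit them without any architectural change, which is what makes the access "indirect".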

Training with predicted expressions
It is equally important to know whether the models are able to use noisy predicted annotations. In order to test this, we train expression prediction models on the three full fine-grained sentiment corpora. We use the same BERT-based model and hyperparameters as the target extraction models above and train five models with different random seeds. Preliminary results suggested that these models had high precision but low recall. Therefore, we take a simple ensemble of the five trained models, where, for each token, we keep labels predicted by at least one of the expression models in order to increase recall.
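The recall-oriented ensemble described above can be sketched as a token-level union over the models' BIO predictions. This is a simplification: how ties between models that predict different non-O labels are broken is an assumption (here, first model wins):

```python
def union_ensemble(predictions):
    """Combine per-token BIO labels from several models.

    predictions: list of label sequences, one per model, all the same
    length. A token is kept as (part of) an expression if at least one
    model tagged it, boosting recall at the cost of precision. If several
    models disagree on a non-O label, the first model's label is kept
    (an illustrative tie-breaking choice).
    """
    combined = []
    for token_labels in zip(*predictions):
        non_o = [lab for lab in token_labels if lab != "O"]
        combined.append(non_o[0] if non_o else "O")
    return combined

model_a = ["O", "B-expression", "O", "O"]
model_b = ["O", "O", "I-expression", "O"]
assert union_ensemble([model_a, model_b]) == ["O", "B-expression", "I-expression", "O"]
```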
Table 2: We inform our models about annotations other than targets by inserting special tags into the input text before and after annotated holders and expressions.

We perform an additional set of experiments where we use sentiment lexicons and assume any word in these lexicons is a sentiment expression. We use the Hu and Liu lexicon (Hu and Liu, 2004) and the SoCal and SoCal-Google lexicons (Taboada et al., 2006).
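The lexicon-based variant simply treats every lexicon word as a one-token sentiment expression. A sketch with a toy lexicon (the actual experiments use the Hu and Liu and SoCal lexicons):

```python
def lexicon_expressions(tokens, lexicon):
    """Mark every token found in the sentiment lexicon as an expression."""
    return ["B-expression" if tok.lower() in lexicon else "O" for tok in tokens]

# Toy stand-in for the Hu & Liu / SoCal lexicons.
toy_lexicon = {"great", "terrible", "love"}

tokens = "The staff were great but the food was terrible".split()
labels = lexicon_expressions(tokens, toy_lexicon)
assert labels == ["O", "O", "O", "B-expression",
                  "O", "O", "O", "O", "B-expression"]
```

Unlike the trained models, this tagger is deterministic and domain-independent, which is consistent with the observation below that lexicons transfer across datasets comparatively well.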

Results
In this section we describe the main results of the extraction and the two classification experiments described in Section 4. Table 3 shows the results for the extraction experiment, where token-level F1 is measured only on targets. The models perform worse than the state of the art, as we did not fine-tune on the SQuAD question answering dataset and in-domain sentiment questions or perform extensive hyperparameter tuning. The average F1 score depends highly on the dataset: MPQA is the most difficult dataset, with 13.1 F1 on the original data, while the Darmstadt Universities corpus is the easiest for target extraction, with 84.6. Augmenting the input text with further annotations but predicting only sentiment targets (TARG. in Table 3) hurts model performance in all cases. Specifically, adding holder tags leads to an average drop of 1.3 percentage points (pp), expressions 1.2, and full 1.5. Additionally predicting these annotations (PRED. in Table 3) leads to mixed results: improvements on MPQA + exp. and Darmstadt Services + holders, no notable difference on MPQA + full and Darmstadt Universities + exp., and a loss on the rest.

Target extraction
Adding the polarity to the target BIO tags (original +POL. in Table 3) leads to the most consistent improvements across experiments, an average of 0.5 pp, with the largest improvement of 1.5 pp on the TDParse dataset. This suggests a weak-to-moderate relationship between polarity and extraction, which contradicts previous conclusions. Finally, further adding the holder and expression tags (+POL. in Table 3) tends to decrease performance.

Table 3: Average token-level F1 scores for the target extraction task across five runs (standard deviation in parentheses). Bold numbers indicate the best model per dataset, while blue and pink highlighting indicates an improvement or loss in performance compared to the original data, respectively.

Table 4 shows the macro F1 scores for the polarity classification task on the gold targets. The model performs better than the best reported results on Challenge (Jiang et al., 2019), and similarly to previous results on the SemEval corpora. Regarding the choice of target representation, FIRST is the strongest overall, with an average of 64.7 F1 across the original eight datasets, followed by MAX (64.6), MEAN (64.4), MAXMM (64.2), and finally [CLS] (64.1). It is, however, unclear exactly which representation is best, as this differs for each dataset. But we can conclude that [CLS] is in general the weakest model, while either FIRST or MAX provides a good starting point. Adding holder annotations to the input text delivers only small improvements on four of the fifteen experiments and leads to losses on seven. The +exp. model, however, leads to significant improvements on 10 experiments. The outlier seems to be Darmstadt Services, which contains a large number of "polar targets", i.e., targets with a polarity but no annotated polar expression; this may explain why including expression information has less effect on this dataset. Finally, +full performs between the original input and +exp.

Polarity classification with predicted annotations
The expression models achieve modest F1 scores when trained and tested on the same dataset (between 15.0 and 47.9) and poor scores when transferred to a different dataset (between 0.9 and 14.9; further details are shown in Table 7 in Appendix A). The lexicons often provide better cross-dataset F1 than the expression models trained on another dataset, as they have relatively good precision on general sentiment terms. Figure 3 shows a heatmap of improvements (blue) and losses (red) on the eight datasets (x-axis) when augmenting the input text with expression tags from the expression models and lexicons (y-axis). We compare the expression-augmented results to the original results for each pooling technique and take the average of these improvements and losses. For a full table of all results, see Table 5 in Appendix A.
Augmenting the input text with predicted sentiment expressions leads to losses in 41 of the 56 averaged experiments shown in Figure 3 (or in 173 of the 280 experiments in Table 5). Curiously, the experiments that use an expression model trained on the same dataset as the classification task, e.g., MPQA-predicted expressions on the MPQA classification task, show the largest losses, the largest of which is on MPQA (-2.78 on average). This seems to indicate that the mismatch between the training predictions, which are near perfect, and the rather poor test predictions is more problematic than cross-dataset predictions, which are similarly noisy on train and test.
The best expression prediction model is the one trained on MPQA, improving performance on Darmstadt Universities, Open, and SemEval Restaurants. This is likely due to the fact that MPQA has the largest number of annotated expressions and a more general domain, leading to expression predictions that generalize better. The expression model trained on Darmstadt Services leads to small benefits on two corpora, while the expression model trained on Darmstadt Universities only leads to losses. The datasets that receive the most benefit from expression annotations are Darmstadt Universities (6/7 experiments) and TDParse (5/7). In both cases, the lexicon-based expression models provide more consistent benefits than the trained expression prediction models. The fact that the dataset that benefits most is TDParse suggests that expression information is most useful when there are multiple targets with multiple polarities.
There is no significant correlation between the performance of the expression prediction model and the performance on the classification task on the three fine-grained datasets. In fact, there are small but insignificant negative correlations (-0.33, p=0.13; -0.16, p=0.48; and -0.26, p=0.25 for macro precision, recall, and F1, respectively, as measured by Pearson's correlation between the expression performances and the F1 of the classification models augmented with these predicted expressions). It seems that the possible benefit depends more on the target dataset than on the actual expression model used.

Conclusion
In this work we have explored the benefit of augmenting targeted sentiment models with holder and sentiment expression information. The experiments have shown that although augmenting the text with holder and expression tags (RQ1 a) or simultaneously predicting them (RQ1 b) has no benefit for target extraction, predicting collapsed BIO + polarity tags consistently improves target extraction (RQ1 c). Furthermore, augmenting the input text with gold expressions generally improves targeted polarity classification (RQ2 a), although it is not clear which target representation strategy is best (RQ2 b). Finally, we have found benefits of including lexicon-based expressions for the more complex targeted datasets (RQ2 c).
The rather poor performance of the learned expression models, and the gap between augmenting with gold versus predicted expressions, reveals the need to improve expression prediction approaches, both by creating larger corpora annotated with sentiment expressions and by further research on the modelling side. Future work on more complex sentiment phenomena should therefore be aware that more high-quality annotated data may be required before current state-of-the-art machine learning approaches can exploit it.
Furthermore, we introduce a common format for eight standard English datasets in fine-grained sentiment analysis and release the scripts to download and preprocess them easily. We plan to include further datasets in our scripts in the future, as well as to extend our work to other languages with available fine-grained corpora.

Table 7: Token-level macro F1 scores for expression prediction models (trained) and lexicon expressions (lexicons) when tested on the three fine-grained datasets (x-axis). The trained model scores are the average and standard deviation across five runs with different random seeds. The lexicon models are deterministic and therefore have only a single score.