YASO: A Targeted Sentiment Analysis Evaluation Dataset for Open-Domain Reviews

Current TSA evaluation in a cross-domain setup is restricted to the small set of review domains available in existing datasets. Such an evaluation is limited, and may not reflect true performance on sites like Amazon or Yelp that host diverse reviews from many domains. To address this gap, we present YASO – a new TSA evaluation dataset of open-domain user reviews. YASO contains 2,215 English sentences from dozens of review domains, annotated with target terms and their sentiment. Our analysis verifies the reliability of these annotations, and explores the characteristics of the collected data. Benchmark results using five contemporary TSA systems show there is ample room for improvement on this challenging new dataset. YASO is available at https://github.com/IBM/yaso-tsa.


Introduction
Targeted Sentiment Analysis (TSA) is the task of identifying the sentiment expressed towards single words or phrases in texts. For example, given the sentence "it's a useful dataset with a complex download procedure", the desired output is identifying dataset and download procedure, with a positive and a negative sentiment expressed towards them, respectively. Our focus in this work is on TSA of English user review data.
Until recently, typical TSA evaluation was in-domain, for example, training on labeled restaurant reviews and testing on restaurant reviews. Newer works (e.g., Rietzler et al. (2020)) began considering a cross-domain setup, training models on labeled data from one or more domains (e.g., restaurant reviews) and evaluating on others (e.g., laptop reviews). For many domains, such as car or book reviews, TSA data is scarce or non-existent. This suggests that cross-domain experimentation is more realistic, as it aims at training on a small set of labeled domains and producing predictions for reviews from any domain. Naturally, the evaluation in this setup should resemble real-world content from sites like Amazon or Yelp that host reviews from dozens or even hundreds of domains. 1 Existing English TSA datasets do not facilitate such a broad evaluation, as they typically include reviews from a small number of domains. For example, the popular SEMEVAL (SE) datasets created by Pontiki et al. (2014, 2015, 2016) (henceforth SE14, SE15, and SE16, respectively) contain English reviews of restaurants, laptops and hotels (see §2 for a discussion of other existing datasets). To address this gap, we present YASO, 2 a new TSA dataset collected over user reviews taken from four sources: the YELP and AMAZON (Keung et al., 2020) datasets of reviews from those two sites; the Stanford Sentiment Treebank (SST) movie reviews corpus (Socher et al., 2013); and the OPINOSIS dataset of reviews from over 50 topics (Ganesan et al., 2010). To the best of our knowledge, while these resources have been previously used for sentiment analysis research, they have not been annotated and used for targeted sentiment analysis. The new YASO evaluation dataset contains 2,215 annotated sentences, on par with the size of existing test sets (e.g., one of the largest is the SE14 test set, with 1,600 sentences).
The annotation of open-domain review data differs from the annotation of reviews from a small fixed list of domains. Ideally, the labels would include both targets that are explicitly mentioned in the text, as well as aspect categories that are implied by it. For example, in "The restaurant serves good but expensive food" there is a sentiment towards the explicit target food as well as towards the implied category price. This approach of aspect-based sentiment analysis (Liu, 2012) is implemented in the SE datasets. However, because the categories are domain-specific, annotating each new domain in this manner first requires defining a list of relevant categories, for example, reliability and safety for cars, or plot and photography for movies. For open-domain reviews, curating these domain-specific categories over many domains, and training annotators to recognize them with per-domain guidelines and examples, is impractical. We therefore restrict our annotation to sentiment-bearing targets that are explicitly present in the review, as in the annotation of open-domain tweets by Mitchell et al. (2013).
While some information is lost by this choice, which may prohibit the use of the collected data in some cases, it offers an important advantage: the annotation guidelines can be significantly simplified. This, in turn, allows for the use of crowd workers who can swiftly annotate a desired corpus with no special training. Furthermore, the produced annotations are consistent across all domains, as the guidelines are domain-independent.
TSA annotation in a pre-specified domain may also distinguish between targets that are entities (e.g., a specific restaurant), a part of an entity (e.g., the restaurant's balcony), or an aspect of an entity (e.g., the restaurant's location). For example, Pontiki et al. (2014) use this distinction to exclude targets that represent entities from their annotation. In an open-domain annotation setup, making such a distinction is difficult, since the reviewed entity is not known beforehand.
Consequently, we take a comprehensive approach and annotate all sentiment-bearing targets, including mentions of reviewed entities or their aspects, named entities, pronouns, and so forth. Notably, pronouns are potentially important for the analysis of multi-sentence reviews. For example, given "I visited the restaurant. It was nice.", identifying the positive sentiment towards It allows linking that sentiment to the restaurant, if the coreference is resolved.
Technically, we propose a two-phase annotation scheme. First, each sentence is labeled by five annotators who identify and mark all target candidates, namely, all terms towards which a sentiment is expressed in the sentence. Next, each target candidate, in the context of its containing sentence, is labeled by several annotators who determine the sentiment expressed towards the candidate: positive, negative, or mixed (if any). 3 The full scheme is exemplified in Figure 1. We note that this scheme is also applicable to general non-review texts (e.g., tweets or news).
Several analyses are performed on the collected data: (i) its reliability is established through a manual analysis of a sample; (ii) the collected annotations are compared with existing labeled data, when available; (iii) differences from existing datasets are characterized. Lastly, benchmark performance on YASO was established in a cross-domain setup. Five state-of-the-art (SOTA) TSA systems were reproduced, using their available codebases, trained on data from SE14, and applied to predict targets and their sentiments over our annotated texts.
In summary, our main contributions are (i) a new domain-independent annotation scheme for collecting TSA labeled data; (ii) a new evaluation dataset with target and sentiment annotations of 2,215 open-domain review sentences, collected using this new scheme; (iii) a detailed analysis of the produced annotations, validating their reliability; and (iv) cross-domain benchmark results on the new dataset for several state-of-the-art (SOTA) baseline systems. All collected data are available online. 4

Related Work

Review datasets The Darmstadt Review Corpora (Toprak et al., 2010) contain annotations of user reviews in two domains: online universities and online services. Later on, SE14 annotated laptop and restaurant reviews (henceforth SE14-L and SE14-R). In SE15 a third domain (hotels) was added, and SE16 expanded the English data for the two original domains (restaurants and laptops). Jiang et al. (2019) created a challenge dataset with multiple targets per sentence, again within the restaurants domain. Saeidi et al. (2016) annotated opinions from discussions on urban neighbourhoods. Clearly, the diversity of the reviews in these datasets is limited, even when taken together.
Non-review datasets The Multi-Perspective Question Answering (MPQA) corpus (Wiebe et al., 2005) was the first opinion mining corpus with a detailed annotation scheme, applied to sentences from news documents. Mitchell et al. (2013) annotated open-domain tweets using an annotation scheme similar to ours, where target candidates were annotated for their sentiment by crowd workers, yet the annotated terms were limited to automatically detected named entities. Other TSA datasets on Twitter data include targets that are either celebrities, products, or companies (Dong et al., 2014), and a multi-target corpus on UK elections data (Wang et al., 2017a).

Figure 1: The UI of our two-phase annotation scheme (detailed in §3). (a) Target candidates annotation (top): multiple target candidates may be marked in one sentence. In this phase, aggregated sentiments for candidates identified by a few annotators may be incorrect (see §5 for further analysis). Therefore, marked candidates are passed through a second phase, (b) sentiment annotation (bottom), which separately collects their sentiments.

Annotation scheme Our annotation scheme is reminiscent of two-phase data collection efforts in other tasks. These typically include an initial phase where annotation candidates are detected, followed by a verification phase that further labels each candidate by multiple annotators. Some examples include the annotation of claims (Levy et al., 2014), evidence (Rinott et al., 2015) or mentions (Mass et al., 2018).
Modeling TSA can be divided into two subtasks: target extraction (TE), identifying all sentiment targets in a given text; and sentiment classification (SC), determining the sentiment towards a specific candidate target in a given text. TSA systems are either pipelined systems running a TE model followed by an SC model (e.g., Karimi et al. (2020)), or end-to-end (sometimes called joint) systems using a single model for the whole task, which is typically regarded as a sequence labeling problem (Li and Lu, 2019; Li et al., 2019a; Hu et al., 2019; He et al., 2019). Earlier works (Tang et al., 2016a,b; Ruder et al., 2016; Ma et al., 2018; Huang et al., 2018; He et al., 2018) utilized pre-transformer models (see surveys by Schouten and Frasincar (2015); Zhang et al. (2018)). Recently, focus has shifted to using pre-trained language models (Sun et al., 2019; Song et al., 2019; Zeng et al., 2019; Phan and Ogunbona, 2020). Generalization to unseen domains has also been explored with pre-training that includes domain-specific data (Xu et al., 2019; Rietzler et al., 2020), adds sentiment-related objectives (Tian et al., 2020), or combines instance-based domain adaptation (Gong et al., 2020).

Input Data
The input data for the annotation was sampled from the following datasets: -YELP: 5 A dataset of 8M user reviews discussing more than 200k businesses. The sample included 129 reviews, each containing 3 to 5 sentences with a length of 8 to 50 tokens. The reviews were sentence-split, yielding 501 sentences.
-AMAZON: 6 A dataset in 6 languages with 210k reviews per language (Keung et al., 2020). The English test set was sampled in the same manner as YELP, yielding 502 sentences from 151 reviews.
-SST: 7 A corpus of 11,855 movie review sentences (Socher et al., 2013) originally extracted from Rotten Tomatoes by Pang and Lee (2005). 500 sentences, with a minimum length of 5 tokens, were randomly sampled from its test set.
-OPINOSIS: 8 A corpus of 7,086 user review sentences from Tripadvisor (hotels), Edmunds (cars), and Amazon (electronics) (Ganesan et al., 2010). Each sentence discusses a topic comprised of a product name and an aspect of the product (e.g., "performance of Toyota Camry"). At least 10 sentences were randomly sampled from each of the 51 topics in the dataset, yielding 512 sentences.
Overall, the input data includes reviews from many domains not previously annotated for TSA, such as books, cars, pet products, kitchens, movies or drugstores. Further examples are detailed in Appendix A.
The annotation input also included 200 randomly sampled sentences from the test sets of SE14-L and SE14-R (100 per domain). Such sentences have an existing annotation of targets and sentiments, which allows a comparison against the results of our proposed annotation scheme (see §5).

YASO
Next, we detail the process of creating YASO. An input sentence was first passed through two phases of annotation, followed by several post-processing steps. Figure 2 depicts an overview of that process, as context to the details given below.

Annotation
Target candidates annotation Each input sentence was tokenized (using spaCy; Honnibal and Montani (2017)) and shown to 5 annotators who were asked to mark target candidates by selecting corresponding token sequences within the sentence. Then, they were instructed to identify the sentiment expressed towards each candidate: positive, negative, or mixed (Figure 1a). This step is recall-oriented, without strict quality control, and some candidates may be detected by only one or two annotators. In such cases, sentiment labels based on annotations from this step alone may be incorrect (see §5 for further analysis).
Selecting multiple non-overlapping target candidates in one sentence was allowed, each with its own sentiment. To avoid clutter and maintain a reasonable number of detected candidates, the selection of overlapping spans was prohibited.

Sentiment annotation
To verify the correctness of the target candidates and their sentiments, each candidate was highlighted within its containing sentence, and presented to 7 to 10 annotators who were asked to determine its sentiment (without being shown the sentiment chosen in the first phase).
For cases in which an annotator believes a candidate was wrongly identified and has no sentiment expressed towards it, a "none" option was added to the original labels (Figure 1b).
To control the quality of the annotation in this step, test questions with an a priori known answer were interleaved between the regular questions. A per-annotator accuracy was computed on these questions, and under-performers were excluded. Initially, a random sample of targets was labeled by two of the authors, and cases in which they agreed were used as test questions in the first annotation batch. Later batches also included test questions formed from unanimously answered questions in previously completed batches.
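As an illustration, the per-annotator filtering described above can be sketched as follows. The data structures and the 0.8 accuracy cutoff are hypothetical; the paper does not specify its exact threshold or implementation.

```python
# Sketch of the test-question quality control described above.
# The dictionary layouts and the 0.8 cutoff are illustrative assumptions.

def annotator_accuracy(answers, gold):
    """Per-annotator accuracy on the interleaved test questions.

    answers: {annotator_id: {question_id: label}}
    gold:    {question_id: a priori known label}
    """
    accuracy = {}
    for annotator, given in answers.items():
        scored = {q: a for q, a in given.items() if q in gold}
        if scored:
            correct = sum(1 for q, a in scored.items() if a == gold[q])
            accuracy[annotator] = correct / len(scored)
    return accuracy


def keep_reliable_annotators(answers, gold, min_accuracy=0.8):
    """Exclude under-performers, keeping annotators at or above the cutoff."""
    return {annotator for annotator, acc in
            annotator_accuracy(answers, gold).items() if acc >= min_accuracy}
```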
All annotations were done using the Appen platform. 9 Overall, 20 annotators took part in the target candidates annotation phase, and 45 annotators worked on the sentiment annotation phase. The guidelines for each phase are given in Appendix B.

Post-processing
The sentiment label of a candidate was determined by majority vote from its sentiment annotation answers, and the percentage of annotators who chose that majority label is the annotation confidence. A threshold t defined on these confidence values (set to 0.7 based on an analysis detailed below) separated the annotations between high-confidence targets (with confidence ≥ t) and low-confidence targets (with confidence < t).
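The majority-vote aggregation and confidence thresholding just described can be sketched as follows. This is a minimal illustration with an assumed function signature; the post-processing code released with YASO is the authoritative implementation.

```python
from collections import Counter

# Majority-vote label and annotation confidence for one target candidate,
# using the threshold t = 0.7 chosen in the paper.

def aggregate_answers(answers, t=0.7):
    """Aggregate per-annotator sentiment answers for one target candidate.

    answers: list of labels, e.g. ["positive", "positive", "none", ...]
    Returns (majority label, confidence, high_confidence flag), where
    confidence is the fraction of annotators choosing the majority label.
    """
    label, votes = Counter(answers).most_common(1)[0]
    confidence = votes / len(answers)
    return label, confidence, confidence >= t
```

For example, with eight answers of which six are "positive", the candidate is a high-confidence positive target (confidence 0.75 ≥ 0.7).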
A target candidate was considered valid when annotated with high confidence with a particular sentiment (i.e., its majority sentiment label was not "none"). The valid targets were clustered by considering overlapping spans as being in the same cluster. Note that non-overlapping targets may be clustered together: for example, if t1, t2, t3 are valid targets, t1 overlaps t2, and t2 overlaps t3, then all three are in one cluster, regardless of whether t1 and t3 overlap. The sentiment of a cluster was set to the majority sentiment of its members.
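This transitive overlap clustering can be implemented by sorting spans and merging chains of overlaps, as in this sketch. Representing targets as (start, end) token offsets with an exclusive end is an assumption; the released code may differ in detail.

```python
# Clustering valid targets by transitive span overlap. Spans are assumed to
# be (start, end) token offsets with an exclusive end.

def cluster_targets(spans):
    """Group spans so that any chain of pairwise overlaps ends up in one
    cluster, as in the t1/t2/t3 example above."""
    clusters = []
    for span in sorted(spans):
        # After sorting by start, a span joins the current cluster iff it
        # starts before the furthest end seen in that cluster.
        if clusters and span[0] < max(end for _, end in clusters[-1]):
            clusters[-1].append(span)
        else:
            clusters.append([span])
    return clusters
```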
The clustering is needed for handling overlapping labels when computing recall. For example, given the input "The food was great", and the annotated (positive) targets The food and food, a system which outputs only one of these targets should be evaluated as achieving full recall. Representing both labels as one cluster allows that (see details in §6). An alternative to our approach is considering any prediction that overlaps a label as correct. In this case, continuing the above example, an output of food or The food alone will have the desired recall of 1. Obviously, this alternative comes with the disadvantage of evaluating outputs with an inaccurate span as correct; e.g., an output of food was great will not be evaluated as an error.

Figure 2: The process for creating YASO, the new TSA evaluation dataset. An input sentence is passed through two phases of annotation (in orange), followed by four post-processing steps (in green).

Results
Confidence The per-dataset distribution of the confidence in the annotations is depicted in Figure 3a. For each confidence bin, one of the authors manually annotated a random sample of 30 target candidates for their sentiments, and computed a per-bin annotation error rate (see Table 1). Based on this analysis, the confidence threshold for valid targets was set to 0.7, since under this value the estimated annotation error rate was high. Overall, around 15%-25% of all annotations were considered as low-confidence (light red in Figure 3a).
Table 1: Estimated annotation error rate per confidence bin: 33.3%, 10%, 3.3%, 3.3%.

Sentiment labels Observing the distribution of sentiment labels annotated with high confidence (Figure 3b), hardly any targets were annotated as mixed, and in all datasets (except AMAZON) there were more positive labels than negative ones. As many as 40% of the target candidates were labeled as not having a sentiment in this phase (grey in Figure 3b), demonstrating the need for the second annotation phase.
Clusters While a cluster may include targets of different sentiments, in practice, cluster members were always annotated with the same sentiment, further supporting the quality of the sentiment annotation. Thus, the sentiment of a cluster is simply the sentiment of its targets. The distribution of the number of valid targets in each cluster is depicted in Figure 3c. As can be seen, the majority of clusters contain a single target. Out of the 31% of clusters that contain two targets, 70% follow the pattern "the/this/a/their <T>" for some term T, e.g., color and the color. The larger clusters of 4 or more targets (2% of all clusters), mostly stem from conjunctions or lists of targets (see examples in Appendix C).
The distribution of the number of clusters identified in each sentence is depicted in Figure 3d. Around 40% of the sentences have one cluster identified within them, and as many as 40% have two or more clusters (for OPINOSIS). Between 20% and 35% of the sentences contain no clusters, i.e., no term with a sentiment expressed towards it was detected. Exploring the connection between the number of identified clusters and properties of the annotated sentences (e.g., length) is an interesting direction for future work.
Summary Table 2 summarizes the statistics of the collected data. It also shows the average pairwise inter-annotator agreement, computed with Cohen's Kappa (Cohen, 1960), which was in the range considered as moderate agreement (substantial for SE14-R) by Landis and Koch (1977).
Overall, the YASO dataset contains 2,215 sentences and 7,415 annotated target candidates. Several annotated sentences are exemplified in Appendix C. To enable further analysis, the dataset includes all candidate targets, not just valid ones, each marked with its confidence, sentiment label (including raw annotation counts), and span. YASO is released along with code for performing the post-processing steps described above, and for computing the evaluation metrics presented in §6.
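For reference, the average pairwise Cohen's Kappa reported above can be computed as in the following sketch. It assumes, for simplicity, that all annotators labeled the same items, which need not hold in the actual annotation batches.

```python
from collections import Counter
from itertools import combinations

def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labeling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's marginal label distribution.
    expected = sum(counts_a[l] * counts_b[l] for l in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)


def average_pairwise_kappa(annotations):
    """Average Kappa over all annotator pairs.

    annotations: {annotator_id: [label per item]} (hypothetical structure).
    """
    pairs = list(combinations(annotations.values(), 2))
    return sum(cohens_kappa(a, b) for a, b in pairs) / len(pairs)
```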

Analysis
Next, three questions pertaining to the collected data and its annotation scheme are explored.

Is the sentiment annotation phase mandatory? Recall that each sentence in the target candidates annotation phase was shown to 5 annotators who chose candidates and their sentiments. As a result, each candidate has 1 to 5 "first-phase" sentiment answers that can be aggregated by majority vote into a detection-phase sentiment label. These can be compared with the sentiment labels from the sentiment annotation phase (which are always based on ≥7 answers). The distribution of the number of answers underlying the detection-phase labeling is depicted in Figure 4a. In most cases, only one or two answers were available (e.g., in ≥80% of cases for YELP). Figure 4b further details how many of them were correct; for example, those based on one answer for YELP were correct in <50% of cases. In such cases, the sentiment annotation phase is essential for obtaining the correct label. On the other hand, when based on three or more answers, the detection-phase sentiments were correct in ≥96% of cases, for all datasets. Such cases may be exempt from the second sentiment annotation phase, thus reducing costs in future annotation efforts.
What are the differences from SE14? The collected clusters for sentences sampled from SE14 were compared with the SE14 original annotations, by pairing each cluster, based solely on its span, with overlapping SE14 annotations (excluding SE14 neutral labels), when available. The sentiments within each pair were compared and, in most cases, were found to be identical (see Table 3). Table 3 further shows that many clusters are exclusively present in YASO: they do not overlap any SE14 annotation. A manual analysis of such clusters revealed only a few were annotation errors (see Table 4). The others fall into one of these categories: (i) Entities, such as company/restaurant names; (ii) Product terms like computer or restaurant; (iii) Other terms that are not product aspects, such as decision in "I think that was a great decision to buy"; (iv) Indirect references, including pronouns, such as It in "It was delicious!". This difference is expected, as such terms are by construction excluded from SE14. In contrast, they are included in YASO since by design it includes all spans people consider as having a sentiment. This makes YASO more complete, while enabling those interested to discard terms as needed for downstream applications. The per-domain frequency of each category, along with additional examples, is given in Table 4.

Table 4: Per-domain frequency of each category for clusters exclusive to YASO (L: laptops, R: restaurants), with examples.

Category  L   R   Examples
Entities  14  6   Apple, iPhone, Culinaria
Product   13  6   laptop, this bar, this place
Other     10  11  process, decision, choice
Indirect  24  11  it, she, this, this one, here
Error     3   4   -

A similar analysis performed on the 20 targets that were exclusively found in SE14 (i.e., not paired with any of the YASO clusters) showed that 8 cases were SE14 annotation errors, some due to complex expressions with an implicit or unclear sentiment. For example, in "They're a bit more expensive then typical, but then again, so is their food.", the sentiment of food is unclear (it is labeled as positive in SE14). Of the other 12 cases not paired with any cluster, three were YASO annotation errors (i.e., not found through our annotation scheme), and the rest were annotated but with low confidence.
What is the recall of the target candidates annotation phase? The last comparison also shows that of the 156 targets 10 annotated in SE14 within the compared sentences, 98% (153) were detected as target candidates, suggesting that our target candidates annotation phase achieved good recall.

Benchmark Results
Recall that the main purpose of YASO is cross-domain evaluation. The following results were obtained by training on data from SE14 (using its original training sets), and predicting targets over YASO sentences. The results are reported for the full TSA task, and separately for the TE and SC subtasks.
Baselines The following five recently proposed TSA systems were reproduced using their available codebases, and trained on the training set of each of the SE14 domains, yielding ten models overall.

Evaluation metrics As a pre-processing step, any predicted target whose span equals the span of a target candidate annotated with low confidence was excluded from the evaluation, since its true label is unclear. The use of clusters within the evaluation requires an adjustment of the computed recall. Specifically, multiple predicted targets contained within one cluster should be counted once, considering the cluster as one true positive. Explicitly, a predicted target and a cluster are span-matched if the cluster contains a valid target with a span equal to the span of the prediction (an exact span match). Similarly, they are fully-matched if they are span-matched and their sentiments are the same. Predictions that were not span-matched to any cluster were considered errors for the TE task (since their span was not annotated as a valid target), and those that were not fully-matched to any cluster were considered errors for the full task. Using span-matches, precision for the TE task is the percentage of span-matched predictions, and recall is the percentage of span-matched clusters. These metrics are similarly defined for the full task using full-matches.
For SC, evaluation was restricted to predictions that were span-matched to a cluster. For a sentiment label l, precision is the percentage of fully-matched predictions with sentiment l (out of all span-matched predictions with that sentiment); recall is the percentage of fully-matched clusters with sentiment l (out of all span-matched clusters with that sentiment). Macro-F1 (mF1) is the average F1 over the positive and negative sentiment labels (mixed was ignored since it was scarce in the data, following Chen and Qian (2020)).
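The cluster-based matching and the TE and full-task metrics described above can be sketched as follows. This is a simplified, single-collection illustration with exact span matching; the low-confidence filtering step is omitted, all names are illustrative, and the evaluation code released with YASO is authoritative.

```python
# Simplified sketch of the cluster-based evaluation. Predictions are
# (span, sentiment) pairs; clusters are (set of valid-target spans,
# cluster sentiment) pairs.

def evaluate(predictions, clusters):
    def matching_cluster(span):
        # Exact span match against any valid target in a cluster.
        return next((c for c in clusters if span in c[0]), None)

    matched = [(pred, matching_cluster(pred[0])) for pred in predictions]
    span_matched = [(p, c) for p, c in matched if c is not None]
    fully_matched = [(p, c) for p, c in span_matched if p[1] == c[1]]

    # Precision counts predictions; recall counts clusters, so several
    # predictions inside one cluster yield a single true positive.
    te_p = len(span_matched) / len(predictions)
    te_r = len({id(c) for _, c in span_matched}) / len(clusters)
    full_p = len(fully_matched) / len(predictions)
    full_r = len({id(c) for _, c in fully_matched}) / len(clusters)
    return {"TE": (te_p, te_r), "full": (full_p, full_r)}
```

For instance, two positive predictions falling inside one positive cluster count as one true positive for recall, matching the "The food"/"food" example of §3.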
Our data release is accompanied by code for computing all the described evaluation metrics.
Results Table 5 presents the results of our evaluation. BAT trained on the restaurants data was the best-performing system for the TE and full TSA tasks on three of the four datasets (YELP, SST and OPINOSIS). For SC, BERT-E2E was the best model on three datasets. Generally, results for SC were relatively high, while TE results for some models were very low, typically stemming from low recall. The precision and recall results for each task are further detailed in Appendix D.
Appendix D also details additional results when relaxing the TE evaluation criterion from exact span-matches to overlapping span-matches -where a predicted target and a cluster are span-matched if their spans overlap. While with this relaxed evaluation the TE performance was higher (as expected), the absolute numbers suggest a significant percentage of errors were not simply targets predicted with a misaligned span.
TSA task performance was lowest for SST, perhaps due to its domain of movie reviews, which is furthest of all datasets from the product reviews training data. Interestingly, it was also the dataset with the lowest level of agreement among humans (see Figure 3a).
The choice of the training domain is an important factor for most algorithms. This is notable, for example, in the TE performance obtained for YELP: the gap between training on data from the laptops domain or the restaurants domain is ≥20 points (in favor of the latter) for all algorithms (except LCF). A likely cause is that the YASO data sampled from YELP has a fair percentage of reviews on food-related establishments. Future work may further use YASO to explore the impact of the similarity between the training and test domains, as well as develop new methods that are robust to the choice of the training domain.

Conclusion
We collected YASO, a new TSA evaluation dataset of open-domain user reviews. Unlike existing review datasets, YASO is not limited to any particular review domain, thus providing a broader perspective for cross-domain TSA evaluation. Benchmark results established in such a setup with contemporary TSA systems show there is ample headroom for improvement on YASO. YASO was annotated using a new scheme for creating TSA labeled data that can also be applied to non-review texts. The reliability of the annotations obtained by this scheme has been verified through a manual analysis of a sample and a comparison to existing labeled data.
One limitation of our scheme is that aspect categories with a sentiment implied by the reviews were excluded, since their annotation requires pre-specifying the domain along with its associated categories. While this may limit research for some applications, the dataset is useful in many real-world use cases. For example, given a brand name, one may query a user reviews corpus for sentences containing it, and analyze the sentiment towards that brand in each sentence along with the sentiment expressed towards other terms in these sentences.
Future work may improve upon the presented results by training on multiple domains or datasets, adapting pre-trained models to the target domains in an unsupervised manner (e.g., Rietzler et al. (2020)), exploring various data augmentation techniques, or utilizing multi-task or weak-supervision algorithms. Another interesting direction for further research is annotating opinion terms within the YASO sentences, facilitating their co-extraction with corresponding targets (Wang et al., 2016, 2017b), or as triplets of target term, sentiment, and opinion term (Peng et al., 2020; Xu et al., 2020b).
All benchmark data collected in this work are available online. 17 We hope that these data will facilitate further advancements in the field of targeted sentiment analysis.

B.1 Target Candidates Annotation

Below are the guidelines for the labeling task of detecting potential targets and their sentiment.

General instructions
In this task you will review a set of sentences. Your goal is to identify items in the sentences that have a sentiment expressed towards them.
Steps 1. Read the sentence carefully.
2. Identify items that have a sentiment expressed towards them.
3. Mark each item, and for each selection choose the expressed sentiment: (a) Positive: the expressed sentiment is positive.
(b) Negative: the expressed sentiment is negative.
(c) Mixed: the expressed sentiment is both positive and negative.
4. If there are no items with a sentiment expressed towards them, proceed to the next sentence.

Rules & Tips
• Select all items in the sentence that have a sentiment expressed towards them.
• It could be that there are several correct overlapping selections. In such cases, it is OK to choose only one of these overlapping selections.
• The sentiment towards a selected item should be expressed by other parts of the sentence; it cannot come from within the selected item itself (see Example #2 below).
• Under each question is a comments box. Optionally, you can provide question-specific feedback in this box. This may include a rationalization of your choice, a description of an error within the question or the justification of another answer which was also plausible. In general, any relevant feedback would be useful, and will help in improving this task.

Examples
Here are a few example sentences, categorized into several example types.

Long selected items
There is no restriction on the length of a selected item, as long as there is an expressed sentiment towards it in the sentence (which does not come from within the marked item). Note: It is also a valid choice to select food along with its detailed description: food from the Italian restaurant near my office, or to add the prefix The to the selection (or both). The selection must be a coherent phrase; food from the is not a valid selection. Since these selections all overlap, it is OK to select one of them.

B.2 Sentiment Annotation
Below are the guidelines for labeling the sentiment of identified target candidates.

General instructions
In this task you will review a set of sentences, each containing one marked item. Your goal is to determine the sentiment expressed in the sentence towards the marked item.
Steps 1. Read the sentence carefully.
2. Identify the sentiment expressed in the sentence towards the marked item, by selecting one of these four options: (a) Positive: the expressed sentiment is positive.
(b) Negative: the expressed sentiment is negative.
(c) Mixed: the expressed sentiment is both positive and negative.
(d) None: there is no sentiment expressed towards the item.
3. If there are no items with a sentiment expressed towards them, proceed to the next sentence.

Rules & Tips
• The sentiment should be expressed towards the marked item; it cannot come from within the marked item itself (see Example #2 below).
• A sentence may appear multiple times, each time with one marked item. Different marked items may have different sentiments expressed towards each of them in one sentence (see Example #3 below).
• Under each question is a comments box. Optionally, you can provide question-specific feedback in this box, such as a rationale for your choice, a description of an error in the question, or a justification for another answer that was also plausible. In general, any relevant feedback is useful, and will help in improving this task.

Different marked items in one sentence
(In the examples below, the marked item is shown in [brackets].)
Example #3.1:
The [food] was good, but the atmosphere was awful. Answer: Positive
The food was good, but the [atmosphere] was awful. Answer: Negative
Example #3.2:
The [camera] has excellent lens. Answer: Positive
The camera has excellent [lens]. Answer: Positive
Example #3.3:
My new [camera] has excellent lens, but its price is too high. Answer: Mixed
Explanation: There is a positive sentiment towards the camera, due to its excellent lens, and also a negative sentiment, because its price is too high, so the correct answer is Mixed.
My new camera has excellent [lens], but its price is too high. Answer: Positive
My new camera has excellent lens, but its [price] is too high. Answer: Negative

Marked items without a sentiment
Below are some examples of marked items without an expressed sentiment in the sentence. Where there is an expressed sentiment towards other words in the same sentence, that is exemplified as well.
Example #4.1: Microwave, refrigerator, coffee maker in room. Answer: None
Example #4.2: Note that they do not serve beer, you must bring your own. Answer: None
Example #4.3: The cons are more annoyances that can be lived with. Answer: None
Explanation: While the marked item contains a negative sentiment, there is no sentiment expressed towards the marked item itself.
Example #4.4:
working with Mac is so much easier, so many cool features. Answer: None
working with Mac is so much easier, so many cool features. Answer: Positive
working with Mac is so much easier, so many cool features. Answer: Positive
Example #4.5:
The battery life is excellent - 6-7 hours without charging. Answer: None
The battery life is excellent - 6-7 hours without charging. Answer: Positive
Example #4.6: I wanted a computer that was quiet, fast, and that had overall great performance. Answer: None

"the" can be a part of a marked item
I feel a little bit uncomfortable in using the Mac system. Answer: Negative
I feel a little bit uncomfortable in using the Mac system. Answer: Negative
I feel a little bit uncomfortable in using the Mac system. Answer: None

Long marked items
There is no restriction on the length of a marked item, so long as there is an expressed sentiment towards it in the sentence (which does not come from within the marked item).
The [food] from the Italian restaurant near my office was very good. Answer: Positive
The [food from the Italian restaurant near my office] was very good. Answer: Positive
The food from the Italian restaurant near my office was very good. Answer: None

Idioms
A sentiment may be conveyed with an idiom - be sure you understand the meaning of an input sentence before answering. When unsure, look up potential idioms online.
The laptop's performance was in the middle of the pack, but so is its price. Answer: None
Explanation: in the middle of the pack conveys neither a positive nor a negative sentiment, and certainly not both (so the answer is not Mixed either).

C Annotation Examples
Table 6 presents sentences included in YASO, along with the annotated targets and their corresponding sentiments found within each sentence. A target t that has a positive sentiment expressed towards it is marked as [t]P; similarly, [t]N is used for a negative sentiment. For brevity, the examples only show the valid targets annotated within the sentences, hiding any low-confidence annotations and target candidates that were annotated as not having a sentiment in the second annotation phase. As the examples show, annotated valid targets may overlap, demonstrating the need for defining target clusters. Table 7 further exemplifies sentences in which a cluster containing more than four valid targets was detected.
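The grouping of overlapping valid targets into clusters can be sketched as follows. This is a minimal illustration, assuming a dictionary representation of annotated targets with character offsets and a greedy overlap-based grouping rule; it is not the exact implementation used to build YASO.

```python
# Illustrative sketch: group annotated targets whose character spans overlap
# into clusters. The data layout and the greedy grouping rule are assumptions
# for illustration, not YASO's actual implementation.

def spans_overlap(a, b):
    """True if the half-open character spans (start, end) share any offset."""
    return a[0] < b[1] and b[0] < a[1]

def cluster_targets(targets):
    """Greedily add each target to the first cluster it overlaps with.

    `targets` is a list of dicts with 'begin', 'end', 'text', 'sentiment'.
    """
    clusters = []
    for t in sorted(targets, key=lambda t: (t["begin"], t["end"])):
        span = (t["begin"], t["end"])
        for cluster in clusters:
            if any(spans_overlap(span, (m["begin"], m["end"])) for m in cluster):
                cluster.append(t)
                break
        else:
            clusters.append([t])
    return clusters

# Overlapping targets from "The food from the Italian restaurant near my
# office was very good." (character offsets into that sentence):
targets = [
    {"begin": 4, "end": 8, "text": "food", "sentiment": "positive"},
    {"begin": 4, "end": 51,
     "text": "food from the Italian restaurant near my office",
     "sentiment": "positive"},
    {"begin": 0, "end": 51,
     "text": "The food from the Italian restaurant near my office",
     "sentiment": "positive"},
]
print(len(cluster_targets(targets)))  # 1: all three targets overlap
```

With disjoint spans, each target would form its own cluster; the sentiment of a cluster can then be decided from the labels of its member targets.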

D Detailed Benchmark Results
In addition to the main benchmark results presented in the paper, Table 8 shows the precision, recall and F 1 for target extraction and the entire task. For sentiment classification, the same metrics are separately reported for the positive and negative sentiment labels, as well as macro-F 1 over these two classes. Table 9 presents results similar to Table 5 with another TE evaluation criterion, where a predicted target and a cluster are span-matched if their spans overlap. This is a more relaxed criterion than the one used in the main results (which considers a predicted target and a cluster as span-matched if the cluster contains a target with a span equal to the span of the prediction).
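The two span-matching criteria can be sketched as follows. The function names and the representation of a cluster as a list of character spans are assumptions for illustration, not the evaluation code released with YASO.

```python
# Illustrative sketch of the two TE span-matching criteria: the strict
# criterion used in the main results (the cluster contains a target whose
# span equals the prediction's span) and the relaxed overlap criterion of
# Table 9. Names and data layout are assumptions, not YASO's actual code.

def exact_match(pred_span, cluster_spans):
    """Strict: some target in the cluster has exactly the predicted span."""
    return any(span == pred_span for span in cluster_spans)

def overlap_match(pred_span, cluster_spans):
    """Relaxed: the predicted span overlaps some target span in the cluster."""
    return any(pred_span[0] < e and b < pred_span[1] for (b, e) in cluster_spans)

cluster = [(4, 8), (4, 51), (0, 51)]   # spans of the targets in one cluster
print(exact_match((4, 8), cluster))    # True: an identical span is in the cluster
print(exact_match((0, 8), cluster))    # False under the strict criterion
print(overlap_match((0, 8), cluster))  # True: (0, 8) overlaps (4, 8)
```

A prediction counted as a match under the strict criterion is always a match under the relaxed one, so the overlap-based scores upper-bound the main results.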
