Structured Sentiment Analysis as Dependency Graph Parsing

Structured sentiment analysis attempts to extract full opinion tuples from a text, but over time this task has been subdivided into smaller and smaller sub-tasks, e.g., target extraction or targeted polarity classification. We argue that this division has become counterproductive and propose a new unified framework to remedy the situation. We cast the structured sentiment problem as dependency graph parsing, where the nodes are spans of sentiment holders, targets and expressions, and the arcs are the relations between them. We perform experiments on five datasets in four languages (English, Norwegian, Basque, and Catalan) and show that this approach leads to strong improvements over state-of-the-art baselines. Our analysis shows that refining the sentiment graphs with syntactic dependency information further improves results.


Introduction
Structured sentiment analysis, i.e., the task of predicting a structured sentiment graph like the ones in Figure 1, can be theoretically cast as an information extraction problem in which one attempts to find all of the opinion tuples $O = O_1, \ldots, O_n$ in a text. Each opinion $O_i$ is a tuple $(h, t, e, p)$ where $h$ is a holder who expresses a polarity $p$ towards a target $t$ through a sentiment expression $e$, implicitly defining pairwise relationships between elements of the same tuple. Liu (2012) argues that all of these elements are essential to fully resolve the sentiment analysis problem.
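As a minimal sketch of the opinion tuple defined above, the following Python snippet stores one tuple (h, t, e, p) with token-offset spans. The class name `Opinion`, the helper `span_text`, the span convention (end-exclusive offsets) and the example sentence are illustrative assumptions, not part of the datasets or the released code.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive (assumed convention)


@dataclass(frozen=True)
class Opinion:
    """One opinion tuple (h, t, e, p); holder and target may be None."""
    holder: Optional[Span]
    target: Optional[Span]
    expression: Span
    polarity: str


def span_text(tokens, span):
    """Return the surface text of a span, or None for an empty element."""
    if span is None:
        return None
    start, end = span
    return " ".join(tokens[start:end])


# Invented example sentence, not taken from any of the datasets.
tokens = ["The", "staff", "were", "very", "friendly"]
opinion = Opinion(holder=None, target=(0, 2), expression=(3, 5), polarity="positive")
```

Here the holder is left empty, which mirrors the statement that holders and targets can be null.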
However, most research on sentiment analysis focuses either on a variety of sub-tasks, which avoids performing the full task, or on simplified and idealized tasks, e.g., sentence-level binary polarity classification.
We argue that the division of structured sentiment into these sub-tasks has become counterproductive, as reported experiments are often not sensitive to whether a given addition to the pipeline improves the overall resolution of sentiment, or do not take into account the inter-dependencies of the various sub-tasks. As such, we propose a unified approach to structured sentiment which jointly predicts all elements of an opinion tuple and their relations. Moreover, we cast sentiment analysis as a dependency graph parsing problem, where the sentiment expression is the root node, and the other elements have arcs which model the relationships between them. This methodology also enables us to take advantage of recent improvements in semantic dependency parsing (Dozat and Manning, 2018; Oepen et al., 2020; Kurtz et al., 2020) to efficiently learn a sentiment graph parser.
This perspective also allows us to unify a number of approaches, including targeted sentiment and opinion tuple mining. We aim to answer RQ1: whether graph-based approaches to structured sentiment outperform state-of-the-art sequence labeling approaches, and RQ2: how to best encode structured sentiment as parsing graphs. We perform experiments on five standard datasets in four languages (English, Norwegian, Basque, Catalan) and show that graph-based approaches outperform state-of-the-art baselines on all datasets on several standard metrics, as well as on our proposed novel (unlabeled and labeled) sentiment graph metrics. We further propose methods to inject linguistic structure into the sentiment graphs using syntactic dependencies. Our main contributions are therefore 1) proposing a holistic approach to structured sentiment through sentiment graph parsing, 2) introducing new evaluation metrics for measuring model performance, and 3) extensive experimental results that outperform state-of-the-art baselines. Finally, we release the code and datasets to enable future work on this problem.
Figure 1: A structured sentiment graph is composed of a holder, target, sentiment expression, their relationships and a polarity attribute. Holders and targets can be null.

Related Work
Structured sentiment analysis can be broken down into five sub-tasks: i) sentiment expression extraction, ii) sentiment target extraction, iii) sentiment holder extraction, iv) defining the relationship between these elements, and v) assigning polarity. Previous work on information extraction has used pipeline methods which first extract the holders, targets, and expressions (tasks i-iii) and subsequently predict their relations (task iv), mostly on the MPQA dataset (Wiebe et al., 2005). CRFs combined with a number of external resources (sentiment lexicons, dependency parsers, named-entity taggers) (Choi et al., 2006; Yang and Cardie, 2012) are strong baselines. Given the small size of the training data and the complicated task, these techniques often still outperform neural models, such as BiLSTMs (Katiyar and Cardie, 2016). Transition-based end-to-end approaches have shown some potential (Zhang et al., 2019). However, all of this work ignores the polarity classification subtask.
End2End sentiment analysis is a recently proposed subtask which combines targeted sentiment (tasks ii and v) and sentiment expression extraction (task i), without requiring the resolution of relationships between targets and expressions. Wang et al. (2016) augment the ABSA datasets with sentiment expressions, but provide no details on the annotation process or any inter-annotator agreement. He et al. (2019) make use of this data and propose a multi-layer CNN (IMN) to create hidden representations $h$ which are then fed to a target and opinion extraction module (AE), which is also a multi-layer CNN. This module predicts $\hat{y}_{ae}$, a sequence of BIO tags that predict the presence or absence of targets and expressions. After jointly predicting the targets and expressions, a second multi-layer CNN with a final self-attention network is used to classify the polarity, again as a sequence labeling task (AS). This second module combines the information from $h$ and $\hat{y}_{ae}$ by incorporating the predicted probability of a token to be a target in the formulation of self-attention. Finally, an iterative message-passing algorithm updates $h$ using the predictions from all the modules at the previous timestep.
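To make the BIO tagging scheme mentioned above concrete, the following is a minimal sketch of how a sequence of BIO tags can be decoded into labeled spans. The tag names (`TARGET`, `EXP`) and the function `bio_to_spans` are hypothetical and chosen for illustration; they are not taken from the IMN or RACL implementations.

```python
def bio_to_spans(tags):
    """Decode BIO tags into (label, start, end) spans, end exclusive."""
    spans = []
    start, label = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        # Close the open span on "O", on a new "B-", or on a label change.
        if start is not None and (
            tag == "O" or tag.startswith("B-") or tag[2:] != label
        ):
            spans.append((label, start, i))
            start, label = None, None
        # Open a span on "B-"; a stray "I-" is treated leniently as a start.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return spans


example_tags = ["B-TARGET", "I-TARGET", "O", "B-EXP", "I-EXP"]
example_spans = bio_to_spans(example_tags)
```

Note that such a decoding yields flat spans only: nothing in the tag sequence records which expression belongs to which target, which is the relational information the End2End setting leaves unresolved.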
Chen and Qian (2020) instead propose Relation-Aware Collaborative Learning (RACL). This model creates task-specific representations by first embedding a sentence, passing it through a shared feed-forward network and finally a task-specific CNN. This approach then models interactions between each pair of sub-tasks (target extraction, expression extraction, sentiment classification) by creating pairwise weighted attention representations. These are then concatenated and used to create the task-specific predictions. The authors finally stack several RACL layers, using the output from the previous layer as input for the next.
Both models perform well on the augmented SemEval data, but it is unlikely that these annotations are adequate for full structured sentiment, as Wang et al. (2016) only provide expression annotations for sentences that have targets, generally only include sentiment-bearing words (not phrases), and do not specify the relationship between target and expression.
Finally, the recently proposed aspect sentiment triplet extraction (Peng et al., 2019;?) attempts to extract targets, expressions and their polarity. However, the datasets used are unlikely to be adequate, as they augment available targeted datasets, but do not report annotation guidelines, procedure, or inter-annotator agreement.
Graph parsing: Syntactic dependency graphs are regularly used in applications, supplying them with necessary grammatical information (Mintz et al., 2009; Cui et al., 2005; Björne et al., 2009; Johansson and Moschitti, 2012; Lapponi et al., 2012). The dependency graph structures used in these systems are predominantly restricted to trees. While trees are sufficient to encode syntactic dependencies, they are not expressive enough to handle meaning representations, which require nodes to have multiple incoming arcs, or no incoming arcs at all (Kuhlmann and Oepen, 2016). While much of the early research on parsing these new structures (Oepen et al., 2014, 2015) focused on specialized decoding algorithms, Dozat and Manning (2018) presented a neural dependency parser that essentially relies only on its neural network structure to predict any type of dependency graph without restrictions to certain structures. Using the parser's ability to learn arbitrary dependency graphs, Kurtz et al. (2020) phrased the task of negation resolution (Morante and Blanco, 2012; Morante and Daelemans, 2012) as a graph parsing task. This transformed the otherwise flat representations to dependency structures that directly encode the often overlapping relations between the building blocks of multiple negation instances at the same time. In a simpler fashion, Yu et al. (2020) exploit the parser of Dozat and Manning (2018) to predict spans of named entities.

Datasets
We here focus on datasets that annotate the full task of structured sentiment as described initially. We perform experiments on five structured sentiment datasets in four languages, the statistics of which are shown in Table 1. The largest available structured sentiment dataset is the NoReC Fine dataset (Øvrelid et al., 2020), a multi-domain dataset of professional reviews in Norwegian, annotated for structured sentiment. MultiB EU and MultiB CA (Barnes et al., 2018) are hotel reviews in Basque and Catalan, respectively. MPQA (Wiebe et al., 2005) annotates newswire text in English. Finally, DS Unis (Toprak et al., 2010) annotates English reviews of online universities and e-commerce. In our experiments, we use only the university reviews, as the e-commerce reviews have a large number of 'polar targets', i.e., targets with a polarity, but no accompanying sentiment expression.
While all the datasets annotate holders, targets, and expressions, the frequency and distribution of these vary. Regarding holders, MPQA has the most (2,054) and DS Unis has the fewest (94), whereas NoReC Fine has the largest number of targets (8,923) and expressions (11,115). The average length of holders (2.6 tokens) and targets (6.1 tokens) in MPQA is also considerably higher than in the other datasets.
It is also worth pointing out that MPQA and DS Unis additionally include neutral polarity. In the case of MPQA the neutral class refers to verbs which are subjective but do not convey polarity, e.g., 'say', 'opt for'. In DS Unis, however, the neutral label tends to indicate expressions that could entail mixed polarity or are polar under the right conditions, e.g., 'the classes were not easy' is considered neutral, as it is possible for difficult classes to be desirable at a university. MultiB EU and MultiB CA also have labels for strong positive and strong negative, which we map to positive and negative, respectively. Finally, NoReC Fine includes intensity annotations (strong, normal, slight), which we disregard for the purposes of these experiments.

Modeling
This section describes how we define and encode sentiment graphs, details the neural dependency graph models, and presents the state-of-the-art baselines for end-to-end sentiment analysis (target and expression extraction, plus polarity classification).

Graph Representations
Structured sentiment graphs as in Figure 1 are directed graphs made up of a set of labeled nodes and a set of unlabeled edges connecting pairs of nodes. Nodes in the structured sentiment graphs can span multiple tokens and may have multiple incoming edges. The resulting graphs can have multiple entry points (roots), are not necessarily connected, and not every token is a node in the graph. The sentence's sentiment expressions correspond to the roots of the graphs, connecting explicitly to their respective holders and targets. In order to apply the algorithm of Dozat and Manning (2018), we simplify these structures into bi-lexical dependency graphs visualized in Figure 2. Here, nodes correspond one-to-one to the tokens of the sequence and follow the same linear order. The edges are drawn as arcs in the half-plane above the sentence, connecting heads to dependents. Similarly to the source structures, the graphs can have multiple roots and nodes can have multiple or no incoming arcs. For some rare instances of structured sentiment graphs, the reduction to dependency graphs is lossy, as they do not allow multiple arcs to share the same head and dependent. This results in a slight mismatch between the learned and the intended representations.
Table 2: Metrics used to evaluate performance. Column +/− indicates whether polarity is included or not. The main metrics are Targeted F1, which allows us to compare to methods that do not perform the full task, and SF1, which best represents the full task.
The choice of how to encode the sentiment graphs as parsing graphs allows for several alternative representations, depending on the choice of head/dependent status of individual tokens in the target/holder/expression spans of the sentiment graph. We here propose two simple parsing graph representations: head-first and head-final, which are shown in Figure 2. For head-first, we set the first token of the sentiment expression as a root node, and similarly set the first token in each holder and target span as the head of the span with all other tokens within that span as dependents. The labels simply denote the type of relation (target/holder) and for sentiment expressions, additionally encode the polarity. Head-final is similar, but instead sets the final token of spans as the heads, and the final token of the sentiment expression as the root node.
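The head-first and head-final encodings described in this section can be sketched as a conversion from one opinion (spans plus polarity) to a set of labeled arcs. This is a minimal illustration under stated assumptions: the artificial root index `ROOT = -1`, the arc format `(head, dependent, label)`, the label strings such as `exp:positive`, and the function names are all hypothetical and need not match the released code.

```python
ROOT = -1  # assumed index for the artificial root node


def span_arcs(span, label, head_first):
    """Pick the span head and attach every other span token to it."""
    start, end = span  # end exclusive
    head = start if head_first else end - 1
    arcs = {(head, dep, label) for dep in range(start, end) if dep != head}
    return head, arcs


def encode(expression, polarity, holder=None, target=None, head_first=True):
    """Encode one opinion as a set of (head, dependent, label) arcs."""
    exp_label = f"exp:{polarity}"  # expression label also carries polarity
    exp_head, arcs = span_arcs(expression, exp_label, head_first)
    arcs.add((ROOT, exp_head, exp_label))  # expression head is a root
    for span, label in ((holder, "holder"), (target, "target")):
        if span is None:  # holders and targets may be empty
            continue
        head, inner = span_arcs(span, label, head_first)
        arcs |= inner
        arcs.add((exp_head, head, label))  # expression head -> span head
    return arcs


# Invented sentence: "The staff were very friendly"
# target = tokens 0-1, expression = tokens 3-4, no holder.
head_first_arcs = encode((3, 5), "positive", target=(0, 2), head_first=True)
head_final_arcs = encode((3, 5), "positive", target=(0, 2), head_first=False)
```

Both encodings produce the same number of arcs for a given opinion; they differ only in which token of each span acts as the head.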

Proposed model
The neural graph parsing model used in this work is a reimplementation of the neural parser by Dozat and Manning (2018) which was used by Kurtz et al. (2020) for negation resolution. The parser learns to score each possible arc to then finally predict the output structure simply as a collection of all positively scored arcs. The base of the network structure is a bidirectional LSTM (BiLSTM) that processes the input sentence both from left-to-right and right-to-left, to create contextualized representations $c_1, \ldots, c_n = \mathrm{BiLSTM}(w_1, \ldots, w_n)$ where $w_i$ is the concatenation of a word embedding, POS tag embedding, lemma embedding, and character embedding created by a character-based LSTM for the $i$th token. In our experiments, we further augment the token representations with pretrained contextualized embeddings from multilingual BERT (Xu et al., 2019). We use multilingual BERT as several languages did not have available monolingual BERT models at the time of the experiments (Catalan, Norwegian).
The contextualized embeddings are then processed by two feedforward neural networks (FNN), creating specialized representations for potential heads and dependents, $h_i = \mathrm{FNN}_{head}(c_i)$ and $d_i = \mathrm{FNN}_{dep}(c_i)$. The scores for each possible arc-label combination are computed by a final bilinear transformation using the tensor $U$. Its inner dimension corresponds to the number of sentiment graph labels plus a special NONE label, indicating the absence of an arc, which allows the model to predict arcs and labels jointly: $\mathrm{score}(h_i, d_j) = h_i^\top U d_j$.
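The bilinear scoring step described above can be illustrated with a small NumPy sketch (assuming NumPy is available). Everything here is a toy stand-in: the random matrices replace trained parameters, the one-layer ReLU `ffn` replaces the feedforward networks, the dimensions and label inventory are invented, and decoding by taking the best-scoring label per token pair (keeping the arc unless that label is NONE) is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, ctx_dim, hid_dim = 4, 8, 6
labels = ["NONE", "holder", "target", "exp:positive", "exp:negative"]


def ffn(x, weight):
    """Toy one-layer feedforward network with ReLU (stand-in for FNN)."""
    return np.maximum(x @ weight, 0.0)


# Random stand-ins for BiLSTM outputs c_1..c_n and trained parameters.
C = rng.normal(size=(n_tokens, ctx_dim))
W_head = rng.normal(size=(ctx_dim, hid_dim))
W_dep = rng.normal(size=(ctx_dim, hid_dim))
U = rng.normal(size=(hid_dim, len(labels), hid_dim))  # inner dim = labels

H = ffn(C, W_head)  # head representations h_i
D = ffn(C, W_dep)   # dependent representations d_j

# scores[i, j, l] = h_i^T U[:, l, :] d_j for every head i, dependent j, label l
scores = np.einsum("ia,alb,jb->ijl", H, U, D)

# Joint arc and label decoding: keep a pair unless NONE scores highest.
best = scores.argmax(axis=-1)
arcs = [
    (i, j, labels[best[i, j]])
    for i in range(n_tokens)
    for j in range(n_tokens)
    if best[i, j] != 0
]
```

Because the NONE label competes with the real labels for every token pair, a single argmax decides both whether an arc exists and which label it carries.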

Baselines
We compare our proposed graph prediction approach with three state-of-the-art baselines for extracting targets and expressions and predicting the polarity: IMN, RACL, as well as RACL-BERT, which also incorporates contextualized embeddings. Instead of using BERT Large, we use the cased BERT-multilingual-base in order to fairly compare with our own models. Note, however, that our model does not update the mBERT representations, putting it at a disadvantage to RACL-BERT. We also compare with previously reported extraction results from Barnes et al. (2018) and Øvrelid et al. (2020).

Evaluation
As we are interested not only in extraction or classification, but rather in the full structured sentiment task, we propose metrics that capture the relations between all predicted elements, while enabling comparison with previous state-of-the-art models on different subtasks. The main metrics we use to rank models are Targeted F1 and Sentiment Graph F1.
Table 3: Experiments comparing our sentiment graph approaches (Head-first/Head-final) using mBERT with the sequence-labeling baselines (IMN, RACL, RACL-BERT). Underlined numbers indicate the best result for the metric and dataset. * indicates approach is significantly better than second best (p < 0.05), as determined by a bootstrap with replacement test. † indicates results that are not comparable, as they were calculated with 10-fold cross-validation.
Token-level F1 for Holders, Targets, and Expressions: To easily compare our models to pipeline models, we evaluate how well these models are able to identify the elements of a sentiment graph with token-level F1.
Targeted F1: This is a common metric in targeted sentiment analysis (also referred to as F1-i (He et al., 2019) or ABSA F1 (Chen and Qian, 2020)). A true positive requires the combination of exact extraction of the sentiment target, and the correct polarity.
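A minimal sketch of the targeted metric as described: a prediction counts as a true positive only if the target span matches exactly and the polarity is correct. The tuple layout `(sentence_id, start, end, polarity)` and the function name are assumptions for illustration, and the toy data is invented.

```python
def targeted_f1(gold, pred):
    """F1 over sets of (sentence_id, start, end, polarity) tuples.

    Set intersection enforces both an exact span match and equal polarity.
    """
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold_targets = {(0, 0, 2, "positive"), (0, 5, 6, "negative"), (1, 1, 3, "positive")}
pred_targets = {
    (0, 0, 2, "positive"),  # exact span, correct polarity: true positive
    (0, 5, 6, "positive"),  # exact span, wrong polarity: error
    (1, 1, 2, "positive"),  # partial span, correct polarity: error
}
targeted_score = targeted_f1(gold_targets, pred_targets)
```

In this toy case only one of three predictions counts, which shows how strict the exact-match requirement is compared to token-level F1.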
Parsing graph metrics: We additionally compute graph-level metrics to determine how well the models predict the unlabeled and labeled arcs of the parsing graphs: Unlabeled F1 (UF1), Labeled F1 (LF1). These measure the amount of (in)correctly predicted arcs and labels, as the harmonic mean of precision and recall (Oepen et al., 2014). These metrics inform us of the local properties of the graph, and do not overly penalize a model if a few edges of a graph are incorrect.
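The unlabeled and labeled arc metrics can be sketched as set comparisons over arcs. The arc format `(head, dependent, label)` with `-1` as an artificial root, and the function name `arc_f1`, are illustrative assumptions; the toy arcs are invented.

```python
def arc_f1(gold, pred, labeled=True):
    """Harmonic mean of arc precision and recall.

    gold, pred: sets of (head, dependent, label) arcs.
    With labeled=False the labels are dropped before comparison.
    """
    if not labeled:
        gold = {(head, dep) for head, dep, _ in gold}
        pred = {(head, dep) for head, dep, _ in pred}
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold_arcs = {(-1, 3, "exp:positive"), (3, 0, "target"), (0, 1, "target")}
pred_arcs = {
    (-1, 3, "exp:positive"),
    (3, 0, "target"),
    (0, 1, "holder"),        # right arc, wrong label
    (3, 4, "exp:positive"),  # spurious arc
}
uf1 = arc_f1(gold_arcs, pred_arcs, labeled=False)
lf1 = arc_f1(gold_arcs, pred_arcs, labeled=True)
```

The mislabeled arc still counts for the unlabeled score but not for the labeled one, so the labeled score can never exceed the unlabeled score on the same prediction.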

Sentiment graph metrics
The two metrics that measure how well a model is able to capture the full sentiment graph (see Figure 1) are Non-polar Sentiment Graph F1 (NSF1) and Sentiment Graph F1 (SF1). For NSF1, each sentiment graph is a tuple of (holder, target, expression), while for SF1 we include polarity (holder, target, expression, polarity). A true positive is defined as an exact match at graph-level, weighting the overlap in predicted and gold spans for each element, averaged across all three spans. For precision we weight the number of correctly predicted tokens divided by the total number of predicted tokens (for recall, we divide instead by the number of gold tokens). We allow for empty holders and targets.
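One possible reading of the weighted graph-level match is sketched below. It rests on several assumptions that the text does not spell out: spans are represented as sets of token indices, a predicted tuple matches a gold tuple only if all three elements overlap (two empty spans count as a full overlap), each tuple is scored against its best-matching counterpart, and the function names are hypothetical. It should be read as an illustration of the idea, not as the official scorer.

```python
ELEMENTS = ("holder", "target", "expression")


def span_overlap(span, other):
    """Share of tokens in `span` also found in `other`; empty vs. empty = 1."""
    if not span and not other:
        return 1.0
    if not span or not other:
        return 0.0
    return len(span & other) / len(span)


def weighted_match(candidate, references, use_polarity):
    """Best average overlap of `candidate` with any matching reference tuple."""
    best = 0.0
    for ref in references:
        if use_polarity and ref["polarity"] != candidate["polarity"]:
            continue
        overlaps = [span_overlap(candidate[k], ref[k]) for k in ELEMENTS]
        if all(o > 0 for o in overlaps):  # every element must overlap
            best = max(best, sum(overlaps) / len(ELEMENTS))
    return best


def sentiment_graph_f1(gold, pred, use_polarity=True):
    """SF1 when use_polarity is True, NSF1 otherwise (under the assumptions above)."""
    # Precision divides by predicted tokens, recall by gold tokens.
    precision = sum(weighted_match(p, gold, use_polarity) for p in pred) / len(pred) if pred else 0.0
    recall = sum(weighted_match(g, pred, use_polarity) for g in gold) / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold_graphs = [{"holder": set(), "target": {0, 1}, "expression": {3, 4}, "polarity": "positive"}]
pred_graphs = [{"holder": set(), "target": {1}, "expression": {3, 4}, "polarity": "positive"}]
wrong_polarity = [dict(pred_graphs[0], polarity="negative")]
```

In the toy example the predicted target covers only one of two gold target tokens, so precision is 1 while recall is (1 + 0.5 + 1) / 3, and a polarity error zeroes the polar score while leaving the non-polar score unchanged.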

Experiments
All sentiment graph models use token-level mBERT representations in addition to word2vec skip-gram embeddings openly available from the NLPL vector repository (Fares et al., 2017). We train all models for 100 epochs and keep the model that performs best regarding LF1 on the dev set (Targeted F1 for the baselines). We use default hyperparameters from Kurtz et al. (2020) (see Appendix), run all of our models five times with different random seeds, and report the mean (the standard deviation is shown in Table 8 in the Appendix). We test for statistically significant differences between the best and second best models through a bootstrap with replacement test (Berg-Kirkpatrick et al., 2012). As there are 5 runs, we require that 3 of 5 be statistically significant at p < 0.05. Table 3 shows the results for all datasets.
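The significance procedure can be sketched as a paired bootstrap over test sentences followed by the 3-of-5 rule. This is a simplified illustration under stated assumptions: it compares mean per-sentence scores rather than corpus-level F1, it uses the common criterion of counting resamples whose difference exceeds twice the observed difference, and the function names, sample count, and seed are hypothetical.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_samples=1000, seed=0):
    """Approximate p-value for 'system A is better than system B'.

    scores_a, scores_b: per-sentence scores on the same test sentences.
    """
    rng = random.Random(seed)
    n = len(scores_a)
    observed = (sum(scores_a) - sum(scores_b)) / n
    if observed <= 0:
        return 1.0  # A is not better on the test set; nothing to test
    exceed = 0
    for _ in range(n_samples):
        # Resample sentence indices with replacement, keeping pairs aligned.
        idx = [rng.randrange(n) for _ in range(n)]
        delta = sum(scores_a[i] - scores_b[i] for i in idx) / n
        if delta > 2 * observed:
            exceed += 1
    return exceed / n_samples


def majority_significant(p_values, alpha=0.05, needed=3):
    """Require that at least `needed` runs are significant at level alpha."""
    return sum(p < alpha for p in p_values) >= needed


p_clear = paired_bootstrap([1.0] * 50, [0.0] * 50)
p_tied = paired_bootstrap([0.5] * 50, [0.5] * 50)
```

With five runs per model, `majority_significant` would be applied to the five p-values obtained from comparing the best and second best systems run by run.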
On NoReC Fine, the baselines IMN, RACL, and RACL-BERT perform well at extracting targets (35.9, 45.6, and 47.2 F1, respectively) and expressions (48.7/55.4/56.3), but struggle with the full targeted sentiment task (18.0/20.1/30.3). The graph-based models extract targets better (50.1/54.8) and have comparable scores for expressions (54.4/55.5). The holder extraction scores have a similar range (51.1/60.4). These patterns hold throughout the other datasets, where the proposed graph models nearly always perform best on extracting spans, although RACL-BERT achieves the best score on extracting targets on DS Unis (44.6 vs. 42.1). The graph models also outperform the strongest baseline (RACL-BERT) on targeted sentiment on all 5 datasets, although this difference is often not statistically significant (NoReC Fine Head-first, MultiB EU Head-final) and RACL-BERT is better than Head-first on DS Unis.
Regarding the graph metrics, the results depend highly on the dataset. On average, IMN is the weakest baseline, followed by RACL and then RACL-BERT. The main improvement that RACL-BERT gives over RACL on these datasets is seen in the Targeted metric, i.e., the contextualized representations improve the polarity classification more than the extraction task. The proposed graph-based models are consistently the best models across the metrics and datasets.
Regarding graph representations, the differences between Head-first and Head-final are generally quite small. Head-first performs better on MultiB CA and slightly better on MultiB EU, while for the others (NoReC Fine, MPQA, and DS Unis) Head-final is better. This suggests that the main benefit is the joint prediction of all spans and relationships, and that the specific graph representation matters less.

Analysis
In this section we perform a deeper analysis of the models in order to answer the research questions.

Do syntactically informed sentiment graphs improve results?
Our two baseline graph representations, Head-first and Head-final, are crude approximations of linguistic structure. In syntactic and semantic dependency graphs, heads are often neither the first nor the last word, but rather the most salient word according to various linguistic criteria. First, we enrich the dependency labels to distinguish edges that are internal to a holder/target/expression span from those that are external and perform experiments by adding an 'in label' to non-head nodes within the graph, which we call +inlabel. We further inform the head selection of the parsing graphs with syntactic information in the Dep. edges parsing graphs, where we compute the dependency graph for each sentence and set the head of each span to be the node that has an outgoing edge in the corresponding syntactic graph. As there can be more than one such edge, we default to the first. A manual inspection showed that this approach sometimes set unlikely dependency label types as heads, e.g., punct, obl. Therefore, we suggest a final approach, Dep. labels, which filters out these unlikely heads. The full results are shown in Table 8 in the Appendix. The implementation of the graph structure has a large effect on all metrics, although the specific results depend on the dataset. We plot the average effect of each implementation across all datasets in Figure 3, as well as each individual dataset (Figures 4-8 in the Appendix). +inlabel tends to improve results on the non-English datasets, consistently increasing target and expression extraction and targeted sentiment. It also generally improves the graph scores UF1 and LF1 on the non-English datasets.
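The syntactic head selection behind Dep. edges and Dep. labels can be sketched as follows. This is one possible reading, stated as an assumption: a token is a head candidate if its syntactic head lies outside the span, the first candidate wins, and the Dep. labels variant skips candidates with unlikely relation types. The function name, the fallback to the first token, and the toy parse are hypothetical.

```python
def syntactic_head(span, heads, deprels=None, skip=("punct", "obl")):
    """Choose the head token of a span from a syntactic dependency parse.

    span:    (start, end) token offsets, end exclusive
    heads:   heads[i] is the index of token i's syntactic head (-1 = root)
    deprels: optional relation label per token; if given, tokens whose
             label is in `skip` are not eligible (Dep. labels variant)
    """
    start, end = span
    for tok in range(start, end):
        attaches_outside = not (start <= heads[tok] < end)
        if attaches_outside and (deprels is None or deprels[tok] not in skip):
            return tok  # default to the first eligible token
    return start  # assumed fallback: behave like head-first


# Invented parse of "- great staff": both "-" and "great" attach to "staff".
toy_heads = [2, 2, -1]
toy_deprels = ["punct", "amod", "root"]

edges_head = syntactic_head((0, 2), toy_heads)                # Dep. edges
labels_head = syntactic_head((0, 2), toy_heads, toy_deprels)  # Dep. labels
```

In the toy span the unfiltered variant picks the punctuation token, which is the kind of unlikely head that the label filter is meant to avoid.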
Dep. edges has the strongest positive effect on NSF1 and SF1 (an average gain of 2.52 and 2.22 percentage points (pp) over Head-final, respectively). However, this average is pulled down by poorer performance on the English datasets. Removing these two, the average benefit is 5.2 and 4.2 pp for NSF1 and SF1, respectively. On span extraction and targeted sentiment, however, Dep. edges leads to poorer scores overall. Dep. labels does not lead to any consistent improvements. These results indicate that incorporating syntactic dependency information is particularly helpful for the full structured sentiment task, but that these benefits do not always show at a more local level, i.e., span extraction.

Do graph models perform better on sentences with multiple targets?
We hypothesize that predicting the full sentiment graph may have a larger effect on sentences with multiple targets. Therefore, we create a subset of the test data containing sentences with multiple targets and reevaluate Head-first, Head-final, and RACL-BERT on the target extraction task. Table 4 shows the number of sentences with multiple targets and the Target span extraction score for each model. On this subset, Head-first and Head-final outperform RACL-BERT on 9 of 10 experiments, confirming the hypothesis that the graph models improve on examples with multiple targets.

How much does mBERT contribute?
We also perform experiments without mBERT (shown in Table 7 in the Appendix) and compute the average gains (over all 6 graph setups) of including it. The gains are largest for the English datasets (MPQA, DS Unis), followed by NoReC Fine, and finally MultiB CA and MultiB EU. This corroborates the bias towards English and similar languages that has been found in multilingual language models (Artetxe et al., 2020; Conneau et al., 2020) and motivates the need for language-specific contextualized embeddings.

Analysis of polarity predictions
In this section we zoom in on polarity, in order to quantify how well models perform at predicting only polarity. As the polarity annotations are bound to the expressions, we consider true positives to be any expression that overlaps the gold expression and has the same polarity. Table 6 shows that the polarity predictions are best on MultiB EU and MultiB CA, followed by NoReC Fine and DS Unis, and finally MPQA. This is likely due to the number of domains and characteristics of the data. NoReC Fine contains many domains and has longer expressions, while MPQA contains many highly ambiguous polar expressions, e.g., 'said', 'asked', which have different polarity depending on the context.

Conclusion
In this paper, we have proposed a dependency graph parsing approach to structured sentiment analysis and shown that these models outperform state-of-the-art sequence labeling models on five benchmark datasets. Using parse trees as input has shown promise for sentiment analysis in the past, either to guide a tree-based algorithm (Socher et al., 2013; Tai et al., 2015) or to create features for sentiment models (Nakagawa et al., 2010; Almeida et al., 2015). However, to the authors' knowledge, this is the first attempt to directly predict dependency-based sentiment graphs.
In the future, we would like to better exploit the similarities between dependency parsing and sentiment graph parsing, either by augmenting the token-level representations with contextualized vectors from their heads in a dependency tree (Kurtz et al., 2020) or by multi-task learning to dependency parse. We would also like to explore different graph parsing approaches, e.g., PERIN (Samuel and Straka, 2020