Learning Span-Level Interactions for Aspect Sentiment Triplet Extraction

Aspect Sentiment Triplet Extraction (ASTE) is the most recent subtask of ABSA which outputs triplets of an aspect target, its associated sentiment, and the corresponding opinion term. Recent models perform the triplet extraction in an end-to-end manner but heavily rely on the interactions between each target word and opinion word. As a result, they cannot perform well on targets and opinions that contain multiple words. Our proposed span-level approach explicitly considers the interaction between the whole spans of targets and opinions when predicting their sentiment relation. Thus, it can make predictions with the semantics of whole spans, ensuring better sentiment consistency. To ease the high computational cost caused by span enumeration, we propose a dual-channel span pruning strategy by incorporating supervision from the Aspect Term Extraction (ATE) and Opinion Term Extraction (OTE) tasks. This strategy not only improves computational efficiency but also distinguishes opinion and target spans more properly. Our framework simultaneously achieves strong performance for the ASTE task as well as the ATE and OTE tasks. In particular, our analysis shows that our span-level approach achieves more significant improvements over the baselines on triplets with multi-word targets or opinions.


Introduction
Aspect-Based Sentiment Analysis (ABSA) (Liu, 2012; Pontiki et al., 2014) is an aggregation of several fine-grained sentiment analysis tasks, and its various subtasks are designed with the aspect target as the fundamental item.

Figure 1: An example of ASTE. The spans highlighted in orange are target terms, and the span in blue is the opinion term. The "-" on top of the target terms indicates negative sentiment.

For the example in Figure 1, the aspect targets are "Windows 8" and "touchscreen functions". Aspect Sentiment Classification (ASC) (Dong et al., 2014; Zhang et al., 2016; Li et al., 2018a; Tang et al., 2019) is one of the most well-explored subtasks of ABSA and aims to predict the sentiment polarity of a given aspect target. However, it is not always practical to assume that the aspect target is provided. Aspect Term Extraction (ATE) (Yin et al., 2016; Li et al., 2018b; Ma et al., 2019) focuses on extracting aspect targets, while Opinion Term Extraction (OTE) (Yang and Cardie, 2012; Klinger and Cimiano, 2013; Yang and Cardie, 2013) aims to extract the opinion terms which largely determine the sentiment polarity of the sentence or the corresponding target term. Aspect Sentiment Triplet Extraction (ASTE) (Peng et al., 2019) is the most recently proposed subtask of ABSA, which forms a more complete picture of the sentiment information through the triplet of an aspect target term, the corresponding opinion term, and the expressed sentiment. For the example in Figure 1, there are two triplets: ("Windows 8", "not enjoy", Negative) and ("touchscreen functions", "not enjoy", Negative). The initial approach to ASTE (Peng et al., 2019) was a two-stage pipeline. The first stage extracts target terms and their sentiments via a joint labeling scheme (e.g., a tag marking the beginning of a target span with positive sentiment polarity), as well as the opinion terms with standard BIOES tags (a common tagging scheme for sequence labeling, denoting "begin, inside, outside, end, and single" respectively). The second stage then couples the extracted target and opinion terms to determine their paired sentiment relation.
We know that in ABSA, the aspect sentiment is mostly determined by the opinion terms expressed on the aspect target (Qiu et al., 2011;Yang and Cardie, 2012). However, this pipeline approach breaks the interaction within the triplet structure. Moreover, pipeline approaches usually suffer from the error propagation problem.
Recent end-to-end approaches (Wu et al., 2020; Xu et al., 2020b; Zhang et al., 2020) can jointly extract the target and opinion terms and classify their sentiment relation. One drawback is that they heavily rely on word-to-word interactions to predict the sentiment relation for the target-opinion pair. Note that it is common for aspect targets and opinions to contain multiple words, which account for roughly one-third of triplets in the benchmark datasets. However, the previous methods (Wu et al., 2020; Zhang et al., 2020) predict the sentiment polarity for each word-word pair independently, which cannot guarantee their sentiment consistency when forming a triplet. As a result, this prediction limitation on triplets that contain multi-word targets or opinions inevitably hurts the overall ASTE performance. For the example in Figure 1, by only considering the word-to-word interactions, it is easy to wrongly predict that "enjoy" expresses a positive sentiment on "Windows". Xu et al. (2020b) proposed a position-aware tagging scheme to allow the model to couple each word in a target span with all possible opinion spans, i.e., aspect word to opinion span interactions (or vice versa, aspect span to opinion word interactions). However, it still cannot directly model the span-to-span interactions between the whole target spans and opinion spans.
In this paper, we propose a span-based model for ASTE (Span-ASTE), which for the first time directly captures the span-to-span interactions when predicting the sentiment relation of an aspect target and opinion pair. It can also handle single-word aspects or opinions properly. Our model explicitly generates span representations for all possible target and opinion spans, and their paired sentiment relation is independently predicted for all possible target and opinion pairs. Span-based methods have shown encouraging performance on other tasks, such as coreference resolution (Lee et al., 2017), semantic role labeling (He et al., 2018a), and relation extraction. However, they cannot be directly applied to the ASTE task due to different task-specific characteristics.
Our contributions can be summarized as follows: • We tailor a span-level approach to explicitly consider the span-to-span interactions for the ASTE task and conduct extensive analysis to demonstrate its effectiveness. Our approach significantly improves performance, especially on triplets which contain multi-word targets or opinions.
• We propose a dual-channel span pruning strategy by incorporating explicit supervision from the ATE and OTE tasks to ease the high computational cost caused by span enumeration and maximize the chances of pairing valid target and opinion candidates together.
• Our proposed Span-ASTE model outperforms the previous methods significantly not only for the ASTE task, but also for the ATE and OTE tasks on four benchmark datasets with both BiLSTM and BERT encoders.

Task Formulation
Let X = {x 1 , x 2 , ..., x n } denote a sentence of n tokens, and let S = {s 1,1 , s 1,2 , ..., s i,j , ..., s n,n } be the set of all possible enumerated spans in X, with i and j indicating the start and end positions of a span in the sentence. We limit the span length such that 0 ≤ j − i ≤ L. The objective of the ASTE task is to extract all sentiment triplets in X, where each triplet is defined as (target, opinion, sentiment) with sentiment ∈ {Positive, Negative, Neutral}.
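To make the enumeration concrete, here is a minimal sketch (the helper name is ours, not from the paper) that lists every span (i, j) subject to the length constraint 0 ≤ j − i ≤ L:

```python
def enumerate_spans(n, max_len):
    """Enumerate all spans (i, j) over a sentence of n tokens,
    keeping only those with 0 <= j - i <= max_len (inclusive indices)."""
    return [(i, j) for i in range(n) for j in range(i, min(i + max_len + 1, n))]

# A 4-token sentence with max_len = 1 yields 7 spans:
# (0,0), (0,1), (1,1), (1,2), (2,2), (2,3), (3,3)
spans = enumerate_spans(4, 1)
```

Without the length limit, the number of spans grows quadratically in n, which motivates both the limit L and the pruning strategy described later.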

Model Architecture
As shown in Figure 2, Span-ASTE consists of three modules: sentence encoding, mention module, and triplet module. For the given example, the sentence is first input to the sentence encoding module to obtain the token-level representation, from which we derive the span-level representation for each enumerated span, such as "did not enjoy" and "Windows 8". We then adopt the ATE and OTE tasks to supervise our proposed dual-channel span pruning strategy, which obtains the pruned target and opinion candidates, such as "Windows 8" and "not enjoy" respectively. Finally, each target candidate and opinion candidate are coupled to determine the sentiment relation between them.

Sentence Encoding
We explore two encoding methods to obtain the contextualized representation for each word in a sentence: BiLSTM and BERT.
BiLSTM We first obtain the word representations {e 1 , e 2 , ..., e i , ..., e n } from the 300-dimensional pre-trained GloVe (Pennington et al., 2014) embeddings, which are then contextualized by a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) layer. The i-th token is represented as:

h i = [→h i ; ←h i ] (1)

where →h i and ←h i are the hidden states of the forward and backward LSTMs respectively.
BERT An alternative encoding method is to use a pre-trained language model such as BERT (Devlin et al., 2019) to obtain the contextualized word representations x = [x 1 , x 2 , ..., x n ]. For words that are tokenized as multiple word pieces, we use mean pooling to aggregate their representations .

Span Representation
We define each span representation s i,j ∈ S as:

s i,j = [h i ; h j ; f width (i, j)] (2)

where f width (i, j) produces a trainable feature embedding representing the span width (i.e., j − i + 1).
Besides the concatenation of the start token, end token, and width representations, the span representation s i,j can also be formed by max-pooling or mean-pooling across all token representations of the span from position i to j. The experimental results can be found in the ablation study.
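The three variants can be sketched in plain Python, with lists standing in for token vectors (helper names are ours; this is an illustration of the alternatives, not the paper's implementation):

```python
def concat_repr(h, i, j, width_emb):
    """Boundary-token concatenation: [h_i; h_j; f_width(i, j)]."""
    return h[i] + h[j] + width_emb

def max_pool_repr(h, i, j):
    """Element-wise max over token vectors h_i .. h_j."""
    return [max(col) for col in zip(*h[i:j + 1])]

def mean_pool_repr(h, i, j):
    """Element-wise mean over token vectors h_i .. h_j."""
    return [sum(col) / (j - i + 1) for col in zip(*h[i:j + 1])]
```

Note that only the concatenation variant preserves the boundary tokens exactly, which may explain its advantage in the ablation study.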

Mention Module

ATE & OTE Tasks
We employ the ABSA subtasks of ATE and OTE to guide our dual-channel span pruning strategy through the scores of the predicted opinion and target spans. Note that the target terms and opinion terms are not yet paired together at this stage. The mention module takes the representation of each enumerated span s i,j as input and predicts the mention type m ∈ {Target, Opinion, Invalid}:

P (m | s i,j ) = softmax(FFNN m (s i,j )) (3)

where FFNN denotes a feed-forward neural network with non-linear activation.
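As an illustration of the mention classifier's output layer, the following is a numerically stable softmax over the three mention types (a toy sketch on raw logits; the real module operates on learned span representations):

```python
import math

MENTION_TYPES = ("Target", "Opinion", "Invalid")

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mention_probs(logits):
    """Map FFNN logits for (Target, Opinion, Invalid) to probabilities."""
    return dict(zip(MENTION_TYPES, softmax(logits)))
```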
Pruned Target and Opinion For a sentence X of length n, the number of enumerated spans is O(n 2 ), while the number of possible pairs between all opinion and target candidate spans is O(n 4 ) at the later stage (i.e., the triplet module). As such, it is not computationally practical to consider all possible pairwise interactions in a span-based approach. Previous span-based works employ a pruning strategy to reduce the number of spans, but they only prune the spans into a single pool which mixes different mention types. This strategy does not fully consider the structure of an aspect sentiment triplet, as it does not recognize the fundamental difference between a target and an opinion term. Hence, we propose a dual-channel pruning strategy which results in two separate pruned pools of targets and opinions. This minimizes computational cost while maximizing the chance of pairing valid opinion and target spans together. The opinion and target candidates are selected based on the mention-type scores for each span from Equation 3:

Φ target (s i,j ) = P (m = Target | s i,j )
Φ opinion (s i,j ) = P (m = Opinion | s i,j ) (4)

We use the mention scores Φ target and Φ opinion to select the top candidates from the enumerated spans, and obtain the target candidate pool S t = {..., s t a,b , ...} and the opinion candidate pool S o = {..., s o c,d , ...}. To consider a proportionate number of candidates for each sentence, the number of selected spans for both the pruned target and opinion candidates is nz, where n is the sentence length and z is a threshold hyper-parameter. Note that although the pruning operation prevents gradients from flowing back to the FFNN in the mention module, the FFNN already receives supervision from the ATE and OTE tasks. Hence, our model can be trained end-to-end without any issue or instability.

Table 1: Statistics of datasets. #S denotes the number of sentences. #+, #0, and #- denote the numbers of positive, neutral, and negative sentiment triplets respectively. #SW denotes the number of triplets where both target and opinion terms are single-word spans. #MW denotes the number of triplets where at least one of the target or opinion terms is a multi-word span.
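The dual-channel selection can be sketched as follows (our own sketch; we assume ⌈nz⌉ candidates per channel, as the text states nz without specifying rounding):

```python
import math

def prune_dual_channel(spans, target_scores, opinion_scores, n, z):
    """Keep the top ceil(n * z) spans per channel, ranked by mention score.
    Returns (target_candidates, opinion_candidates), each in sentence order."""
    k = math.ceil(n * z)
    top_t = sorted(range(len(spans)), key=lambda idx: -target_scores[idx])[:k]
    top_o = sorted(range(len(spans)), key=lambda idx: -opinion_scores[idx])[:k]
    return ([spans[i] for i in sorted(top_t)],
            [spans[i] for i in sorted(top_o)])
```

Because each channel ranks by its own score, a span with a high target score but a low opinion score survives in the target pool only, which is exactly the behavior a single mixed pool cannot guarantee.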

Triplet Module

Target-Opinion Pair Representation
We obtain the target-opinion pair representation by coupling each target candidate representation s t a,b ∈ S t with each opinion candidate representation s o c,d ∈ S o :

g s t a,b ,s o c,d = [s t a,b ; s o c,d ; f distance (a, b, c, d)] (5)

where f distance (a, b, c, d) produces a trainable feature embedding based on the distance (i.e., min(|b − c|, |a − d|)) between the target and opinion spans, following (Lee et al., 2017; He et al., 2018a; Xu et al., 2020b).
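The span-pair distance fed to f_distance can be computed as follows (a small sketch; the function name is ours):

```python
def span_pair_distance(a, b, c, d):
    """Distance between target span (a, b) and opinion span (c, d):
    min(|b - c|, |a - d|), i.e., the smaller of the two boundary gaps."""
    return min(abs(b - c), abs(a - d))
```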

Sentiment Relation Classifier
Then, we input the span pair representation g s t a,b ,s o c,d into a feed-forward neural network to determine the probability of the sentiment relation r ∈ R = {Positive, Negative, Neutral, Invalid} between the target s t a,b and the opinion s o c,d :

P (r | s t a,b , s o c,d ) = softmax(FFNN r (g s t a,b ,s o c,d )) (6)

Invalid here indicates that the target and opinion pair has no valid sentiment relationship.

Training
The training objective is defined as the sum of the negative log-likelihoods from both the mention module and the triplet module:

L = − Σ s i,j ∈S log P (m* i,j | s i,j ) − Σ (s t a,b ,s o c,d )∈S t ×S o log P (r* | s t a,b , s o c,d ) (7)

where m* i,j is the gold mention type of the span s i,j , and r* is the gold sentiment relation of the target and opinion span pair (s t a,b , s o c,d ). S indicates the enumerated span pool; S t and S o are the pruned target and opinion span candidates.
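Numerically, the objective simply sums the negative log-likelihood of the gold labels across both modules; a toy sketch with plain probabilities (helper names are ours) makes this explicit:

```python
import math

def nll(gold_probs):
    """Negative log-likelihood of a list of gold-label probabilities."""
    return -sum(math.log(p) for p in gold_probs)

def total_loss(mention_gold_probs, relation_gold_probs):
    """Sum of the mention-module and triplet-module losses."""
    return nll(mention_gold_probs) + nll(relation_gold_probs)
```

A perfect model (all gold probabilities equal to 1) incurs zero loss, and the loss grows as either module assigns lower probability to the gold labels.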

Datasets
Our proposed Span-ASTE model is evaluated on four ASTE datasets released by Xu et al. (2020b), which include three datasets in the restaurant domain and one dataset in the laptop domain. The first version of the ASTE datasets was released by Peng et al. (2019). However, it was found that not all triplets were explicitly annotated (Xu et al., 2020b; Wu et al., 2020). Xu et al. (2020b) refined the datasets with the missing triplets and removed triplets with conflicting sentiments. Note that these four benchmark datasets are derived from the SemEval Challenges (Pontiki et al., 2014, 2015, 2016), and the opinion terms are retrieved from Fan et al. (2019). Table 1 shows the detailed statistics.

Experiment Settings
When using the BiLSTM encoder, the pre-trained GloVe word embeddings are trainable. The hidden size of the BiLSTM encoder is 300 and the dropout rate is 0.5. In the second setting, we fine-tune the pre-trained BERT (Devlin et al., 2019) model to encode each sentence. Specifically, we use the uncased version of BERT base . The model is trained for 10 epochs with a linear warmup for 10% of the training steps, followed by a linear decay of the learning rate to 0. We employ AdamW as the optimizer with a maximum learning rate of 5e-5 for the transformer weights and a weight decay of 1e-2. For the other parameter groups, we use a learning rate of 1e-3 with no weight decay. The maximum span length L is set to 8. The span pruning threshold z is set to 0.5. We select the best model weights based on the F 1 scores on the development set, and the reported results are the average of 5 runs with different random seeds.

Baselines
The baselines can be summarized as two groups: pipeline methods and end-to-end methods.
Pipeline The pipeline approaches listed below were modified by Peng et al. (2019) to extract the aspect terms together with their associated sentiments via a joint labeling scheme, and the opinion terms with BIOES tags, at the first stage. At the second stage, the extracted targets and opinions are paired to determine whether they form a valid triplet. Note that these approaches employ different methods to obtain the features for the first stage. CMLA+ employs an attention mechanism to consider the interaction between aspect terms and opinion terms. RINANTE+ (Dai and Song, 2019) adopts a BiLSTM-CRF model with mined rules to capture dependency relations. Li-unified-R (Li et al., 2019) uses a unified tagging scheme to jointly extract the aspect term and associated sentiment. Peng et al. (2019) include dependency relation information when considering the interaction between the aspect and opinion terms.
End-to-end The end-to-end methods aim to jointly extract full triplets in a single stage. Previous work by Zhang et al. (2020) and Wu et al. (2020) independently predicts the sentiment relation for all possible word-word pairs, and hence requires decoding heuristics to determine the overall sentiment polarity of a triplet. JET (Xu et al., 2020b) models the ASTE task as a structured prediction problem with a position-aware tagging scheme to capture the interaction of the three elements in a triplet.

Main Results

Table 2 compares Span-ASTE with previous models in terms of Precision (P.), Recall (R.), and F 1 scores on the four datasets. Under the F 1 metric, our model consistently outperforms previous works with both the BiLSTM and BERT sentence encoders. In most cases, our model also significantly outperforms the other end-to-end methods in both precision and recall. We also observe that the strong pipeline methods such as Peng et al. (2019) achieve competitive recall, but their overall performance is much worse due to low precision. Specifically, using the BiLSTM encoder with GloVe embeddings, our model outperforms the best pipeline model (Peng et al., 2019) by 15.62, 8.93, 5.24, and 10.16 F 1 points on the four datasets. This result indicates that our end-to-end approach effectively encodes the interaction between target and opinion spans and also alleviates error propagation. In general, the other end-to-end methods are also more competitive than the pipeline methods. However, due to the limitation of relying on word-level interactions, their performance is less encouraging in a few cases, such as on Lap 14 and Rest 15. With the BERT encoder, all three end-to-end models achieve much stronger performance than their LSTM-based versions, which is consistent with previous findings (Devlin et al., 2019). Our approach outperforms the previous best results of GTS (Wu et al., 2020) by 4.35, 5.02, 3.12, and 2.33 F 1 points on the four datasets.

Additional Experiments
As mentioned in Section 2.2.2, we employ the ABSA subtasks of ATE and OTE to guide our span pruning strategy. To examine whether Span-ASTE can effectively extract target spans and opinion spans, we also evaluate our model on the ATE and OTE tasks on the four datasets. Table 3 compares our approach with the previous method GTS (Wu et al., 2020). 5 Without additional retraining or tuning, our model can directly address the ATE and OTE tasks, with significant improvements over GTS in terms of F 1 scores on both tasks. Even though GTS shows a better recall score on the Rest 16 dataset, its low precision results in worse F 1 performance. The better overall performance indicates that our span-level method not only benefits sentiment triplet extraction, but also improves the extraction of target and opinion terms by considering the semantics of each whole span rather than relying on the decoding heuristics of tagging-based methods.
5 See Appendix for the target and opinion data statistics. Note that the JET model (Xu et al., 2020b) is not able to directly solve the ATE and OTE tasks unless the evaluation is conducted based on the triplet predictions; we include such comparisons in the Appendix. We also compare with RACL (Chen and Qian, 2020) in the Appendix. RACL is the current state-of-the-art method for both tasks. However, its framework does not consider the pairing relation between each target and opinion, so it is difficult to make a completely fair comparison.

Comparison of Single-word and Multi-word Spans
We compare the performance of Span-ASTE with the previous model GTS (Wu et al., 2020) under the following two settings in Table 4. Single-Word: both the target and opinion terms in a triplet are single-word spans. Multi-Word: at least one of the target or opinion terms in a triplet is a multi-word span. For the single-word setting, our method shows consistent improvements in both precision and recall on the four datasets, which results in improved F 1 scores. For multi-word triplets, our model achieves even larger F 1 improvements. Compared to precision, our recall shows greater improvement over the GTS approach. GTS heavily relies on word-pair interactions to extract triplets, while our method explicitly considers the span-to-span interactions. Our span enumeration also naturally benefits the recall of multi-word spans. For both GTS and our model, multi-word triplets pose challenges: their F 1 results drop by more than 10 points, and by more than 20 points on Rest 14. As shown in Table 1, multi-word triplets are common, accounting for one-third to half of each dataset. Therefore, a promising direction for future work is to further improve performance on such difficult triplets.
To identify further areas for improvement, we analyze the results for the ASTE task based on whether each sentiment triplet contains a multi-word target or multi-word opinion term. The results in Table 5 show that performance is lower when the triplet contains a multi-word opinion term. This trend can be attributed to the imbalanced distribution of triplets containing multi-word target or opinion terms.

Pruning Efficiency
To demonstrate the efficiency of the proposed dual-channel pruning strategy, we compare it to a simpler strategy, denoted Single-Channel (SC), which does not distinguish between opinion and target candidates. Figure 3 shows the comparisons. Note that the mention module under this strategy does not explicitly solve the ATE and OTE tasks, as it only predicts the mention label m ∈ {Valid, Invalid}, where Valid means the span is either a target or an opinion span and Invalid means it is neither. Given sentence length n and pruning threshold z, the number of candidates is limited to nz, and hence the computational cost scales with the number of pairwise interactions, n 2 z 2 . The dual-channel strategy considers each target-opinion pair where the pruned target and opinion candidate pools both have nz spans. Note that it is possible for the two pools to share some candidates. In comparison, the single-channel strategy considers each target-opinion pair where the target and opinion candidates are drawn from the same single pool of nz spans. In order to consider at least as many target and opinion candidates as the dual-channel strategy, the single-channel strategy has to double the threshold z, which leads to 4 times as many pairs and correspondingly higher computational cost. We denote this setting in Figure 3 as SC-Adjusted. When controlling for computational efficiency, there is a significant F 1 gap between Dual-Channel and Single-Channel, especially for lower values of z. Although the gap narrows as z increases, high values of z are not computationally practical. Based on our experimental results, we select the dual-channel pruning strategy with z = 0.5 for the reported model.
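The 4x cost of SC-Adjusted follows directly from the pair counts (a small sketch; we assume ⌈·⌉ rounding of nz, as the text does not specify it):

```python
import math

def pairs_dual(n, z):
    """Dual-channel: two pools of ceil(n * z) spans each -> k * k pairs."""
    k = math.ceil(n * z)
    return k * k

def pairs_sc_adjusted(n, z):
    """Single-channel with doubled threshold: one pool of ceil(2 * n * z) spans."""
    k = math.ceil(2 * n * z)
    return k * k
```

For a 20-token sentence with z = 0.5, the dual-channel strategy scores 10 x 10 = 100 pairs, while SC-Adjusted scores 20 x 20 = 400.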

Qualitative Analysis
To illustrate the differences between the models, we present sample sentences from the ASTE test set with the gold labels as well as predictions from GTS (Wu et al., 2020) and Span-ASTE in Figure 4. In the first example, GTS correctly extracts the target term "Windows 8" paired with the opinion term "not enjoy", but the sentiment is incorrectly predicted as positive. When forming the triplet, its decoding heuristic considers the sentiment independently for each word-word pair: {("Windows", "not", Neutral), ("8", "not", Neutral), ("Windows", "enjoy", Positive), ("8", "enjoy", Positive)}. The heuristic votes the overall sentiment polarity as the most frequent label among the pairs. In the case of a tie (2 neutral and 2 positive), the heuristic has a predefined bias to assign the sentiment polarity as positive. Similarly, the word-level method fails to capture the negative sentiment expressed by "not enjoy" on the other target term "touchscreen functions". In the second example, it incompletely extracts the target term "Korean dishes", resulting in a wrong triplet. For both examples, our method accurately extracts the target-opinion pairs and determines the overall sentiment even when each term has multiple words.
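The voting heuristic described above can be sketched as follows (our reconstruction from the description, not GTS's actual code): the most frequent label among the word-pair predictions wins, and ties that include positive default to positive.

```python
from collections import Counter

def vote_sentiment(pair_labels):
    """Majority vote over word-pair sentiment labels; ties that include
    'positive' are resolved to 'positive', per the described bias."""
    counts = Counter(pair_labels)
    best = max(counts.values())
    tied = [label for label, c in counts.items() if c == best]
    if len(tied) > 1 and "positive" in tied:
        return "positive"
    return tied[0]
```

Applied to the four pairs in the first example (2 neutral, 2 positive), the vote resolves to positive, reproducing the error described above.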

Ablation Study
We conduct an ablation study to examine the performance of different modules and span representation methods, and the results are shown in Table 6. The average F 1 denotes the average dev results of Span-ASTE on the four benchmark datasets over 5 runs. Similar to the observation for coreference resolution (Lee et al., 2017), we find that the ASTE performance is reduced when removing the span width and distance embedding. This indicates that the positional information is still useful for the ASTE task as targets and opinions which are far apart or too long are less likely to form a valid span pair. As mentioned in Section 2.2.1, we explore two other methods (i.e., max pooling and mean pooling) to form span representations instead of concatenating the span boundary token representations. The negative results suggest that using pooling to aggregate the span representation is disadvantageous due to the loss of information that is useful for distinguishing valid and invalid spans.

Related Work
Sentiment Analysis is a major Natural Language Understanding (NLU) task and has been extensively studied as a classification problem at the sentence level (Raffel et al., 2020; Lan et al., 2020). Aspect-Based Sentiment Analysis (ABSA) (Pontiki et al., 2014) addresses various sentiment analysis tasks at a fine-grained level. As mentioned in Section 1, the subtasks mainly include ASC (Dong et al., 2014; Zhang et al., 2016; He et al., 2018b; Li et al., 2018a; Peng et al., 2018; Wang and Lu, 2018; Li and Lu, 2019; Xu et al., 2020a), ATE (Qiu et al., 2011; Yin et al., 2016; Li et al., 2018b; Ma et al., 2019), and OTE (Hu and Liu, 2004; Yang and Cardie, 2012; Klinger and Cimiano, 2013; Yang and Cardie, 2013). There is also another subtask named Target-oriented Opinion Words Extraction (TOWE) (Fan et al., 2019), which aims to extract the corresponding opinion words for a given target term. Another line of research focuses on addressing different subtasks together. Aspect and Opinion Term Co-Extraction (AOTE) aims to extract the aspect and opinion terms together (Ma et al., 2019; Dai and Song, 2019) and is often treated as a sequence labeling problem. Note that AOTE does not consider the paired sentiment relationship between each target and opinion term. End-to-End ABSA (Li and Lu, 2017) jointly extracts each aspect term and its associated sentiment in an end-to-end manner. A few other methods have recently been proposed to jointly solve three or more ABSA subtasks. Chen and Qian (2020) proposed a relation-aware collaborative learning framework to unify the three fundamental subtasks and achieved strong performance on each subtask as well as the combined task, while Wan et al. (2020) focused more on aspect category related subtasks, such as Aspect Category Extraction and Aspect Category and Target Joint Extraction.
ASTE (Peng et al., 2019; Wu et al., 2020; Xu et al., 2020b; Zhang et al., 2020) is the most recent development of ABSA; it aims to extract triplets of an aspect term, its associated sentiment, and the corresponding opinion term.

Conclusions
In this work, we propose a span-level approach, Span-ASTE, to learn the interactions between target spans and opinion spans for the ASTE task. It addresses the limitation of existing approaches that only consider word-to-word interactions. We also propose to use the ATE and OTE tasks as supervision for our dual-channel pruning strategy, which reduces the number of enumerated target and opinion candidates to increase computational efficiency while maximizing the chances of pairing valid target and opinion candidates together. Our method significantly outperforms previous methods on the ASTE task as well as the ATE and OTE tasks, and our analysis demonstrates the effectiveness of our approach. While we achieve strong overall performance on the ASTE task, much of the improvement can be attributed to multi-word triplets. As discussed in Section 4.1, there is still a significant performance gap between single-word and multi-word triplets, and this is a potential area for future work.

Table 9 shows the number of target terms and opinion terms on the four datasets. Table 10 shows the results of our model on the development datasets.

D Additional Comparisons
As mentioned in footnote 5 in Section 3.5, we cannot make a direct comparison with the JET model (Xu et al., 2020b), as it is not able to directly solve the ATE and OTE tasks unless the evaluation is conducted based on the triplet predictions. Table 7 shows such comparisons. Our proposed method generally outperforms the previous two end-to-end approaches on the four datasets. As mentioned in Table 3, it is challenging to make a fair comparison between the previous ABSA framework RACL (Chen and Qian, 2020), which addresses the ATE and OTE tasks while solving other ABSA subtasks, and our approach as well as GTS (Wu et al., 2020), because these approaches have different task settings. RACL considers the sentiment polarity of the target terms when solving the ATE and OTE tasks, while GTS and our method both consider the pairing relation between target and opinion terms. For reference, Table 8 shows the comparisons of the three methods on the ATE and OTE tasks on the datasets released by Xu et al. (2020b).