Semantic and Syntactic Enhanced Aspect Sentiment Triplet Extraction

Aspect Sentiment Triplet Extraction (ASTE) aims to extract triplets from sentences, where each triplet consists of an aspect (entity), its associated sentiment, and the opinion span explaining the reason for that sentiment. Most existing research addresses this problem in a multi-stage pipeline manner, which neglects the mutual information among the three elements and suffers from error propagation. In this paper, we propose a Semantic and Syntactic Enhanced aspect Sentiment triplet Extraction model (S3E2) to fully exploit the syntactic and semantic relationships between the triplet elements and jointly extract them. Specifically, we design a Graph-Sequence dual representation and modeling paradigm for the ASTE task: we represent the semantic and syntactic relationships between word pairs in a sentence as a graph and encode it with Graph Neural Networks (GNNs), while also modeling the original sentence with an LSTM to preserve sequential information. Under this setting, we further apply a more efficient inference strategy for triplet extraction. Extensive evaluations on four benchmark datasets show that S3E2 significantly outperforms existing approaches, demonstrating the superiority and flexibility of S3E2 in an end-to-end fashion.


Introduction
Aspect-based Sentiment Analysis (ABSA) usually requires extracting comment targets in a review and judging the corresponding sentiment polarities (Liu, 2012; Pontiki et al., 2014). This research field has received widespread attention (Zhang et al., 2015; Lu, 2017, 2019; Li et al., 2019a). In this paper, we concentrate on a more fine-grained task, Aspect Sentiment Triplet Extraction (ASTE), which aims to extract triplets, each including an aspect (e.g., an entity), the corresponding sentiment for that aspect, and the opinion span explaining the reason for the sentiment. An example is shown in Fig. 1. It contains two triplets, (Waiters, friendly, +) and (fruit salad, so so, 0), where we use +, -, and 0 to represent positive, negative, and neutral sentiment, respectively. Unlike the ABSA task, which extracts two tuples, (Waiters, +) and (fruit salad, 0), from this sentence, the triplets extracted by the ASTE task better reflect multiple emotional factors (aspect, opinion, sentiment) in user reviews and are more suitable for practical application scenarios.
The ASTE task is extremely challenging because it requires extracting all three elements in one shot. One naive solution is to split the ASTE task into two stages in a pipeline manner using a unified tagging schema. Such a pipeline approach lacks an effective mechanism to capture the relationships among the three elements and suffers from error propagation. Another solution is to use an end-to-end model to extract triplets (Wu et al., 2020). Yet, these methods focus on designing a new tagging schema to formalize ASTE as a unified task; they cannot effectively establish the connections between words and ignore the semantic and syntactic relationships among the three elements.
Besides, a sentence may contain one-to-many cases, where one aspect corresponds to multiple opinions, or one opinion corresponds to multiple aspects. For instance, in the sentence "We love the food, drinks, and atmosphere," the opinion "love" is associated with three aspects: "food", "drinks", and "atmosphere". This situation is quite common in practice and increases the difficulty of matching aspects with opinions. Nevertheless, current solutions either fail to capture these one-to-many relationships or ignore the semantic relationship between word pairs in a triplet (Wu et al., 2020).

Figure 1: An example of the ASTE task. The words in the solid and dashed boxes are aspects and opinions, respectively. The blue arrows above represent the correspondence between them. The black arrows below represent the dependencies between words.
Furthermore, various relationships exist among the triplet elements, such as syntactic dependencies and semantic word similarity, which have been neglected. For example, as shown in Figure 1, there is a nominal subject dependency (nsubj) between waiters and friendly, indicating the presence of an aspect. Also, the two opinions friendly and so so in the sentence are associated with each other through a conjunct dependency (conj), implying that they have similar attributes.
To fully utilize these implicit relationships, we design a Semantic and Syntactic Enhanced Aspect Sentiment Triplet Extraction model (S3E2). S3E2 utilizes semantic and syntactic information from words, which helps to distinguish words' attributes and identify the relationships between word pairs. To better leverage these relationships, we build a Graph Neural Network (GNN) based model to capture the interactions between words and triplet elements. For each sentence, we transform it into a unique text graph representation, where each node is a word, and edges are established based on attention to the words themselves, adjacency relationships, and syntactic dependencies. Such a concise and effective text graph captures the precise meaning of each word and gives insight into their relations.
Moreover, we further utilize an LSTM (Hochreiter and Schmidhuber, 1997) to learn the contextual semantics of each word from a sequential perspective, forming a Graph-Sequence dual modeling of a sentence. In this way, S3E2 has an excellent ability to distinguish the categories of words and to recognize the relationships between word pairs more accurately. With the semantic and syntactic enhanced module, the correlations between word pairs are well captured, yielding a simpler inference strategy for triplet extraction. Since S3E2 perceives the semantics and syntax of words well, we only need to infer once for all datasets to obtain more accurate triplets and save time. Finally, we parse the triplets from the final predictions.
We run extensive experiments on four benchmark datasets. The experimental results show that S3E2 achieves significantly better performance than existing state-of-the-art approaches by fully exploiting the syntactic and semantic relationships between word pairs.
To summarize, our main contributions include the following:

• We design a graph representation of a sentence that integrates the syntactic dependency, semantic relatedness, and positional relationships between words, and encode it with Graph Neural Networks to fully exploit these correlations.

• We further model the sentence with an LSTM to incorporate its sequential information, forming a Graph-Sequence dual modeling paradigm. Moreover, we only need to infer once for all datasets, demonstrating the superiority of S3E2.

• We conduct extensive experiments, and the results show that S3E2 significantly outperforms all state-of-the-art approaches for triplet extraction.

Our Approach
We design an effective framework to complete triplet extraction in an end-to-end fashion. The overall model architecture is shown in Figure 2. In this section, we first define the ASTE task, then describe the Grid Tagging Schema and how triplets are deconstructed from it in detail. We next present the S3E2 model, followed by our inference strategy.

Figure 2: The overall architecture of our end-to-end model S3E2. In our text graph, dashed edges are self-loops, black solid edges are neighbor edges, and red solid edges are dependency edges.

Task Definition and Preliminaries
Definition: Triplet Extraction. Given an input sentence x = {x_1, x_2, ..., x_n} with length n, each word has two tag labels: an aspect tag and an opinion tag. Their tagging schema is Y = {B, I, O}, denoting the beginning, inside, and outside of an aspect term or opinion term. Meanwhile, each aspect target is annotated with a sentiment polarity label from S = {NEU, POS, NEG}, denoting the neutral, positive, or negative sentiment expressed towards it. Our goal is to extract a set of triplets T = {(a, o, s)_m}_{m=1}^{|T|} from the sentence x, where a, o, and s stand for an aspect, an opinion, and the corresponding sentiment polarity, respectively. The notation (a, o, s)_m is a triplet in x, and |T| is the total number of triplets in the sentence.
Grid Tagging Schema. To tackle the ASTE task, a Grid Tagging Schema (GTS) was proposed by Wu et al. (2020), which adopts six tags G = {A, O, NEG, NEU, POS, N} to represent the relationship of any pair of words (w_i, w_j) in a sentence. The tags A and O denote that the word pair (w_i, w_j) belongs to the same aspect or the same opinion, respectively. The tags NEG, NEU, and POS denote the negative, neutral, or positive sentiment expressed in the triplet formed by the word pair (w_i, w_j) that contains exactly an aspect term and an opinion term. The tag N denotes none of the above relations for the word pair (w_i, w_j). A tagging example is shown in Figure 3. In detail, the three coordinates (5, 5), (6, 6), and (6, 5) in the grid form the word pairs (fruit, fruit), (salad, salad), and (fruit, salad), respectively, which are labeled A because they all belong to the same aspect. The same logic applies to opinions. The coordinate (2, 0) is labeled POS because it forms a correct triplet (Waiters, friendly, POS), which contains exactly the right aspect, opinion, and sentiment information. For simplicity, we use an upper triangular grid.
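To make the schema concrete, the labeling rules above can be sketched in a few lines of Python. The word indices and the grid-as-dictionary layout are our own assumptions for illustration, not the exact annotation format of the original implementation:

```python
# A toy rendering of the Grid Tagging Schema (GTS) for the running example.
# Assumed word indices:
#   0:Waiters 1:were 2:friendly 3:and 4:the 5:fruit 6:salad 7:was 8:so 9:so
def build_grid(n, aspects, opinions, triplets):
    """aspects / opinions: lists of (start, end) spans, inclusive.
    triplets: list of (aspect_span, opinion_span, sentiment_tag).
    Returns the upper-triangular grid as a dict keyed by (i, j), i <= j."""
    grid = {(i, j): "N" for i in range(n) for j in range(i, n)}
    for s, e in aspects:                 # word pairs inside one aspect -> A
        for i in range(s, e + 1):
            for j in range(i, e + 1):
                grid[(i, j)] = "A"
    for s, e in opinions:                # word pairs inside one opinion -> O
        for i in range(s, e + 1):
            for j in range(i, e + 1):
                grid[(i, j)] = "O"
    for (a0, a1), (o0, o1), senti in triplets:  # aspect-opinion cells -> sentiment
        for i in range(a0, a1 + 1):
            for j in range(o0, o1 + 1):
                grid[(min(i, j), max(i, j))] = senti
    return grid

grid = build_grid(10,
                  aspects=[(0, 0), (5, 6)],
                  opinions=[(2, 2), (8, 9)],
                  triplets=[((0, 0), (2, 2), "POS"), ((5, 6), (8, 9), "NEU")])
```

Under these assumptions, cell (0, 2) carries the POS tag of the (Waiters, friendly) triplet, and the cells inside the span fruit salad all carry A.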
Triplets Decoding. We explain how to decode triplets based on the predicted grid tags, adopting the decoding algorithm designed by Wu et al. (2020). First, both aspects and opinions are identified using the predicted tags of all word pairs (w_i, w_j) on the main diagonal, without considering constraints from other word pairs. A span of continuous A tags is regarded as a complete aspect, and a span of continuous O tags is detected as a complete opinion. At this point, we have extracted the aspect a and opinion o. Then, we count the predicted tags of all word pairs (w_i, w_j) with w_i ∈ a and w_j ∈ o. The most frequently predicted sentiment label s ∈ S is taken as the sentiment polarity of the triplet (a, o, s). If multiple sentiment labels tie for the most predictions, the label is decided by the order: positive > neutral > negative. If all these word pairs are predicted as label N, we consider that a and o cannot constitute a triplet.
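The decoding rules above can be sketched as follows. This is a simplified Python rendering; the variable names and the grid-as-dictionary layout are our own assumptions:

```python
def decode_triplets(grid, n):
    """Decode (aspect_span, opinion_span, sentiment) triplets from a predicted
    upper-triangular tag grid (a dict keyed by (i, j), i <= j)."""
    def spans(tag):
        # Contiguous runs of `tag` on the main diagonal form complete spans.
        out, start = [], None
        for i in range(n):
            if grid[(i, i)] == tag and start is None:
                start = i
            elif grid[(i, i)] != tag and start is not None:
                out.append((start, i - 1))
                start = None
        if start is not None:
            out.append((start, n - 1))
        return out

    rank = {"POS": 0, "NEU": 1, "NEG": 2}   # tie-break: positive > neutral > negative
    triplets = []
    for a in spans("A"):
        for o in spans("O"):
            votes = {}
            for i in range(a[0], a[1] + 1):
                for j in range(o[0], o[1] + 1):
                    tag = grid[(min(i, j), max(i, j))]
                    if tag in rank:
                        votes[tag] = votes.get(tag, 0) + 1
            if votes:                        # all-N cells form no triplet
                best = max(votes, key=lambda t: (votes[t], -rank[t]))
                triplets.append((a, o, best))
    return triplets
```

The `max` key implements majority voting with the stated tie-breaking order, and pairs whose cells are all N are discarded.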

Semantic and Syntactic Enhanced ASTE Model
Since this task requires extracting multiple elements from a sentence, it is important to design a model that can effectively distinguish the properties of words and master the relationships between them. S3E2 first uses an LSTM to encode sentences so that contextual semantics can be perceived. To capture many-sided features, S3E2 then applies a graph neural network to model the syntactic dependency, semantic relatedness, and positional relationships between words. Finally, an inference strategy is proposed that makes only one inference pass to extract more accurate triplets for all datasets.

Graph-Sequence Dual Representation
We first apply a bidirectional Long Short-Term Memory (LSTM) network (Hochreiter and Schmidhuber, 1997) to encode the input sentence x. LSTM is capable of learning contextual semantic representations since it can mark key semantics from previous time steps. Hence, we learn contextual features {h_1, h_2, ..., h_n} for the input sequence. We observe that different words in a sentence often have various internal relationships. As illustrated in Figure 1, there is a syntactic dependency between waiters and friendly, since opinions often modify aspects. Besides, words that are semantically similar may also be related: the two opinions, friendly and so so, although far apart, still have a dependency between them. Therefore, modeling these relationships and grasping the semantic and syntactic information of words is of great help. With this in mind, we build a unique text graph for every input sentence using a graph neural network.
Formally, a text graph G = (V, E) is a structure used to represent words and their relations, consisting of a set of nodes V and a set of edges E. Each word in the sentence is regarded as a node, while the relationships between words are considered edges. We construct three types of edges: self-loop edges, neighbor edges, and dependency edges. An edge connecting a node to itself is a self-loop edge; an edge connecting a node and its neighbor is a neighbor edge; and if a dependency relationship exists between two nodes, there is a dependency edge between them. Specifically, we define the edge set of the text graph as follows:

E = {(v_i, v_j) | j = i, or |i - j| = 1, or v_j ∈ D_i},

where D_i represents the set of nodes with which node v_i has a dependency. All edges are bidirectional, and the node feature for v_i is taken from h_i. We adopt GraphSAGE (Hamilton et al., 2017) to generate integrated representations {h̃_1, h̃_2, ..., h̃_n} for the nodes. We choose the LSTM aggregator from GraphSAGE because it has a stronger expressive ability. Then, we concatenate the integrated representations of w_i and w_j to represent each word pair, r_ij = [h̃_i; h̃_j], where [·;·] is the concatenation operation. All word-pair representations correspond to cells in our grid and are fed to a linear layer to calculate the initial probability distribution z_ij ∈ R^{|G|}:

z_ij = softmax(W_s r_ij + b_s),

where W_s and b_s are trainable parameters.
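As a rough illustration, the edge construction for the text graph might look like the following. The concrete graph data structure fed to GraphSAGE is an implementation detail; the dependency arcs are assumed to come from a parser such as Stanza:

```python
def build_text_graph(n, dep_pairs):
    """Construct the edge set for a sentence of n words.
    dep_pairs: syntactic dependency arcs (i, j), assumed given by a parser.
    All edges are added in both directions, matching the bidirectional
    text graph described above."""
    edges = set()
    for i in range(n):
        edges.add((i, i, "self-loop"))            # self-loop edges
    for i in range(n - 1):
        edges.add((i, i + 1, "neighbor"))         # neighbor edges
        edges.add((i + 1, i, "neighbor"))
    for i, j in dep_pairs:
        edges.add((i, j, "dependency"))           # dependency edges
        edges.add((j, i, "dependency"))
    return edges

# A 3-word sentence with one dependency arc between word 0 and word 2.
edges = build_text_graph(3, dep_pairs=[(0, 2)])
```

Each word then receives messages from itself, its immediate neighbors, and its syntactic dependents, which is what lets the GNN relate distant but syntactically linked words such as friendly and so so.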

Inference Strategy
The initial probability distributions z_ij between all word pairs obtained above can further facilitate more accurate extraction of triplets. For instance, if (0, 0) and (2, 2) in the grid tagging example are predicted to be A and O, respectively, then the position (0, 2) at which they intersect is even less likely to be predicted as N, and vice versa. Also, since many aspects or opinions consist of multiple words, if a certain coordinate is predicted as one of the sentiment labels in S, its adjacent positions are more likely to be predicted as the same sentiment label. Therefore, we employ an inference strategy that obtains more accurate triplets by exploiting the characteristics of the initial probability distributions. Formally, the new feature representation g_ij is learned as follows:

g_ij = W_g [r_ij; z_i; z_j; z_ij] + b_g,

where W_g and b_g are trainable parameters and [·;·] is the concatenation operation. Concretely, z_i is obtained from z_{i,:} = (z_{1:i,i}, z_{i,i:n}) because of the upper triangular grid in GTS; z_i / z_j captures the associated features between w_i / w_j and the other words.
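A minimal sketch of this single inference pass is given below, under the assumption that z_i is max-pooled from the upper-triangular cells involving w_i; the exact pooling operator and data layout are implementation choices, not details fixed by the text:

```python
def infer_once(r, z, W_g, b_g):
    """One inference pass (a sketch under assumptions).
    r[(i, j)]: pair representation for (w_i, w_j), a list of floats.
    z: n x n grid of initial tag distributions (lists of floats).
    Returns the new features g_ij for all cells with i <= j."""
    n = len(z)

    def pooled(i):
        # All upper-triangular cells involving w_i: the column above the
        # diagonal plus the row to its right; element-wise max-pool them.
        cells = [z[k][i] for k in range(i + 1)] + [z[i][k] for k in range(i, n)]
        return [max(c[d] for c in cells) for d in range(len(cells[0]))]

    feats = {}
    for i in range(n):
        for j in range(i, n):
            # Concatenate [r_ij; z_i; z_j; z_ij], then apply the linear layer.
            x = r[(i, j)] + pooled(i) + pooled(j) + z[i][j]
            feats[(i, j)] = [sum(w * v for w, v in zip(row, x)) + b
                             for row, b in zip(W_g, b_g)]
    return feats

# A 2-word toy example: 3-dim pair features, 2 tags, one output unit.
feats = infer_once(
    r={(0, 0): [1.0, 1.0, 1.0], (0, 1): [1.0, 1.0, 1.0], (1, 1): [1.0, 1.0, 1.0]},
    z=[[[0.1, 0.9], [0.2, 0.8]], [[0.0, 0.0], [0.3, 0.7]]],
    W_g=[[1.0] * 9], b_g=[0.0])
```

Because every cell sees the pooled context of its row and column, a confident A on (0, 0) and O on (2, 2) can jointly push the cell (0, 2) away from N, which is the intuition described above.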
It is worth noting that the inference strategy of Wu et al. (2020) cannot capture the relationships between words well, thus requiring an indefinite number of inference iterations, which increases the time complexity when the number of inferences is large. In contrast, with the semantic and syntactic enhanced module, we only need to infer once for all datasets, which further demonstrates the superiority of S3E2.
Finally, we send g_ij to a linear layer with a softmax activation function for classification:

p_ij = softmax(W_p g_ij + b_p),

where W_p and b_p are trainable parameters.

Training Loss Function
The training goal for the ASTE task is to minimize the cross-entropy error over all word pairs. The unified loss function is defined as:

L = - Σ_{i=1}^{n} Σ_{j=i}^{n} Σ_{k=1}^{|G|} I(y_ij^k = 1) log p_ij^k,

where y_ij denotes the one-hot vector of the ground truth for the word pair (w_i, w_j) and I(·) indicates whether the k-th component is 1.
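For concreteness, the loss over the upper-triangular grid can be computed as below. This is a plain-Python sketch; representing p and y as dictionaries keyed by word-pair coordinates is our own convention:

```python
import math

def grid_loss(p, y):
    """Cross-entropy summed over all upper-triangular word pairs.
    p[(i, j)]: predicted distribution over the tag set G (list of floats);
    y[(i, j)]: index of the gold tag for word pair (w_i, w_j)."""
    return -sum(math.log(probs[y[ij]]) for ij, probs in p.items())

# Two cells: one confident correct prediction, one 50/50 guess.
loss = grid_loss({(0, 0): [1.0, 0.0], (0, 1): [0.5, 0.5]},
                 {(0, 0): 0, (0, 1): 1})
```

Only the log-probability of the gold tag contributes for each cell, which is exactly the indicator-weighted sum in the equation above.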

Datasets
We conduct experiments on four datasets compiled by Wu et al. (2020). Each dataset is divided into three parts: a training set, a validation set, and a test set. Table 1 lists the statistics of these datasets. 14res, 15res, and 16res belong to the restaurant domain, while 14lap is from the laptop domain. Each sentence is annotated with a sequence of aspect tags and opinion tags, together with the sentiment polarity of the corresponding aspects. These datasets originally come from the SemEval Challenges (Pontiki et al., 2014, 2015, 2016). Note that each sentence may have more than one aspect and opinion. Besides, one aspect may be associated with multiple opinions and vice versa. For 14res, 14lap, 15res, and 16res, the proportion of one-to-many data reaches 37.27%, 38.54%, 33.39%, and 33.13%, respectively. Various relationships usually exist between aspects and opinions, and using them is beneficial to triplet extraction. We also count the ratio of triplets with implicit relationships; for the four datasets, it is 79.37%, 74.22%, 76.27%, and 80.57%, respectively.

Baselines
We compare the performance of S3E2 with the following approaches. Most current triplet extraction models operate in a pipeline manner, and a few state-of-the-art models work in an end-to-end way.
• Peng-unified-R+PD. Peng et al. (2020) proposed a two-stage pipeline approach. The first-stage model (Peng-unified-R) jointly extracts aspects with sentiment using the unified tagging schema and locates opinions with the BIEOS tagging schema, leveraging the mutual information between aspects and opinions. In the second stage, all candidate triplets are generated, and an MLP-based classifier (PD) determines whether each triplet is valid.

• Li-unified-R+PD. A pipeline approach combined by Peng et al. (2020). In the first stage, the model of Li et al. (2019a) is modified to co-extract aspects with sentiment as well as opinions. The second stage applies the same classifier (PD) mentioned above to obtain all valid triplets.

• Peng-unified-R+IOG. A pipeline approach combined by Wu et al. (2020). It first employs the Peng-unified-R model of Peng et al. (2020) to extract aspects with sentiment, then uses IOG (Fan et al., 2019) to produce the final triplets. IOG encodes information from a given aspect to extract its opinion words.

• IMN+IOG. Another pipeline approach combined by Wu et al. (2020). It first employs IMN (He et al., 2019) to extract aspects with sentiment, then uses IOG (Fan et al., 2019) to produce the final triplets.

• Grid. A state-of-the-art model proposed by Wu et al. (2020), which designs a grid tagging schema to address triplet extraction in an end-to-end way. It employs an inference strategy to utilize the mutual indications between different opinion factors. For a fair comparison, we choose their models Grid-CNN and Grid-BiLSTM, which use a CNN encoder and a BiLSTM encoder, respectively.

Implementation Details
Following previous work (Wu et al., 2020), we combine a 300-dimensional general-domain embedding from GloVe (Pennington et al., 2014), pre-trained on 840 billion tokens, with a 100-dimensional domain-specific embedding trained with fastText (Bojanowski et al., 2017) to initialize the double word embeddings for S3E2. The learning rate is 0.001, and the dropout rate is 0.5. We use Adam (Kingma and Ba, 2015) as the optimizer for S3E2. The number of LSTM layers is 1, and the cell size is set to 50. The aggregator type we choose from GraphSAGE is LSTM. We use Stanza (Qi et al., 2020) to parse the dependencies in each sentence. The batch size is set to 32 for all datasets, and the validation set is used for early stopping. We select the best model according to the best F1 score on the validation set and evaluate it on the test set. Following previous work, we report experimental results based on precision (P), recall (R), and F1 scores. Note that the F1 score measures the performance of matching triplets, which means a triplet is correct only when the aspect span, its corresponding sentiment, and the opinion span are all correct.

Main Results For Triplet Extraction

Table 2 presents the main results of the final triplet extraction. S3E2 significantly surpasses all baselines on all datasets. Compared with the best existing baselines, S3E2 achieves clear absolute F1 score gains of 2.02% and 1.31% on 15res and 16res, respectively, and impressive gains of 0.80% and 0.63% on 14res and 14lap, respectively. Except for Grid-CNN and Grid-BiLSTM, the other models are all pipeline methods.

The experimental results show that S3E2 substantially outperforms these methods, which strongly demonstrates the advantage of the semantic and syntactic enhanced model. When we compare S3E2 with the competitive baselines Grid-CNN and Grid-BiLSTM in detail, we find that we perform better on 14res and 15res because we extract a more complete set of triplets on these two datasets, resulting in a higher recall; we perform better on 14lap and 16res because we extract more accurate triplets, resulting in a higher precision. Such comprehensive results demonstrate the strength of S3E2, which is able to learn multi-faceted semantics and is good at extracting triplets.

Ablation Study
To investigate the effectiveness of the different modules in S3E2, we conduct an ablation study for the ASTE task. As shown in Table 3, S3E2 represents our full model equipped with all modules. Next, we carefully observe the role of each module by introducing four model variants, namely Dep, Infer, Graph, and BiLSTM.
Infer means removing the inference strategy from S3E2. We can see that the F1 scores drop sharply, which shows that the inference strategy can grasp the relationships among the three elements of the triplets from the previous round of predictions to promote the ASTE task. Dep means that, when constructing the text graph for a sentence, we do not add the third edge type mentioned above. The F1 scores drop except on 14res, showing that overall the dependency edges help the model better master the relationships. The training set of 14res is larger than the other datasets; when training the full model, we may overfit due to the parameter settings (e.g., epoch, batch size), resulting in slightly lower performance compared with Dep. Graph means removing the graph-based GNN modules. After removing the entire graph, the performance of the model is greatly reduced. Obviously, the graph neural network can well perceive relational semantics and distinguish the characteristics of words. The F1 scores also decline sharply when we remove the BiLSTM, which shows that contextual semantic information is helpful. Comparing Graph and BiLSTM, we find that the former achieves higher results on 14lap and 16res, which suggests that these two datasets depend more on contextual semantic features. In general, each module of S3E2 contributes to the extraction of triplets.

Effects of Aggregator Types
In order to study the impact of aggregator types on performance, we report the results of different aggregator types for the ASTE task on the four datasets in Table 4. There are two types of aggregators, LSTM and Mean, adopted from Hamilton et al. (2017). The former is based on the LSTM structure (Hochreiter and Schmidhuber, 1997) and is applied to a random permutation of a node's neighbors. The latter is simply based on the mean operation. As shown in Table 4, when the two aggregators have the same number of network layers, regardless of how many layers, the LSTM aggregator outperforms the Mean aggregator. This phenomenon indicates that the LSTM aggregator has a stronger expressive ability and is more suitable for the ASTE task.

Effects of Graph Network Layers
To examine the effect of the number of graph network layers, we also present the results of different layer counts on the four datasets for triplet extraction. It can be observed that performance increases as the number of layers increases from 2 to 3 for the same type of aggregator. This shows that the ability of graph neural networks to gather features is related to the number of network layers. We notice that when the number of layers is set to 2, the LSTM aggregator outperforms the Mean aggregator by 0.55%, 0.32%, 0.69%, and 0.23% on the four datasets, respectively. When the number of layers is 3, the gaps are 1.80%, 1.75%, 4.56%, and 3.17%. As the number of layers increases, the performance gap between the LSTM aggregator and the Mean aggregator widens significantly, which further illustrates the advantage of the LSTM aggregator.

Case Study
Five typical cases are presented in Table 5. The first example is a simple case without complicated word order, and all models predict it accurately. The second example comes from the restaurant domain and expresses a negative attitude tactfully. Both Grid-BiLSTM and Grid-CNN incorrectly predict the sentiment for "staff", and Grid-CNN mistakenly predicts "should be" as an aspect.
The third example, picked from the laptop domain, directly expresses negative sentiment. We can observe that Grid-BiLSTM and Grid-CNN mistakenly regard "maintain" as an aspect and also make a false sentiment prediction. For these two examples, S3E2 makes accurate judgments, which shows that S3E2 can better understand the context and distinguish the characteristics of words.
There are two triplets in the fourth example. All methods extract the triplet containing "screen". Unlike the other models, S3E2 successfully identifies the second aspect "speed" and its sentiment. Though it misses the opinion word "much", S3E2 shows stronger recognition ability.
The last one is a more complicated example with three triplets, where one aspect corresponds to multiple opinions. We see that Grid-BiLSTM mistakenly matches "food" and "excellent" as a triplet, while both Grid-CNN and S3E2 make correct predictions. In general, the above analysis further proves that S3E2 can better understand the semantics and recognize the relationships more accurately.

Related Work
ASTE originates from another widely studied research topic, Aspect-Based Sentiment Analysis (ABSA) (Pontiki et al., 2014, 2015, 2016). The research progress of ABSA can be divided into three stages.

Separate Extraction. Traditional studies divide ABSA into three subtasks, namely aspect extraction (AE), opinion extraction (OE), and aspect sentiment classification (ASC). The AE task (Yin et al., 2016; Li et al., 2018b; Xu et al., 2018; Ma et al., 2019) requires the extraction of aspects, while the OE task (Fan et al., 2019) aims to identify the opinions expressed on them. The ASC task, which refers to classifying the sentiment polarity for a given aspect target, has attracted much more attention (Ma et al., 2018; Li et al., 2018a; Xue and Li, 2018; Wang et al., 2018; Li et al., 2019b) because the sentiment element carries crucial semantic information for a text. Zhang et al. (2019) develop aspect-specific Graph Convolutional Networks (ASGCN) integrated with an LSTM for the ASC task. Compared with ASGCN, S3E2 has richer edge types and fewer training parameters. Since its aspect-specific structure depends on the given aspect, ASGCN lacks scalability and cannot be extended to triplet extraction in an end-to-end fashion. Besides, solving these three subtasks individually has limited practical value and ignores the internal relations between them.
Pair Extraction. Recently, many studies have proposed effective models to jointly extract aspects and their sentiments (Zhang et al., 2015; Lu, 2017, 2019; Li et al., 2019b,a). Hu et al. (2019) design a span-based method but conclude that the pipeline model is better than the unified model. There is also work that co-extracts aspects and opinions (Dai and Song, 2019). These pair extraction models still cannot capture a complete picture of the sentiment or dig deeper into the interconnections between subtasks.
Triplet Extraction. The ASTE task is more challenging and has greater application value. Peng et al. (2020) first propose a two-stage model for ASTE, which in the first stage co-extracts aspects with their associated sentiment and performs opinion extraction as a standard sequence labeling task. The second stage employs a binary classifier to match aspects and opinions to obtain the final triplets. Following this work, a position-aware tagging scheme was employed to extract triplets jointly, but it cannot handle the one-to-many phenomenon. Wu et al. (2020) design a novel grid tagging schema to address triplet extraction, but their end-to-end model ignores the dependencies among words. Besides, the number of inference rounds of their inference strategy is not unified across datasets, which may cause instability and high time complexity as the rounds increase.
Conclusion

Aspect Sentiment Triplet Extraction (ASTE) requires extracting aspects, the corresponding opinions, and sentiments from user reviews. Different from previous work, we take advantage of the multiple semantic relationships between word pairs and effectively capture the inner connections among the three elements. In this paper, we construct a novel model with a relational structure by creating a unique text graph for each sentence using Graph Neural Networks (GNNs). We also combine an LSTM to obtain contextual semantics. Through this rich structure, S3E2 can understand the context well and effectively recognize the relationships between words. Besides, the inference strategy becomes more efficient because it needs only one inference pass for all datasets, reducing the time complexity. Our end-to-end model achieves state-of-the-art performance on all datasets for triplet extraction. Experimental results show that S3E2 remarkably captures the connections between word pairs and recognizes their relationships.