On the Strength of Sequence Labeling and Generative Models for Aspect Sentiment Triplet Extraction

Introduction
Aspect sentiment triplet extraction (ASTE) aims at extracting all triplets in a sentence, each consisting of an aspect term, an opinion term, and the sentiment polarity toward the aspect. Given the example "Their twist on pizza is healthy, but full of flavor." in Fig. 1 (a), the goal is to extract two triplets: (twist on pizza, healthy, positive) and (flavor, full, positive).
Conventional approaches to ASTE include pipeline (Peng et al., 2020), table filling (Chen et al., 2022), sequence tagging (Xu et al., 2020; Wu et al., 2020b), and hybrid methods (Xu et al., 2021). More recently, there is an emerging trend of adopting generative models for ASTE (Yan et al., 2021; Zhang et al., 2021b,a; Lu et al., 2022) to alleviate error propagation and exploit full label semantics.
Current generative ASTE models employ a classical encoder-decoder architecture and follow a paradigm that first generates a target sequence Y and then recovers the triplets T from Y. The model needs to pre-define an output template ψ(·) to convert ASTE into text generation, and it calculates the loss between the triplets and the generated sequence for training, as shown in Fig. 1 (b). The template ψ(·) constructed by existing methods takes the form of ψ_{a→o} or ψ_{o→a}, reflecting a unidirectional dependency from aspect to opinion, or vice versa. However, the aspect and opinion terms that appear together in one sentence may hold informative clues about each other (Chen and Qian, 2020b), and there is no intrinsic order between them (Chen et al., 2021). Hence, modeling only a unidirectional dependency may mislead the model into generating falsely paired triplets like (twist on pizza, full, positive).
Existing generative ASTE models also suffer from another challenge: they lack the ability to handle complex structures, especially multi-word terms and multi-triplet sentences. On the one hand, the token-by-token decoding manner makes the model focus only on the next token at each decoding step, without grasping the whole aspect/opinion term when it spans multiple words. On the other hand, generative models often deploy a simple-structured prompt template to ensure generation quality. When handling a sentence with multiple triplets, a generative model needs to invoke the template several times, which may lead to information confusion for the same marker in the template.
To address the aforementioned issues, we propose a sequence labeling enhanced generative model for ASTE.
Firstly, we design two bidirectional templates with different decoding orders to capture the mutual dependency between the aspect and opinion terms. In particular, we add two types of prompt prefix before the input sentence to indicate the decoding order, and we present two output templates ψ_{a→o} and ψ_{o→a}, both consisting of the markers {aspect, opinion, sentiment} and the corresponding labels {a, o, s}. In this way, the decoder can generate two sequences, one reflecting the dependency from aspect to opinion and the other from opinion to aspect.
Secondly, we propose a marker-oriented sequence labeling (MOSL) module to enhance the generative model's ability to handle complex structures. Specifically, decoding is conducted after the MOSL module at the training stage, so the BIO tags obtained in MOSL help the generative model capture the boundary information of multi-word aspect/opinion terms in advance. Moreover, while the generative model needs to invoke the output templates several times for a multi-triplet sentence, we adopt different marker vectors in MOSL for the same marker in the generative model. By doing this, we can share the markers without causing confusion. Since the markers encode information across multiple triplets in one sentence, previous markers can contribute to the decoding of subsequent triplets. Our proposed method is illustrated in Fig. 1 (c).
We conduct extensive experiments on four datasets under both fully supervised and low-resource settings. The results demonstrate that our model significantly outperforms the state-of-the-art baselines for the ASTE task.
Related Work

To meet practical needs, some recent studies propose to extract two or more elements simultaneously, including aspect opinion pair extraction (Zhao et al., 2020; Wu et al., 2021; Gao et al., 2021), end-to-end aspect-based sentiment analysis (Hu et al., 2019; Chen and Qian, 2020b; Oh et al., 2021), and aspect sentiment triplet extraction. Among them, ASTE is regarded as a nearly complete task and is the most challenging.
Earlier work on ASTE can be sorted into four streams, i.e., pipeline (Peng et al., 2020), table filling (Chen et al., 2022), sequence tagging (Xu et al., 2020; Wu et al., 2020b), and hybrid methods (Xu et al., 2021; Chen et al., 2021; Mao et al., 2021). These methods do not fully utilize the rich label semantics, and some of them may encounter the error propagation problem.
Another line of research performs ASTE in a generative manner (Zhang et al., 2021a,b). For example, Yan et al. (2021) model the extraction and classification tasks as the generation of pointer indexes and class indexes. Lu et al. (2022) introduce a structured extraction language and a structural schema instructor to unify all information extraction tasks. While achieving better performance, current generative models are prone to generating falsely paired triplets and are ill-suited for tackling complex structures. Our generative model addresses these issues with the proposed bidirectional templates and the marker-oriented sequence labeling module.

Our Method
Given a review sentence X with L words, the goal of ASTE is to extract all triplets T = {(a_i, o_i, s_i)}_{i=1}^{N} in X, where N is the number of triplets, and a_i, o_i, and s_i denote the aspect term, opinion term, and sentiment polarity, respectively.
We first introduce the overall architecture of our proposed sequence labeling enhanced generative model (SLGM) in Fig. 2, which has the following distinguishing characteristics.
(1) To capture the mutual information between the aspect and opinion terms, we construct two bidirectional templates at both the input and output ends, shown as X_a/X_o and ψ_{a→o}/ψ_{o→a} in Fig. 2.
(2) To handle complex structures, we propose a marker-oriented sequence labeling (MOSL) module to capture the boundary information of multi-word aspect/opinion terms and the shared marker information across multiple triplets.

Bidirectional Template
Our bidirectional templates are used to guide the generation model in an end-to-end way.
For the input review X, we construct two sentences X_a and X_o by adding two types of prompt prefix, i.e., "aspect first:" and "opinion first:". Such a prefix can prompt the model to generate the target sequence with a specific decoding order when we fine-tune the model with these templates.
To obtain the output triplets T in a generative manner, an essential step is linearizing the triplets T into a target sequence during training and de-linearizing triplets from the predicted sequence during inference. In particular, a good output template is expected to: 1) ensure that the linearized target sequence can be easily de-linearized into a collection of triplets, 2) contain specific markers to prompt the decoding of labels, and 3) allow the order of labels to be changed freely. Based on the above considerations, we propose two marker-based templates ψ_{a→o} and ψ_{o→a} with different decoding orders between aspect and opinion terms:

ψ_{a→o} → aspect: a, opinion: o, sentiment: s
ψ_{o→a} → opinion: o, aspect: a, sentiment: s

Our output templates consist of two parts: the markers {aspect, opinion, sentiment} and the corresponding labels {a, o, s}. The markers guide the model to generate the specific type of label at the next step. When the input review contains several triplets, we need to sort the triplets to ensure the uniqueness of the target sequence. For the template ψ_{a→o}, we sort triplets by the end index of the aspect term in ascending order. If some triplets share the same aspect term, we further sort them by the end index of the opinion term. After obtaining the text segments of all triplets, we use a special symbol [SSEP] to concatenate these segments into the final target sequence.
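The linearization just described can be sketched in a few lines of Python. This is our own toy illustration, not the authors' released code; the dict field names and the tie-breaking rule for the opinion-first template are assumptions.

```python
def linearize(triplets, order="a->o"):
    """Linearize (aspect, opinion, sentiment) triplets into a target sequence.

    Each triplet is a dict with keys 'aspect', 'opinion', 'sentiment' and the
    end word indices 'aspect_end' / 'opinion_end' used only for sorting.
    """
    if order == "a->o":
        # sort by end index of the aspect term, breaking ties on the opinion term
        triplets = sorted(triplets, key=lambda t: (t["aspect_end"], t["opinion_end"]))
        template = "aspect: {aspect}, opinion: {opinion}, sentiment: {sentiment}"
    else:
        # symmetric rule for the opinion-first template (assumed here)
        triplets = sorted(triplets, key=lambda t: (t["opinion_end"], t["aspect_end"]))
        template = "opinion: {opinion}, aspect: {aspect}, sentiment: {sentiment}"
    # concatenate the per-triplet segments with the special [SSEP] symbol
    return " [SSEP] ".join(template.format(**t) for t in triplets)
```

For the Fig. 1 example, the aspect-first order yields "aspect: twist on pizza, opinion: healthy, sentiment: positive [SSEP] aspect: flavor, opinion: full, sentiment: positive".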

Template-Guided Text Generation
We employ a standard transformer-based encoder-decoder architecture for the text generation process and initialize the model's parameters with the pre-trained language model T5 (Raffel et al., 2020). For simplicity, we take the sentence X_a and the corresponding target sequence Y_a based on the template ψ_{a→o} as an example. We first feed X_a into the transformer encoder to get the contextual features H_enc. We then use a transformer decoder to generate the target sequence Y_a. At the t-th time step, the decoder calculates the hidden state h_t based on the contextual features H_enc and the previously decoded tokens y_{<t}. Next, h_t is used to compute the conditional probability of the token y_t, where W is the transformation matrix. Finally, we calculate the cross-entropy loss L_g^{a→o} between the decoder output and the target sequence Y_a.
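The computations above can be written compactly as follows (a sketch of the standard encoder-decoder formulation; the symbols follow the text, but the exact equations in the original may differ in detail):

```latex
H_{\mathrm{enc}} = \mathrm{Encoder}(X_a) \\
h_t = \mathrm{Decoder}(H_{\mathrm{enc}},\, y_{<t}) \\
p(y_t \mid y_{<t}, X_a) = \mathrm{softmax}(W h_t) \\
\mathcal{L}_g^{a\to o} = -\sum_{t=1}^{|Y_a|} \log p(y_t \mid y_{<t}, X_a)
```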

Marker-Oriented Sequence Labeling (MOSL)
The marker-based templates can prompt the generative model with the label types, including aspect, opinion, and sentiment. However, the classic encoder-decoder architecture prevents the model from handling complex structures. On the one hand, the decoding process is performed in a token-by-token manner, which cannot provide clear boundary information for multi-word aspect/opinion terms. On the other hand, the model needs to invoke the output templates repeatedly when the sentence contains multiple triplets. Such duplicate template-based decoding may cause information confusion and sacrifice the quality of the generated text. Therefore, we propose a marker-oriented sequence labeling (MOSL) module to solve these problems. The goal is to allow the model to incorporate the prompt information of aspect and opinion terms during the generation of the specific marker. Fig. 3 illustrates the text generation process enhanced by the MOSL module.
In MOSL, we tag the aspect and opinion terms through sequence labeling. We first use two linear transformations to extract aspect features H_a and opinion features H_o from the contextual features H_enc. Then, we take the last hidden state of the decoder corresponding to each marker as the marker features, including the aspect marker features M_a = {m_i^a}_{i=1}^{N} (N is the number of triplets) and the opinion marker features M_o = {m_i^o}_{i=1}^{N}.

We then calculate the marker-oriented features for each m_i^a ∈ M_a or m_i^o ∈ M_o for sequence labeling, where σ(·) is the SeLU activation function, h_j^a ∈ H_a and h_j^o ∈ H_o are the aspect/opinion features, and W and b are the transformation matrix and bias.
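One plausible form of this computation, assuming the marker feature is combined with each token feature by concatenation followed by a linear projection (the exact operator is an assumption on our part):

```latex
h^{ma}_{ij} = \sigma\!\left(W\,[\,m^{a}_{i}\,;\,h^{a}_{j}\,] + b\right), \qquad
h^{mo}_{ij} = \sigma\!\left(W\,[\,m^{o}_{i}\,;\,h^{o}_{j}\,] + b\right)
```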
Note that we deploy a tag-then-generate mechanism at the training stage: the MOSL module first predicts the BIO tags for the tokens in a sentence, and then the generation model starts to decode the tokens. Such a mechanism forces the text generation module to capture the boundary information of multi-word aspect/opinion terms.
When the input sentence contains multiple triplets, the aspect/opinion marker features at different positions correspond to different tagged sequences in the MOSL module, where Y_{ma} and Y_{mo} are the BIO tags in sequence labeling. Hence, the same marker in the generation module can share information without causing confusion, since it has different pointers referring to the multiple aspect/opinion terms in MOSL, which in turn benefits the decoding of sentences containing multiple triplets. We then feed the marker-oriented features into a fully connected layer to predict the tags of aspect/opinion terms and obtain the predicted probabilities over the label set. The training loss for MOSL is defined as the cross-entropy loss, where I(·) is the indicator function, y_{ij}^{ma} and y_{ij}^{mo} are the ground-truth labels, and C denotes the {B, I, O} label set.
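The tag prediction and the MOSL loss can be sketched as follows (notation follows the text; the precise parameterization and indexing are assumed):

```latex
p^{ma}_{ij} = \mathrm{softmax}\!\left(W_m h^{ma}_{ij} + b_m\right), \qquad
p^{mo}_{ij} = \mathrm{softmax}\!\left(W_m h^{mo}_{ij} + b_m\right) \\
\mathcal{L}_m = -\sum_{i}\sum_{j}\sum_{c \in C}
  \mathbb{I}\big(y^{ma}_{ij} = c\big)\,\log p^{ma}_{ij,c}
  \;-\; \sum_{i}\sum_{j}\sum_{c \in C}
  \mathbb{I}\big(y^{mo}_{ij} = c\big)\,\log p^{mo}_{ij,c}
```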
Training. For a better understanding of the bidirectional dependency, and to reduce space cost, we jointly optimize the two bidirectional templates for each sentence and label pair (X, T), where λ is a hyperparameter controlling the contributions of the two templates.
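A plausible form of the joint objective, assuming λ interpolates the two template directions and each direction combines the generation loss and the MOSL loss:

```latex
\mathcal{L} = \lambda\left(\mathcal{L}_g^{a\to o} + \mathcal{L}_m^{a\to o}\right)
            + (1-\lambda)\left(\mathcal{L}_g^{o\to a} + \mathcal{L}_m^{o\to a}\right)
```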

Inference
Constrained Decoding (CD). During inference, we employ a constrained decoding (CD) strategy, inspired by Bao et al. (2022) and Lu et al. (2021), to guarantee content and format legitimacy.
Content legitimacy means that aspect/opinion terms should be a single word or multiple contiguous words from the input sentence, and the sentiment must be positive, neutral, or negative.
Format legitimacy means that the generated sequence should meet the formatting requirements defined in the template. Both types of legitimacy can be viewed as constraints on the candidate vocabulary during the decoding process. Before decoding, we enumerate the candidate vocabulary for each token in the input sentence and the templates. We then use the constrained decoding strategy to adjust the candidate vocabulary according to the current input token at each decoding time step. For example, when we input the start token "</s>" to the decoder, the candidate token should be "aspect"/"opinion" to guarantee format legitimacy. When we input ":", the model needs to determine the first word of the aspect/opinion term, and the candidate tokens should be consistent with those in the input sentence.
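The candidate-vocabulary adjustment can be sketched with a toy next-token filter. This is our own word-level illustration of the idea, not the released implementation; a real T5 subword vocabulary would require additional bookkeeping.

```python
SENTIMENTS = {"positive", "neutral", "negative"}
MARKERS = {"aspect", "opinion", "sentiment"}

def candidates(prefix, sentence_tokens):
    """Return the legal next tokens given the tokens generated so far.

    An empty prefix corresponds to feeding the start token "</s>" to the decoder.
    """
    if not prefix or prefix[-1] == "[SSEP]":
        return {"aspect", "opinion"}          # each segment starts with a marker
    last = prefix[-1]
    if last in MARKERS:
        return {":"}                          # a marker is always followed by ":"
    if last == ":":
        if prefix[-2] == "sentiment":
            return set(SENTIMENTS)            # sentiment comes from a closed label set
        return set(sentence_tokens)           # a term must start with an input word
    if last == ",":
        return set(MARKERS)                   # next marker in the template
    if last in SENTIMENTS:
        return {"[SSEP]", "</s>"}             # finish the segment or stop decoding
    # inside a term: either close it with "," or continue with a word that
    # immediately follows `last` somewhere in the input (contiguity constraint)
    nxt = {sentence_tokens[i + 1]
           for i, tok in enumerate(sentence_tokens[:-1]) if tok == last}
    return nxt | {","}
```

At each decoding step the decoder's output distribution would be masked to this candidate set before sampling or beam search.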
Triplet De-linearization. At this point, we have generated two sequences Y_a and Y_o from the two input sentences X_a and X_o with the constrained decoding strategy. We then de-linearize them into two triplet sets T_a and T_o according to the pre-defined templates ψ_{a→o} and ψ_{o→a}, and take the intersection of T_a and T_o as the final prediction.
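De-linearization and the intersection step can be sketched as follows (a toy parser for the template format above; our illustration, not the authors' code):

```python
def delinearize(sequence):
    """Parse a generated sequence back into a set of (aspect, opinion, sentiment)."""
    triplets = set()
    for segment in sequence.split("[SSEP]"):
        fields = {}
        for part in segment.split(","):
            if ":" in part:
                key, value = part.split(":", 1)
                fields[key.strip()] = value.strip()
        # keep only well-formed segments containing all three fields
        if {"aspect", "opinion", "sentiment"} <= fields.keys():
            triplets.add((fields["aspect"], fields["opinion"], fields["sentiment"]))
    return triplets

def final_prediction(seq_a, seq_o):
    """Intersect the triplets decoded under the two template directions."""
    return delinearize(seq_a) & delinearize(seq_o)
```

A falsely paired triplet produced under only one decoding order is filtered out by the intersection.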

Datasets
Our proposed model is evaluated on four ASTE datasets released by Xu et al. (2020), which add the missing triplets that were not explicitly annotated in the previous version (Peng et al., 2020). All datasets are derived from the SemEval challenges (Pontiki et al., 2014, 2015, 2016) and consist of reviews in the laptop and restaurant domains. Table 1 shows the statistics of the four benchmark datasets.

Implementation Details
As mentioned in Sec. 3.2, T5-Base (Raffel et al., 2020) is used to initialize the parameters of our model. We train our model using the AdamW optimizer with an initial learning rate of 3e-4 and linear learning rate decay. The number of training epochs is set to 20 for fully supervised settings and 200 for low-resource and few-shot settings. When encoding the bidirectional dependency jointly, we set the batch size to 32 and λ to 0.5. The results for supervised and low-resource settings are averaged over five and ten runs with different random initializations, respectively. All experiments are conducted on an NVIDIA RTX 3090 GPU.

Baselines
To validate the effectiveness of our proposed model, we compare it with 14 state-of-the-art baselines, which fall into three categories. (1) Pipeline methods: CMLA+, RINANTE+, Li-unified-R, and Peng-two-stage, all reported by Peng et al. (2020). (2) Unified non-generative methods: JET-BERT (Xu et al., 2020), OTE-MTL (Zhang et al., 2020), GTS-BERT (Wu et al., 2020b), SPAN-ASTE (Xu et al., 2021), BMRC (Chen et al., 2021), and EMC-GCN (Chen et al., 2022). (3) Generative methods: BART-GEN (Yan et al., 2021), GAS (Zhang et al., 2021b), PARAPHRASE (Zhang et al., 2021a), and SSI+SEL (Lu et al., 2022), a method which relies on extra data like Wikipedia and Wikidata.

Main Results. Generative methods like GAS, which use the classic encoder-decoder architecture, can outperform most non-generative methods without complicated architectures by learning label semantics. We also find that the non-generative method BMRC achieves competitive precision scores on the four datasets because it also considers the bidirectional dependency. By combining text generation and sequence labeling in training to tackle complex extraction scenarios, our SLGM method improves the precision of GAS by more than 7 points and the recall of BMRC by more than 10 points.

Low-resource settings. To validate the model's performance in low-resource scenarios, we follow the settings in SSI+SEL (Lu et al., 2022) and conduct experiments on six different partitions of the original training sets (1/5/10-shot and 1/5/10% ratio), reporting scores averaged over 10 random runs. SSI+SEL adopts a pre-training process which helps the model capture general information from additional data. However, as shown in Table 3, our SLGM outperforms SSI+SEL by a large margin on all partitions without such a pre-training process. The performance gap between SLGM and SSI+SEL is even more impressive under the low-resource settings than under the supervised ones. This clearly demonstrates that our SLGM model can be quickly adapted to low-resource scenarios with very few samples, which is an extremely desirable property of our model.

Ablation Study
To examine the impacts of three key components of our model, namely marker-oriented sequence labeling (MOSL), the bidirectional templates (ψ_{a→o} and ψ_{o→a}), and constrained decoding (CD), we conduct an ablation study on the four datasets under supervised settings. The results are shown in Table 5. We make the following observations. Firstly, removing either of the two bidirectional templates causes a performance drop, and ψ_{a→o} contributes more to the model than ψ_{o→a}.
Secondly, the extraction performance decreases dramatically after removing MOSL, which clearly proves the effectiveness of the MOSL module. We explore the impacts of MOSL further in Sec. 5.3.
Thirdly, "w/o CD" denotes that we directly use the whole vocabulary instead of taking the format and content constraints into account. We find that the performance degrades slightly on Lap14, Res14, and Res15, but increases on Res16. The reason might be that limiting the size of the candidate vocabulary leads the model to generate some wrong but legal triplets. However, the large amount of training data under the supervised settings allows the model to adaptively fit the target text.
To confirm this hypothesis, we further investigate the impact of CD under the low-resource settings on the Res16 dataset. The results are shown in Table 6. We can see that as the number of training samples decreases, the performance gain from CD becomes more significant. This suggests that the CD strategy plays a more important role in data-scarce scenarios.

Impacts of Bidirectional Templates
We model the mutual dependency between aspect and opinion terms using the bidirectional templates, with the purpose of avoiding falsely paired aspect-opinion triplets. We investigate the impacts of the bidirectional templates and show the results in Table 4. Besides, we plot the performance under different settings of λ in Fig. 4 to further validate the importance of the bidirectional dependency.
It can be seen that the unidirectional decoding orders T_a/T_o obtain better recall scores but generate many false triplets, and thus have low precision. By capturing the mutual dependency and taking the intersection of T_a and T_o, our model can effectively filter falsely paired triplets and significantly enhance the precision and F1 scores. Moreover, when λ is biased towards ψ_{a→o} or ψ_{o→a}, the performance tends to decrease, while setting λ to 0.5 achieves optimal results on most of the datasets. This further confirms that the two directions of the dependency are equally important.

Impacts of Marker-Oriented Sequence Labeling (MOSL)
Table 1 shows that multi-word triplets account for roughly one third of all triplets, while about half of the sentences contain multiple triplets. Our MOSL module allows the model to learn the prompt information of aspects and opinions through our tag-then-generate mechanism during training, which improves the model's ability to handle complex structures. We verify the effects of MOSL in this section. Table 7 shows the performance under two evaluation modes, where "Single-Word" denotes that both the aspect and opinion terms in a triplet are single-word spans, and "Multi-Word" denotes that at least one of them is a multi-word span. We find that the model obtains more significant improvements on multi-word triplets than on single-word triplets after adding the MOSL module. This shows that, with the guidance of MOSL, the model can learn the boundary information of aspect/opinion terms and generate the complete terms.
Table 7 also presents the results for sentences containing a single triplet ("Single-") or multiple triplets ("Multi-"), where the MOSL module makes similar contributions. As can be seen, the model with MOSL gains more improvement when the review contains multiple triplets.
In addition, we mix the test sets of Res14, Res15, and Res16 to evaluate the model under the multi-triplet setting. The ratio of the averaged improvement in the multi-triplet setting to that in the single-triplet setting is 1.77 on the three individual datasets, while it increases to 3.15 on the mixed dataset. This is because all aspect/opinion features in MOSL point to the same marker "aspect"/"opinion", which allows the marker to share knowledge across different aspect/opinion features. Thus, when generating the earlier triplets, the text generation module already holds clues from the shared marker about the subsequent aspect/opinion terms.

Analysis on Computational Cost
To demonstrate that our model does not incur too much computational cost, we compare it with GAS in terms of the number of parameters and inference time, as shown in Table 8. We also analyze the costs of the key components in our model to show their impact on complexity. Firstly, the MOSL module adds only about 2.3M parameters compared with GAS. Secondly, the constrained decoding algorithm increases the inference time, as our implementation requires determining the candidate vocabulary according to the current input token at each decoding time step, which undermines the parallelism of the generation model during inference. Moreover, the bidirectional templates require the model to generate target sequences under two different decoding orders, which also increases the inference time to some extent. However, SLGM does not show significant differences from GAS in terms of model parameters and inference time, because GAS needs to apply a prediction normalization strategy to refine its prediction results.

[Fig. 5 examples: "It feels cheap, the keyboard is not very sensitive." and "At home, so built in screen size is not terribly important.", each shown with the gold triplets and the predictions of SSI+SEL, PARAPHRASE, and SLGM.]

Case Study
We conduct a case study on two reviews to compare typical generative methods, including PARAPHRASE (Zhang et al., 2021a), SSI+SEL (Lu et al., 2022), and our method. The results are shown in Fig. 5.
For the first review (the left one in Fig. 5), SSI+SEL and PARAPHRASE cannot recognize the opinion term "cheap", whereas "not very sensitive" is recognized by all methods. In contrast, our SLGM identifies both terms. To take a closer look, we visualize the BIO probabilities output by MOSL in Fig. 6. As shown in the left part of Fig. 6, the opinion marker in MOSL attends to both opinion terms simultaneously when the generation module generates the first triplet, which helps the model recognize that there are two related opinion terms for the aspect term "keyboard".
For the second review (the right one in Fig. 5), both SSI+SEL and PARAPHRASE find the approximate locations of the aspect and opinion terms, but neither obtains correct pairs due to incomplete decoding. The reason is that these two methods lack the corresponding prompt information for boundary identification. Meanwhile, as can be seen from the right part of Fig. 6, the aspect marker in MOSL attends to the complete aspect term, which provides the boundary information that helps our generation module decode the complete aspect term.

Conclusion
In this paper, we exploit the power of text generation and sequence labeling for ASTE. We propose two bidirectional templates to reflect the mutual aspect-opinion dependency for filtering falsely paired triplets. We also present a marker-oriented sequence labeling module to help the text generation module tackle complex structures in the subsequent decoding process. Experimental results show that our framework consistently outperforms all generative and non-generative baselines under both fully supervised and low-resource settings.

Limitations
Although our proposed method achieves state-of-the-art performance, it still has a few limitations. Firstly, we only consider the dependency between aspect and opinion in the target text while ignoring the order influence in the input text, which might bring further improvements. Secondly, there are three label types in ASTE: aspect, opinion, and sentiment. Currently, we only utilize the aspect and opinion markers in the marker-oriented sequence labeling module. We believe that a specific design for the sentiment marker could further improve the performance, which is a promising future direction.

Figure 1 :
Figure 1: (a) shows an example of the ASTE task. (b) and (c) illustrate the difference between our proposed generative method and existing ones, where X is the input sentence and T denotes the target triplets. X_a/X_o contain the prompt prefixes defining the decoding order (aspect or opinion first), while Y_a/Y_o indicate the generated sequences following the order in X_a/X_o. MOSL is our marker-oriented sequence labeling module that improves the generative model's ability to handle complex structures.

Figure 2 :
Figure 2: The overall architecture of our sequence labeling enhanced generative model (SLGM).

Figure 3 :
Figure 3: The process of template-guided text generation enhanced by the MOSL module.

Figure 5 :
Figure 5: Case study. The aspect and opinion terms are highlighted in green and blue, respectively. The orange line denotes that the aspect term matches the opinion term and the model correctly predicts the sentiment polarity.

Figure 6 :
Figure 6: Visualization of the BIO probabilities output by the MOSL module.

Table 1 :
Statistics of the datasets. #S, #T, and N denote the number of sentences, the number of triplets, and the number of triplets in a sentence, respectively. #MW denotes the number of triplets in which at least one of the aspect/opinion terms contains multiple words.

Table 2 :
Results for supervised settings. The baseline results with "‡" are retrieved from Yan et al. (2021), Xu et al. (2021), and Chen et al. (2022). We reproduce the generative methods marked with "†" using their released code. The best and second-best F1 scores are in bold and underlined, respectively. The * marker denotes statistically significant improvements with p < 0.01 over the second-best results by SSI+SEL.

Table 3 :
Results for low-resource settings, where AVG-S and AVG-R are the average results across the 3 few-shot and 3 low-resource settings, respectively. The best F1 scores are in bold. The * marker denotes statistically significant improvements with p < 0.01 over SSI+SEL.

Table 5 :
Results for ablation study under supervised settings.

Table 6 :
Results for ablation study under low-resource settings for constrained decoding (CD) on the Res16 dataset.

Table 7 :
Impacts of the MOSL module with different evaluation modes.

Table 8 :
Complexity analysis on the Lap14 dataset. The results marked with "†" are reproduced based on the released code.