Uncertainty-Aware Unlikelihood Learning Improves Generative Aspect Sentiment Quad Prediction

Recently, aspect sentiment quad prediction (ASQP) has received widespread attention in the field of aspect-based sentiment analysis. Existing studies extract quadruplets via pre-trained generative language models that paraphrase the original sentence into a templated target sequence. However, previous works focus only on what to generate and ignore what not to generate. We argue that considering negative samples also brings potential benefits. In this work, we propose a template-agnostic method to control token-level generation, which boosts the original learning objective and reduces mistakes simultaneously. Specifically, we introduce Monte Carlo dropout to understand the built-in uncertainty of pre-trained language models and thereby expose their noises and errors. We further propose marginalized unlikelihood learning to suppress the uncertainty-aware mistake tokens. Finally, we introduce minimization entropy to balance the effects of marginalized unlikelihood learning. Extensive experiments on four public datasets demonstrate the effectiveness of our approach across various generation templates.


Introduction
Recently, aspect sentiment quad prediction (ASQP) has received extensive attention in the field of aspect-level sentiment analysis. ASQP targets a comprehensive sentiment understanding and extracts four elements of aspect sentiment: 1) the aspect term (at), the concrete aspect description; 2) the opinion term (ot), the specific opinion expression towards the aspect; 3) the aspect category (ac), denoting the aspect class; and 4) the sentiment polarity (sp), indicating the sentiment class of the aspect. For example, given the comment sentence "Service was good and food was wonderful", ASQP aims to recognize two quadruples, one for the service aspect and one for the food aspect.

[Figure 1 example inputs — Inputs-1: "The food is good."; Inputs-2: "Yamato is an excellent place to go."]
Existing works have pointed out two promising research directions. Cai et al. (2021) propose a pipeline-based method that exploits the properties of the four elements and designs a first-extract-then-classify two-stage procedure. Another direction leverages generation-based pre-trained language models: ASQP is addressed in an end-to-end manner by "re-writing" a sentence into a structured target sequence (Zhang et al., 2021b,a; Hu et al., 2022). With pre-defined templates, quadruples can be easily decoded from the target sequence. Due to its simplicity and effectiveness, the second paradigm has gradually become the mainstream in ASQP (Hu et al., 2022).
However, whether designing good templates (Zhang et al., 2021a; Bao et al., 2022) or using data augmentation (Hu et al., 2022), previous generation-based works focus only on what to generate and ignore what not to generate. Learning signals from negative effects are also crucial for accurate extraction. The reason is that ASQP is not a typical text-generation task such as dialog (Liu et al., 2021) or storytelling (Xu et al., 2020b): semantically similar or ambiguous words are harmful for extraction. Two failed cases of pre-trained language models are presented in Figure 1. In the first case, the aspect term "food" is easily confused with "foods". The second case implies that the opinion term "excellent" can be wrongly decoded as "great". Though these words do not obviously change the semantics, they lead to complete mistakes for ASQP. Therefore, how to make language models avoid such errors motivates this work.
In this paper, we propose uncertainty-aware unlikelihood learning (UAUL) to guide likelihood learning (what to generate) and marginalized unlikelihood learning (what not to generate) simultaneously. Concretely, what to generate follows the sequence-to-sequence learning objective: target sequences are constructed with pre-defined templates, providing semantic and syntactic structured information. As for what not to generate, we argue that the noise and errors of the pre-trained model stem from the uncertainty of the model itself. Therefore, we introduce Monte Carlo dropout (MC dropout) (Gal and Ghahramani, 2016) to obtain built-in negative samples of pre-trained models. By randomly dropping out parameters of the decoder's last layer before the language model head, multiple predictions can be attained, which further reveal the inherent errors of language models.
Moreover, with the uncertainty-aware negative samples, we further propose marginalized unlikelihood learning (MUL) to suppress their probability. The marginalization widens the gap between correct and erroneous tokens, making models better distinguish semantically similar or ambiguous words. However, while MUL reduces the probability of noises, it might enlarge the probability of other errors, since the vocabulary set of language models is large. Hence, to balance the influences of MUL, we propose to minimize the entropy of the uncertainty-aware probability distributions.
In summary, the contributions of this paper are as follows:

• We study the generative ASQP task from the view of what not to generate. To the best of our knowledge, this is the first work to study negative samples in this task. We propose uncertainty-aware unlikelihood learning to avoid the intrinsic mistakes of pre-trained language models.

• The model uncertainty is comprehended with MC dropout, and the built-in errors are reduced with the proposed marginalized unlikelihood learning and minimization entropy. Our method is template-agnostic and can be easily applied to various target templates.

• Experimental results on four public datasets, Rest15, Rest16, Restaurant, and Laptop, demonstrate that UAUL is universally effective across various templates.

Formulation and Overview
Given a sentence x, aspect sentiment quad prediction (ASQP) aims to predict all aspect-level quadruplets {(at, ot, ac, sp)}. Following previous generation-based works (Zhang et al., 2021a; Hu et al., 2022), we define projection functions to map a quadruplet (at, ot, ac, sp) into semantic values (x_at, x_ot, x_ac, x_sp). Concretely: 1) if the aspect term at is explicit, x_at = at, otherwise x_at = "it"; 2) if the opinion term ot is explicitly mentioned, x_ot = ot, otherwise it is mapped to "NULL"; 3) the aspect category ac is transformed into words, e.g., x_ac = "food quality" for ac = "food#quality"; 4) the sentiment polarity sp ∈ {positive, neutral, negative} is mapped into words with sentiment semantics, {great, ok, bad}, respectively. Based on the above rules, the values are fed into a template T to form the target sequence. For instance, a template may follow the cause-and-effect semantic relationship "x_ac is x_sp because x_at is x_ot" (Zhang et al., 2021a) or use special markers "[AT] x_at [OT] x_ot [AC] x_ac [SP] x_sp" (Hu et al., 2022). If a sentence contains multiple quadruplets, the templated sequences are concatenated with a special marker [SSEP] to obtain the final target sequence y.
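The projection and templating rules above can be sketched as follows. This is an illustrative re-implementation, not the authors' released code; the function names are ours.

```python
# Map sentiment polarity to sentiment words, as described in the text.
SP_WORDS = {"positive": "great", "neutral": "ok", "negative": "bad"}

def project(quad):
    """Map a raw quadruplet (at, ot, ac, sp) to semantic values (x_at, x_ot, x_ac, x_sp)."""
    at, ot, ac, sp = quad
    x_at = at if at else "it"        # implicit aspect term -> "it"
    x_ot = ot if ot else "NULL"      # implicit opinion term -> "NULL"
    x_ac = ac.replace("#", " ")      # "food#quality" -> "food quality"
    x_sp = SP_WORDS[sp]              # polarity label -> sentiment word
    return x_at, x_ot, x_ac, x_sp

def paraphrase_target(quads):
    """Cause-and-effect template of Zhang et al. (2021a)."""
    parts = ["{2} is {3} because {0} is {1}".format(*project(q)) for q in quads]
    return " [SSEP] ".join(parts)

def special_symbols_target(quads):
    """Special-marker template of Hu et al. (2022)."""
    parts = ["[AT] {0} [OT] {1} [AC] {2} [SP] {3}".format(*project(q)) for q in quads]
    return " [SSEP] ".join(parts)
```

Decoding simply inverts these templates: quadruplets are recovered by splitting the generated sequence on [SSEP] and parsing the template slots.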
As shown in Figure 2, an input sentence is first fed into an encoder-decoder framework; we exploit the pre-trained language model T5 (Raffel et al., 2020). To deal with negative samples, we first acquire multiple uncertainty-aware samples via MC dropout at each decoding time step. These samples are then used to calculate the marginalized unlikelihood learning loss. Finally, to enhance the learning of the target sequence and balance the effects of MUL, we design enhanced likelihood learning for the uncertainty-aware samples. Next, we introduce the components in detail.

Uncertainty-Aware Samples Acquisition
As depicted in Figure 1, semantically similar or ambiguous words lead to complete error predictions

Figure 2: An overview of the proposed uncertainty-aware unlikelihood learning (UAUL). We present the details via an example of the first decoding time step. The beginning token "<s>" yields its next token, i.e., "food", where the output is enhanced into three uncertainty-aware probability distributions {p_t^(i)}. The largest probability in each distribution is highlighted and chosen as a negative sample.
for the ASQP task. The in-depth reason is that language models are pre-trained based on distributional semantics theory (Boleda, 2020), producing alike representations for words that frequently appear in similar contexts, such as "excellent" and "great". When extracting aspect quadruplets, language models are then unsure which one is more accurate. Understanding the inherent uncertainty of language models may thus have potential benefits. To achieve this goal, we re-design the decoder of T5 and adopt MC dropout (Gal and Ghahramani, 2016) to obtain valuable samples.

Uncertainty-Aware Predictions. The target sequence y is fed into the decoder as teacher forcing (Williams and Zipser, 1989) during training. The decoder's inner layers are depicted in the right plot of Figure 2. Here, we use the beginning token "<s>" as an example to illustrate the details of each time step. We obtain a representation for each token with the multiple transformer-based self-attention mechanisms of the decoder layers:

h_t = Enc-DecLayer(x, y_<t),

where h_t is calculated based on the input sequence x and the previous outputs y_<t, and Enc-DecLayer denotes the encoder module together with the decoder layers.
Then, following Vazhentsev et al. (2022), we exploit only the last dropout layer, which is much less computationally expensive than standard MC dropout. Specifically, an uncertainty-aware representation is obtained by sampling a random mask matrix M^(i):

h_t^(i) = M^(i) ⊙ h_t,

where the sampling of M^(i) follows the Bernoulli distribution Bernoulli(1 − p) and p ∈ [0, 1] is the dropout rate. The output is then calculated based on the uncertainty-aware representation:

p_t^(i) = softmax(W h_t^(i)),

where W maps h_t^(i) into a vector over the vocabulary and p_t^(i) indicates the resulting probability distribution. We drop out multiple times and attain multiple output distributions at the t-th time step, {p_t^(1), ..., p_t^(K)}.
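The last-layer MC dropout step can be sketched in numpy as follows. This is an illustrative stand-in, not the paper's implementation: the actual model applies the masks inside T5's decoder, and the mask-only formulation (without the 1/(1−p) inverted-dropout rescaling that deep-learning frameworks usually apply) is our simplifying assumption.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mc_dropout_distributions(h_t, W, p=0.4, K=3, rng=None):
    """Sample K uncertainty-aware output distributions for one time step.

    h_t: (d,) decoder hidden state; W: (V, d) LM-head weights.
    Returns an array of shape (K, V), one distribution per Bernoulli mask.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    dists = []
    for _ in range(K):
        mask = rng.binomial(1, 1.0 - p, size=h_t.shape)  # M^(i) ~ Bernoulli(1 - p)
        h_i = h_t * mask                                 # masked representation h_t^(i)
        dists.append(softmax(W @ h_i))                   # p_t^(i) over the vocabulary
    return np.stack(dists)
```

Each of the K rows is one uncertainty-aware distribution p_t^(i); disagreement among them is what exposes the model's easily-mistaken tokens.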

Samples Acquisition
Based on the multiple uncertainty-aware output probability distributions, we then acquire key samples as described in Algorithm 1. Note that the algorithm displays the acquisition of samples for one time step t. For the positive samples, we concentrate on the probability of the ground-truth token y_t in each distribution. For the negative samples, we choose the largest wrong prediction. In this way, both the positive and negative samples are integrated with the uncertainty of language models (i.e., their varying probabilities). Meanwhile, the negative samples are also the ones that are difficult to distinguish.

Algorithm 1 Samples Acquisition
Input: output probability distributions {p_t^(1), ..., p_t^(K)} at time step t and the ground-truth token y_t. Output: positive sample set P_t and negative sample set N_t.
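In code, Algorithm 1 (as described in the surrounding text; this is our reconstruction, not the authors' implementation) amounts to:

```python
def acquire_samples(dists, y_t):
    """Acquire positive and negative samples for one time step.

    dists: list of dicts mapping token -> probability (one per MC-dropout pass).
    y_t:   the ground-truth token at this time step.
    """
    P_t, N_t = [], []
    for p in dists:
        P_t.append(p.get(y_t, 0.0))    # positive: probability of the correct token
        top_tok = max(p, key=p.get)    # most confident prediction in this distribution
        if top_tok != y_t:             # kept as a negative only when it is a mistake
            N_t.append(p[top_tok])
    return P_t, N_t
```

With the Figure 2 example (three distributions in which "foods" wrongly receives the largest probability twice), this yields P_t = [0.8, 0.2, 0.2] and N_t = [0.7, 0.6].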

Marginalized Unlikelihood Learning
With these chosen key samples, we propose marginalized unlikelihood learning to explicitly control their optimization. As in the example shown in Figure 2, we have three output distributions {p_t^(i)}. The highlighted probability is the largest in its distribution. With Algorithm 1, we obtain P_t = {0.8, 0.2, 0.2} and N_t = {0.7, 0.6}, where P_t and N_t are the sampled positive and negative samples, respectively. These probabilities are further utilized in Eq. (4).
where α is a scale hyperparameter, m is the margin between positive and negative samples, n is the length of the target sequence, and |P_t| denotes the number of samples in P_t.
It is worth noting that the proposed MUL is based on the largest probability in every uncertainty-aware distribution, which is put into N_t when its token is incorrect. The reason is that softmax probabilities tend to be overconfident (Guo et al., 2017), making all probabilities other than the largest one very small. Our method can therefore better select easily-mistaken samples from multiple distributions.
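The exact form of Eq. (4) is not reproduced in this text, so the sketch below is only one plausible margin-based instantiation consistent with the symbols defined above (scale α, margin m, sequence length n, and the sample sets P_t and N_t). Treat the loss form itself as an assumption, not the paper's equation.

```python
def mul_loss(P, N, alpha=1.0, m=0.5):
    """One plausible margin-based unlikelihood loss (assumed form, not the paper's Eq. (4)).

    P, N: per-time-step lists of positive/negative probabilities,
          e.g. P = [P_1, ..., P_n] with P_t a list of floats.
    """
    n = len(P)
    total = 0.0
    for P_t, N_t in zip(P, N):
        if not P_t or not N_t:
            continue  # no negatives at this step -> nothing to suppress
        # Penalize every positive/negative pair whose probability gap is below the margin m.
        pair_losses = [max(0.0, m - (pp - pn)) for pp in P_t for pn in N_t]
        total += sum(pair_losses) / (len(P_t) * len(N_t))
    return alpha * total / n
```

The margin term is what "marginalized" suggests here: pairs already separated by more than m contribute nothing, so optimization concentrates on the hard, easily-mistaken tokens.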

Enhanced Likelihood Learning
Besides dealing with the noise issue, what to generate is still important for obtaining task-specific semantic and structured knowledge. We exploit the original likelihood training to optimize the positive sample probabilities over the multiple uncertainty-aware probability distributions.
where y_t denotes the ground-truth one-hot vector at time step t.
Though MUL reduces the probability of noise, it might enlarge the probability of other errors. Taking Figure 2 as an example, reducing the probability of "foods" may disperse probability mass to other noisy words, such as "feed" or "apple". Thus, the optimization of MUL and likelihood is not fully consistent, which in turn affects likelihood learning. To balance likelihood and MUL, we introduce the minimization entropy (ME) loss term.
By minimizing Eq. (6), p_t^(i) becomes more peaked, that is to say, the noises are suppressed simultaneously. In this way, our approach seeks a balance between MUL and likelihood, ensuring both accurate extraction and negative sample reduction.
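A small sketch of the entropy computation behind ME: lower entropy means a more peaked distribution, which suppresses stray probability mass on noisy tokens. Averaging over the K uncertainty-aware distributions is our assumption about how the term aggregates.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy of a probability vector (eps guards log(0))."""
    p = np.asarray(p, dtype=float)
    return float(-(p * np.log(p + eps)).sum())

def me_loss(dists):
    """Assumed ME term: mean entropy over the K uncertainty-aware distributions."""
    return sum(entropy(p) for p in dists) / len(dists)
```

Minimizing this term drives each p_t^(i) toward a peaked shape; a peaked distribution such as [0.97, 0.01, 0.01, 0.01] has much lower entropy than the uniform [0.25, 0.25, 0.25, 0.25].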

Joint Training Objective
The final training objective jointly combines the above three losses.
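The combined objective itself is not reproduced in this text. Assuming an unweighted sum of the three terms (an assumption — the paper may weight them), it would read:

```latex
\mathcal{L} = \mathcal{L}_{\mathrm{LL}} + \mathcal{L}_{\mathrm{MUL}} + \mathcal{L}_{\mathrm{ME}}
```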

Datasets
To evaluate the proposed approach, we choose four publicly available datasets. The Rest15 and Rest16 datasets are proposed by Zhang et al. (2021a); they are based on previous SemEval tasks (Pontiki et al., 2015, 2016) and expanded with quadruplet annotations. Cai et al. (2021) propose the Restaurant and Laptop datasets. The Restaurant dataset is constructed based on the SemEval 2016 Restaurant dataset (Pontiki et al., 2016) and its expansion datasets (Fan et al., 2019; Xu et al., 2020a). The Laptop dataset is annotated based on data collected on Amazon between 2017 and 2018. The statistics of the datasets are displayed in Table 1.

Compared Methods
We choose the following strong baseline methods and divide them into two types: non-generation and generation.
Non-Generation Baselines: Traditional paradigm designs various stages to extract individual information separately.
• Double-Propagation (Qiu et al., 2011) It is a classical method for triplet extraction. Cai et al. (2021) adapt it to ASQP: all {at, ot, sp} triplets are first extracted using double propagation, and then each triplet is assigned an ac to attain the quad.
• JET (Xu et al., 2020a) It is an end-to-end framework for detecting triplets. Cai et al. (2021) first obtain {at, ot, sp} triplets with JET and then leverage BERT to obtain ac.
• HGCN-BERT+BERT (Zhang et al., 2021a) It is designed for learning syntactic dependencies for ASQP. Its variants include HGCN-BERT+BERT-Linear and HGCN-BERT+BERT-TFM, according to the last layer.

• Extract-Classify-ACOS (Cai et al., 2021) It first extracts aspect-opinion pairs and then classifies category and sentiment, yielding the final quad.
Generation Baselines: Aspect sentiment quadruplets are fed into semantic templates to obtain a target sequence for generation learning.
• GAS (Zhang et al., 2021b) It is the first work to reformulate all ABSA tasks as generation problems and to process all sub-tasks in a unified generation framework.
• Paraphrase (Zhang et al., 2021a) It transforms quadruplet extraction into paraphrase generation through a pre-defined template.
• Special_Symbols (Hu et al., 2022) It distinguishes the type of element in each position by special symbols.
• DLO (Hu et al., 2022) It designs dataset-level data augmentation via template-order permutation. The templates use special symbols.

• ILO (Hu et al., 2022) It designs instance-level data augmentation to find a good template order for each instance. The templates adopt special symbols.

Overall Results
Experimental results are reported in Table 2 and Table 3. Moreover, we also see a few exceptions. For example, on the Laptop dataset, UAUL causes the performances of GAS and ILO to decline slightly. A possible reason is that the Laptop dataset has a larger proportion of implicit information: the template treats an implicit aspect term as "it" and an implicit opinion term as "NULL", and such implicit information makes it hard to understand quadruplets accurately.

Low-Resource Scenario
To further explore the performance of our proposed method in a low-resource environment, we train the model with only subsets of Rest15. The results are reported in Table 4. We can see that for both baseline methods, i.e., Special_Symbols and Paraphrase, UAUL brings consistent improvements at various data scales. In particular, with only 15% of the training data, UAUL improves Special_Symbols and Paraphrase significantly, by +3.47% (+11.22% relatively) and +4.89% (+17.01% relatively) F1, respectively. This verifies that UAUL is especially effective in low-resource scenarios. A rational explanation is that low-resource settings may exacerbate the overfitting of language models to the small-scale data. Mistakes then occur more frequently; they are potentially distributed within the model and caused by the uncertainty of the model itself. Our method helps the models understand these potential errors and addresses them to some extent. Therefore, UAUL is not only template-agnostic but also resource-friendly.

Ablation Study
To validate the effectiveness of individual components, we perform a systematic ablation study based on Special_Symbols+UAUL. The experimental results are presented in Table 5, where the minus "-" denotes removing components and the addition "+" denotes adding components; in particular, -MUL+UL replaces the proposed marginalized unlikelihood learning with the original unlikelihood loss. Firstly, it can be observed that removing any component consistently decreases the performance on the four datasets, which validates the effectiveness of the constituent parts of UAUL. Concretely, removing MUL causes significant performance declines on all datasets, showing that MUL is effective and that telling language models what not to generate makes quadruplet extraction more accurate. Moreover, replacing MUL with the naive UL also leads to performance drops. This further demonstrates that UL alone is not enough: the proposed MUL can widen the gap between correct and easily-mistaken words, which is beneficial for quadruplet prediction.
Secondly, the full model slightly outperforms the variant with ME removed, suggesting that ME is able to enhance likelihood learning and balance its effects against MUL. We also observe that -MUL-ME+UL brings consistent degradation and, in most experimental settings, performs worse than -MUL+UL. This further demonstrates the effectiveness of ME.
Finally, the full model consistently outperforms the variant with MC dropout removed on all four datasets. This observation shows that understanding the uncertainty of language models is important for choosing crucial mistakes; making language models distinguish these easily-mistaken words contributes to ASQP.

Comparison with Other Strategies
We further compare our method with choosing negative samples via the top-k (Fan et al., 2018) and top-p (Holtzman et al., 2020) strategies. These two strategies were originally used in the inference phase of text generation; here we borrow their idea to select negative samples in the training phase. Specifically, except for the ground-truth token, all other samples in the top-k or top-p set are regarded as negative. The evaluation results are depicted in Figure 3. We first see that introducing unlikelihood learning with the top-k or top-p strategy can bring some gains with specific k or p values. This demonstrates that learning negative information is effective for ASQP, yet the gains are very limited. It can then be observed that Paraphrase+UAUL achieves significant improvements, showing the effectiveness of our approach and suggesting that considering the uncertainty of language models successfully chooses more valuable samples.
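For concreteness, the compared selection strategies can be sketched as follows. This is our illustrative reading of the setup, not the authors' code: every non-ground-truth token in the top-k set, or in the smallest nucleus whose cumulative mass reaches p, is treated as a negative sample.

```python
def topk_negatives(probs, y_t, k):
    """probs: dict token -> probability. Return negative tokens in the top-k set."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    return [tok for tok in ranked[:k] if tok != y_t]

def topp_negatives(probs, y_t, p):
    """Negatives from the smallest prefix of ranked tokens whose mass reaches p (nucleus)."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    chosen, mass = [], 0.0
    for tok in ranked:
        chosen.append(tok)
        mass += probs[tok]
        if mass >= p:
            break
    return [tok for tok in chosen if tok != y_t]
```

Unlike the uncertainty-aware acquisition, these strategies read negatives off a single distribution, so they cannot tell genuinely confusable tokens apart from tokens that merely happen to rank highly once.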

Hyperparameter Study
The effects of two hyperparameters are also studied: m, the margin in Eq. (4), and p, the MC dropout rate. The curves are depicted in Figure 4.

Hyperparameter m. It determines the gap extent to learn from negative samples. Fixing the dropout rate to 0.4 and keeping all other parameters the same, we vary m from -1.0 to 0.2. In the left plot of Figure 4, we find that for most values, Special_Symbols+UAUL outperforms the original model, showing that this hyperparameter is robust to some extent. Setting m too small or too large decreases performance: if the gap extent is too large, it probably causes overfitting, while if it is too small, the influence of the uncertainty-aware negative samples is limited.

Dropout Rate p. It determines the proportion of neural connections to drop. Different values of p lead to different extents of uncertainty for the language model. As shown in the right plot of Figure 4, keeping p within 0.1 to 0.5 yields good results. However, if p is set to a large value, the scale of the weights being optimized is limited, which affects likelihood training and in turn causes performance degradation.

Error Analysis
To better understand the limitations of UAUL, we choose a typical method, Paraphrase+UAUL, and conduct an error analysis. Two failed cases are shown in Figure 5.
The first sentence implicitly describes the aspect term: the user is "much happier with" something other than an "apple product". Thus the ground-truth aspect term is "it", yet our approach predicts "apple product". Similarly, the second case expresses a negative opinion towards "screen" since it has "a dead pixel". The opinion term is also implicit, but our approach wrongly predicts the adjective "dead". In summary, an aspect/opinion term may be described implicitly, which requires deep semantic understanding. Though UAUL achieves consistent performance improvements for various generation-based methods, it struggles to deal with implicit information.

Related Work
Aspect-Based Sentiment Analysis (ABSA). Early studies of ABSA stay at the level of individual elements, such as extracting aspect terms (Xu et al., 2018), detecting aspect categories (Bu et al., 2021; Brauwers and Frasincar, 2022), and predicting the sentiment polarity given an aspect term (Huang and Carley, 2018) or an aspect category (Hu et al., 2019). Subsequently, researchers (Schouten and Frasincar, 2015; Zhang et al., 2022) pay attention to the dependencies of multiple elements and recognize them simultaneously. Peng et al. (2020) focus on the triplet of aspect, opinion, and sentiment. Recently, ASQP has drawn much attention, dealing with all the elements, i.e., aspect sentiment quadruplets. To address ASQP, the pipeline method (Cai et al., 2021) and generation-based methods (Zhang et al., 2021b,a) are proposed. Due to its simplicity and end-to-end manner, the generation paradigm has become the main research direction. Promising works design novel approaches based on tree structures (Mao et al., 2022; Bao et al., 2022), contrastive learning (Peper and Wang, 2022), and data augmentation (Hu et al., 2022). Different from the above works, we study ASQP from the perspective of what not to generate and design novel uncertainty-aware unlikelihood learning for the ASQP task.

Unlikelihood Learning. It was originally proposed in the field of neural text generation (Welleck et al., 2020) to deal with the generation repetition problem: the words that have already been decoded are recorded, and their probabilities are suppressed in the following decoding time steps. Li et al. (2020) introduce the unlikelihood loss into dialog generation to address the utterance repetition, frequent words, and logical flaw issues. Song et al. (2021) leverage unlikelihood training to improve the understanding of character consistency in persona-based dialogue. In this work, semantically similar or ambiguous tokens are the negative information for ASQP. We acquire them via the inherent uncertainty of language models and propose a novel marginalized unlikelihood learning to deal with the negative samples.

[Figure 5 examples — Inputs-1: "was considering an apple product but i'm much happier with this!"; Label-1: (it, happier, laptop general, positive); Pred-1: (apple product, happier, laptop general, positive). Inputs-2: "i then noticed a dead pixel on the screen."; Label-2: (screen, NULL, display quality, negative); Pred-2: (screen, dead, display quality, negative).]

Limitations

Firstly, implicit information is still challenging for UAUL. The failed cases in the error analysis (§3.3.6) demonstrate that tough cases require in-depth semantic understanding. Though UAUL achieves wide improvements in the generation paradigm, it struggles to deal with implicit cases.

Secondly, in this work we only design token-level marginalized unlikelihood learning. Since aspect sentiment quadruplets contain four types of information, considering span-level and whole-sequence-level negative sample learning may attain further gains.
Thirdly, UAUL increases the training time, as shown in Table 8. Although we optimize the implementation with parallel computation and adopt MC dropout only in the last dropout layer, the training time is still significantly enlarged. Nevertheless, our method does not require additional human labor, which is an obvious advantage in real applications.


B.1 Additional Hyperparameter Study
Hyperparameter studies on Restaurant and Rest16 are depicted in Figure 7 and Figure 8. On Restaurant, our method outperforms Special_Symbols for various values of m, and on Rest16 it also outperforms Special_Symbols in most cases. This demonstrates that the hyperparameter m is fairly robust. For the dropout rate p, keeping p within 0.1 to 0.5 gives the best results, while setting p larger than 0.7 causes performance degradation. A possible explanation is that dropping too large a proportion of neural connections reduces the proportion of learnable parameters.

B.2 Training Time Analysis
The average running time of each model is shown in Table 8. We observe that on the five generation-based methods, UAUL consistently requires more training time. Even though UAUL already uses parallel computation and last-layer-only MC dropout in the training phase, the training time is still considerably enlarged. Admittedly, the time overhead is a limitation of our approach, but our method does not require additional human labor, which is very beneficial in practical applications.

Figure 1 :
Figure 1: Two predicted error cases. Pred denotes the prediction. The results of Label and Pred are shown in the order (at, ot, ac, sp), and the highlighted parts are the predicted error items.

Figure 3 :
Figure 3: Evaluation results of other strategies on Rest15.Para denotes the Paraphrase method.

Figure 5 :
Figure 5: Two error cases predicted by Paraphrase+UAUL from the test set of the Laptop dataset.

Table 1 :
Data statistics. #S and #Q denote the number of sentences and quadruplets, respectively.
Restaurant and Laptop datasets, respectively. Similarly, Special_Symbols+UAUL achieves consistent improvements on all datasets, performing the best on the Rest15 dataset. Compared with DLO, DLO+UAUL also performs consistently better, achieving the best F1 scores of 60.50% and 60.78% on the Rest16 and Restaurant datasets, respectively. These results demonstrate that the proposed UAUL can be easily applied to various templates with universal effectiveness.

Table 4 :
Evaluation results of the low-resource scenario in terms of F1 (%). Ratio indicates the proportion of the Rest15 dataset's training data used. SS and Para are short for Special_Symbols and Paraphrase. ∆ denotes the absolute improvement.

Table 5
Input sentence: "The food is good."
Quadruplet (at, ot, ac, sp): (food, good, food#quality, positive)
Semantic mapping (x_at, x_ot, x_ac, x_sp): (food, good, food quality, great)
GAS template: (at, ot, ac, sp)
Paraphrase template: "x_ac is x_sp because x_at is x_ot"; target sequence: "food quality is great because food is good"
Special_Symbols template: "[AT] x_at [OT] x_ot [AC] x_ac [SP] x_sp"; target sequence: "[AT] food [OT] good [AC] food quality [SP] great"
Figure 6: Template details of the various methods.

Table 8 :
Average running time of each model.