A Unified One-Step Solution for Aspect Sentiment Quad Prediction

Aspect sentiment quad prediction (ASQP) is a challenging yet significant subtask in aspect-based sentiment analysis, as it provides a complete aspect-level sentiment structure. However, existing ASQP datasets are usually small and low-density, hindering technical advancement. To expand capacity, in this paper we release two new datasets for ASQP, which have the following characteristics: larger size, more words per sample, and higher density. With these datasets, we unveil the shortcomings of existing strong ASQP baselines and accordingly propose a unified one-step solution for ASQP, namely One-ASQP, to detect the aspect categories and identify the aspect-opinion-sentiment (AOS) triplets simultaneously. Our One-ASQP holds several unique advantages: (1) by separating ASQP into two subtasks and solving them independently and simultaneously, we avoid the error propagation of pipeline-based methods and the slow training and inference of generation-based methods; (2) by introducing a sentiment-specific horns tagging schema in a token-pair-based two-dimensional matrix, we can exploit deeper interactions between sentiment elements and efficiently decode the AOS triplets; (3) the designed ``[NULL]'' token helps us effectively identify implicit aspects or opinions. Experiments on two benchmark datasets and our two released datasets demonstrate the advantages of One-ASQP. The two new datasets are publicly released at \url{https://www.github.com/Datastory-CN/ASQP-Datasets}.


Introduction
Aspect-based sentiment analysis (ABSA) is a critical fine-grained opinion mining or sentiment analysis problem that aims to analyze and understand people's opinions or sentiments at the aspect level (Liu, 2012; Pontiki et al., 2014; Zhang et al., 2022).

Table 1: The outputs of an example, "touch screen is not sensitive", for various ABSA tasks. a, c, o, s, and NEG are defined in the first paragraph of Sec. 1.

Typically, there are four fundamental
sentiment elements in ABSA: (1) aspect category (c) defines the type of the concerned aspect; (2) aspect term (a) denotes the opinion target which is explicitly or implicitly mentioned in the given text; (3) opinion term (o) describes the sentiment towards the aspect; and (4) sentiment polarity (s) depicts the sentiment orientation. For example, given an opinionated sentence, "touch screen is not sensitive," we can obtain its (c, a, o, s)-quadruple as ("Screen#Sensitivity", "touch screen", "not sensitive", NEG), where NEG indicates the negative sentiment polarity. Due to the wide range of applications, numerous research efforts have been made on ABSA to predict or extract fine-grained sentiment elements (Jiao et al., 2019; Pontiki et al., 2014, 2015, 2016; Zhang et al., 2022; Yang et al., 2021). Based on the number of sentiment elements to be extracted, existing studies can be categorized into the following tasks: (1) single term extraction, including aspect term extraction (ATE) (Li and Lam, 2017; He et al., 2017) and aspect category detection (ACD) (He et al., 2017; Liu et al., 2021); (2) pair extraction, including aspect-opinion pairwise extraction (AOPE) (Yu et al., 2019; Wu et al., 2020), aspect-category sentiment analysis (ACSA) (Cai et al., 2020; Dai et al., 2020), and End-to-End ABSA (E2E-ABSA) (Li et al., 2019b; He et al., 2019), which extracts the aspect and its sentiment; (3) triplet extraction, including aspect-sentiment triplet extraction (ASTE) (Mukherjee et al., 2021; Chen et al., 2021) and Target Aspect Sentiment Detection (TASD) (Wan et al., 2020); (4) quadruple extraction, including aspect-category-opinion-sentiment (ACOS) quadruple extraction (Cai et al., 2021) and aspect sentiment quad prediction (ASQP) (Zhang et al., 2021a). ACOS and ASQP are the same task: both aim to extract all aspect-category-opinion-sentiment quadruples per sample. Since ASQP covers the whole task name, we use ASQP to denote the ABSA quadruple extraction task. Table
1 summarizes an example of the outputs of various ABSA tasks. This paper focuses on ASQP because it provides a complete aspect-level sentiment analysis (Zhang et al., 2022). We first observe that existing ASQP datasets are crawled from only one source and are small with low density (Cai et al., 2021; Zhang et al., 2021a). For example, the maximum sample size is around 4,000, while the maximum number of quadruples per sample is around 1.6. This limits the technical development of ASQP. Second, ASQP includes two extraction subtasks (aspect extraction and opinion extraction) and two classification subtasks (category classification and sentiment classification). Modeling the four subtasks simultaneously is challenging, especially when the quadruples contain implicit aspects or opinions (Cai et al., 2021). Though existing studies can resolve ASQP via pipeline-based (Cai et al., 2021) or generation-based methods (Zhang et al., 2021a; Mao et al., 2022; Bao et al., 2022; Gao et al., 2022), they suffer from different shortcomings: pipeline-based methods tend to yield error propagation, while generation-based methods are slow in training and inference.
To tackle the above challenges, we first construct two datasets, en-Phone and zh-FoodBeverage, to expand the capacity of ASQP datasets. en-Phone is an English ASQP dataset in the cell phone domain collected from several e-commerce platforms, while zh-FoodBeverage is the first Chinese ASQP dataset, collected from multiple sources under the categories of Food and Beverage. Compared to the existing ASQP datasets, ours have 1.75 to 4.19 times more samples and a higher quadruple density of 1.3 to 1.8. This is a result of our meticulous definition of, and adherence to, annotation guidelines, which allow us to obtain more fine-grained quadruples.
After investigating strong ASQP baselines, we observed a decline in their performance on our newly released datasets. This finding, coupled with the shortcomings of the existing baselines, motivated us to develop a novel one-step solution for ASQP, namely One-ASQP. As illustrated in Fig. 1, our One-ASQP adopts a shared encoder from a pre-trained language model (LM) and resolves two tasks simultaneously: aspect category detection (ACD) and aspect-opinion-sentiment co-extraction (AOSC). ACD is implemented by a multi-class classifier, and AOSC is fulfilled by a token-pair-based two-dimensional (2D) matrix with the sentiment-specific horns tagging schema, a popular technique borrowed from joint entity and relation extraction (Wang et al., 2020; Shang et al., 2022). The two tasks are trained independently and simultaneously, allowing us to avoid error propagation and to overcome the slow training and inference of generation-based methods. Moreover, we design a unique token, "[NULL]", appended at the beginning of the input, which helps us identify implicit aspects or opinions effectively.
Our contributions are three-fold: (1) We construct two new ASQP datasets consisting of more fine-grained samples with higher quadruple density while covering more domains and languages. Notably, the released zh-FoodBeverage dataset is the first Chinese ASQP dataset, which provides opportunities to investigate potential technologies in a multi-lingual context for ASQP. (2) We propose One-ASQP to simultaneously detect aspect categories and co-extract aspect-opinion-sentiment triplets. One-ASQP can absorb deeper interactions between sentiment elements without error propagation and avoids the slow performance of generation-based methods. Moreover, the delicately designed "[NULL]" token helps us identify implicit aspects or opinions effectively. (3) We conduct extensive experiments demonstrating that One-ASQP is efficient and outperforms the state-of-the-art baselines in certain scenarios.

Datasets
We construct two datasets to expand the capacity of existing ASQP datasets.

Annotation
A team of professional labelers is asked to label the texts following the guidelines in Appendix A.2. Two annotators individually annotate the same sample using our internal labeling system. The strict quadruple matching F1 score between the two annotators is 77.23%, which implies substantial agreement (Kim and Klinger, 2018). In case of disagreement, the project leader is asked to make the final decision. Some typical examples are shown in Table 10.

ASQP Formulation
Given an opinionated sentence x, ASQP aims to predict all aspect-level sentiment quadruples {(c, a, o, s)}, which correspond to the aspect category, aspect term, opinion term, and sentiment polarity, respectively. The aspect category c belongs to a category set C; the aspect term a and the opinion term o are typically text spans in x, while they can be null if the target is not explicitly mentioned, i.e., a ∈ V_x ∪ {∅} and o ∈ V_x ∪ {∅}, where V_x denotes the set containing all possible continuous spans of x. The sentiment polarity s belongs to one of the sentiment classes, SENTIMENT = {POS, NEU, NEG}, which correspond to positive, neutral, and negative sentiment, respectively.
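The quadruple structure above can be sketched as a small data type. A minimal illustration in Python; the field names are our own choosing, since the paper does not prescribe an implementation:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class SentimentQuad:
    category: str            # aspect category c, drawn from a fixed set C
    aspect: Optional[str]    # aspect term a; None when implicit
    opinion: Optional[str]   # opinion term o; None when implicit
    sentiment: str           # one of {"POS", "NEU", "NEG"}

# The running example "touch screen is not sensitive":
quad = SentimentQuad("Screen#Sensitivity", "touch screen", "not sensitive", "NEG")
```
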

One-ASQP
Our One-ASQP resolves two subtasks, ACD and AOSC, simultaneously, where ACD seeks a classifier to determine the aspect categories, and AOSC extracts all (a, o, s)-triplets. Given x with n tokens, we construct the input as

x̃ = ([NULL], x_1, ..., x_n),    (1)

where the token [NULL] is introduced to detect implicit aspects or opinions; see more details in Sec. 3.2.2. Now, via a pre-trained LM, both tasks share a common encoder to get the token representations

H = LM(x̃) = (h_0, h_1, ..., h_n) ∈ R^{(n+1)×d},    (2)

where d is the token representation size.

Aspect Category Detection
We apply a classifier to predict the probability of category detection:

C = softmax(H W_c⊤ + b_c),    (3)

where W_c ∈ R^{|C|×d} and b_c ∈ R^{|C|} are the classifier's weight matrix and bias, and |C| is the number of categories in C. Hence, C ∈ R^{(n+1)×|C|}, where C_ij indicates the probability of the i-th token belonging to the j-th category.
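As a toy illustration of the ACD head described above (not the paper's exact implementation, with made-up weights and dimensions), each token representation is projected to |C| category scores and normalized:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def acd_head(H, W, b):
    """H: (n+1) x d token vectors; W: |C| x d weights; b: |C| biases.
    Returns per-token category distributions, shape (n+1) x |C|."""
    probs = []
    for h in H:
        scores = [sum(w_i * h_i for w_i, h_i in zip(w_row, h)) + b_j
                  for w_row, b_j in zip(W, b)]
        probs.append(softmax(scores))
    return probs

H = [[0.1, 0.2], [0.3, -0.1]]             # 2 tokens, d = 2 (toy values)
W = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # |C| = 3 toy categories
b = [0.0, 0.0, 0.0]
P = acd_head(H, W, b)
```
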

AOSC
We tackle AOSC via a token-pair-based 2D matrix with the sentiment-specific horns tagging schema to determine the positions of aspect-opinion pairs and their sentiment polarity.
Tagging We define four types of tags: (1) AB-OB denotes the cell at the beginning position of an aspect-opinion pair. For example, as ("touch screen", "not sensitive") is an aspect-opinion pair, the cell corresponding to ("touch", "not") in the 2D matrix is marked "AB-OB".
(2) AE-OE indicates the cell at the end position of an aspect-opinion pair. Hence, the cell of ("screen", "sensitive") is marked "AE-OE".
(3) AB-OE-*SENTIMENT defines a cell with its sentiment polarity, where the row position denotes the beginning of an aspect and the column position denotes the end of an opinion. Hence, the cell of ("touch", "sensitive") is tagged "AB-OE-NEG". As SENTIMENT consists of three types of sentiment polarity, there are three cases of AB-OE-*SENTIMENT.
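The tag types can be illustrated by tagging the running example by hand. A minimal sketch, where rows index aspect tokens, columns index opinion tokens, and untagged cells carry the default '-':

```python
tokens = ["touch", "screen", "is", "not", "sensitive"]
n = len(tokens)
matrix = [["-"] * n for _ in range(n)]

# aspect "touch screen" = tokens[0:2]; opinion "not sensitive" = tokens[3:5]
a_beg, a_end, o_beg, o_end, senti = 0, 1, 3, 4, "NEG"
matrix[a_beg][o_beg] = "AB-OB"            # ("touch", "not"): pair beginning
matrix[a_end][o_end] = "AE-OE"            # ("screen", "sensitive"): pair end
matrix[a_beg][o_end] = f"AB-OE-{senti}"   # ("touch", "sensitive"): sentiment
```
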
Triplet Decoding Since the tagged 2D matrix marks the boundary tokens of all aspect-opinion pairs and their sentiment polarity, we can decode the triplets easily. First, by scanning the 2D matrix column by column, we can determine the text span of an aspect, which starts with "AB-OE-*SENTIMENT" and ends with "AE-OE". Similarly, by scanning the 2D matrix row by row, we can get the text span of an opinion, which starts with "AB-OB" and ends with "AB-OE-*SENTIMENT". Finally, the sentiment polarity is determined directly by "AB-OE-*SENTIMENT".
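The decoding rules above can be sketched as follows, assuming a well-formed tag matrix; the helper name is ours, not the paper's:

```python
def decode_triplets(matrix, tokens):
    """Decode (aspect, opinion, sentiment) triples from a tagged 2D matrix."""
    triples = []
    n = len(tokens)
    for i in range(n):
        for j in range(n):
            if not matrix[i][j].startswith("AB-OE-"):
                continue
            senti = matrix[i][j].rsplit("-", 1)[1]
            # aspect span: walk down column j until the AE-OE cell
            a_end = next(r for r in range(i, n) if matrix[r][j] == "AE-OE")
            # opinion span: walk left along row i until the AB-OB cell
            o_beg = next(c for c in range(j, -1, -1) if matrix[i][c] == "AB-OB")
            triples.append((" ".join(tokens[i:a_end + 1]),
                            " ".join(tokens[o_beg:j + 1]), senti))
    return triples

toks = ["touch", "screen", "is", "not", "sensitive"]
m = [["-"] * 5 for _ in range(5)]
m[0][3], m[1][4], m[0][4] = "AB-OB", "AE-OE", "AB-OE-NEG"
result = decode_triplets(m, toks)
```
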
Implicit Aspects/Opinions Extraction Detecting implicit aspects or opinions is critical in ASQP (Cai et al., 2021). Here, we prepend the "[NULL]" token to the input.
Our One-ASQP can then easily handle the cases of Implicit Aspect and Explicit Opinion (IA&EO) and Explicit Aspect and Implicit Opinion (EA&IO). The procedure is similar to the above triplet decoding: when a text span in the row of "[NULL]" starts with "AB-OB" and ends with "AB-OE-*SENTIMENT", we obtain an explicit opinion without an aspect. Meanwhile, when a text span in the column of "[NULL]" starts with "AB-OE-*SENTIMENT" and ends with "AE-OE", we obtain an explicit aspect without an opinion.
As shown in Fig. 1, we can quickly obtain the corresponding aspect-opinion pairs as "(NULL, very speedy)" and "(express package, NULL)". The sentiment polarity can also be determined by "AB-OE-*SENTIMENT" accordingly. Although the current setting cannot handle IA&IO directly, it is possible to resolve it in two steps: first, identify IA&IO using tools such as Extract-Classify-ACOS (Cai et al., 2021); then, classify aspect categories and sentiment polarity. A unified solution within One-ASQP is left for future work.
Tagging Score Given H, we compute the probabilities of the (i, j)-th cell over the tag set by

h^a_i = W_a h_i + b_a,    (4)
h^o_j = W_o h_j + b_o,    (5)
P_ij = softmax(f(h^a_i, h^o_j)),    (6)

where W_a ∈ R^{D×d} and W_o ∈ R^{D×d} are the weight matrices for the aspect token and the opinion token, respectively; b_a ∈ R^D and b_o ∈ R^D are the corresponding biases; f(·,·) scores the interaction between the paired projections; and D is the hidden size, set to 400 by default.

Training Procedure
Training We train ACD and AOSC jointly by minimizing the following loss function:

L = α L_ACD + β L_AOSC,    (7)

where α and β are trade-off constants, both set to 1 for simplicity. The ACD loss L_ACD and the AOSC loss L_AOSC are two cross-entropy losses defined as

L_ACD = − Σ_{i,j} y^C_ij log C_ij,    (8)
L_AOSC = − Σ_{i,j} Y_ij⊤ log P_ij,    (9)

where C_ij is the predicted category probability computed by Eq. (3); y^C_ij ∈ {0, 1} is 1 when the i-th token is assigned to the j-th category and 0 otherwise; P_ij is the predicted tagging distribution computed by Eq. (6) over all five types of tags; and Y_ij ∈ R^5 is the ground-truth one-hot encoding.
During training, we adopt the negative sampling strategy of Li et al. (2021) to improve the performance of our One-ASQP on unlabeled quadruples. We set the negative sampling rate to 0.4, within the range suggested by Li et al. (2021) that has yielded good results. Specifically, to minimize the loss in Eq. (7), we randomly sample 40% of the unlabeled entries as negative instances, which correspond to '0' in ACD and '-' in AOSC, as shown in Fig. 1.
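The negative sampling step can be sketched as follows; a minimal illustration on a toy matrix, assuming the 40% rate applies to the unlabeled AOSC cells (the ACD case is analogous):

```python
import random

def sample_negative_cells(matrix, rate=0.4, seed=0):
    """Pick a fraction `rate` of the unlabeled ('-') cells to serve as
    negative instances in the loss; all other unlabeled cells are ignored."""
    rng = random.Random(seed)
    unlabeled = [(i, j) for i, row in enumerate(matrix)
                 for j, tag in enumerate(row) if tag == "-"]
    k = int(rate * len(unlabeled))
    return set(rng.sample(unlabeled, k))

m = [["-"] * 5 for _ in range(5)]                      # 25 cells
m[0][3], m[1][4], m[0][4] = "AB-OB", "AE-OE", "AB-OE-NEG"  # 3 labeled
neg = sample_negative_cells(m)                         # 40% of the 22 unlabeled
```
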

Quadruples Decoding
After training the model, we obtain the category sequences from ACD and the AOS triplets from the AOSC matrix simultaneously. We then decode the quadruples in one step via their common terms. For example, as shown in Fig. 1, we can merge (Logistics#Speed, express package) and (express package, NULL, POS) via the common aspect term, "express package", to obtain the quadruple (Logistics#Speed, express package, NULL, POS).
Overall, our One-ASQP consists of two independent tasks, ACD and AOSC. Their outputs are shared only in the final decoding stage and do not rely on each other during training, as pipeline-based methods do. This allows us to train the model efficiently and decode the results consistently in both training and testing.
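The one-step merge can be sketched as a join on the shared aspect term; a minimal illustration with helper names of our own:

```python
def merge_quadruples(category_pairs, aos_triples):
    """Join ACD output (category, aspect) with AOSC output
    (aspect, opinion, sentiment) on the common aspect term."""
    by_aspect = {}
    for category, aspect in category_pairs:
        by_aspect.setdefault(aspect, []).append(category)
    quads = []
    for aspect, opinion, senti in aos_triples:
        for category in by_aspect.get(aspect, []):
            quads.append((category, aspect, opinion, senti))
    return quads

# The example from Fig. 1:
pairs = [("Logistics#Speed", "express package")]
triples = [("express package", "NULL", "POS")]
quads = merge_quadruples(pairs, triples)
```
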

Experimental Setup
Datasets We conduct the experiments on the four datasets in Table 2. For Restaurant-ACOS and Laptop-ACOS, we apply the original splits for the training, validation, and test sets (Cai et al., 2021). For en-Phone and zh-FoodBeverage, the splitting ratio is 7:1.5:1.5 for training, validation, and test, respectively.
Evaluation Metrics We employ the F1 score as the main evaluation metric and also report the corresponding Precision and Recall scores. A sentiment quad prediction is counted as correct if and only if all the predicted elements are exactly the same as the gold labels. The time cost is also recorded to demonstrate the efficiency of One-ASQP.
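The exact-match metric can be sketched as set intersection over predicted and gold quadruples; a minimal single-example illustration (in practice the counts are accumulated over the whole test set before computing the scores):

```python
def quad_prf(pred_quads, gold_quads):
    """Exact-match precision, recall, and F1 over quadruple sets:
    a prediction counts only if all four elements match a gold quad."""
    pred, gold = set(pred_quads), set(gold_quads)
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = {("Screen#Sensitivity", "touch screen", "not sensitive", "NEG")}
pred = {("Screen#Sensitivity", "touch screen", "not sensitive", "NEG"),
        ("Screen#General", "screen", "not sensitive", "NEG")}
p, r, f1 = quad_prf(pred, gold)
```
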
Implementation Details One-ASQP is implemented in PyTorch 1.13.1. All experiments are run on a workstation with an Intel Xeon E5-2678 v3 @ 2.50GHz CPU, 256G memory, a single A5000 GPU, and Ubuntu 20.04. For the English datasets, we adopt the LMs DeBERTaV3-base and DeBERTaV3-large (He et al., 2021), which contain 12 layers with a hidden size of 768 and 24 layers with a hidden size of 1024, respectively. For the Chinese dataset, we adopt MacBERT (Cui et al., 2020), a Chinese LM with the same structure as DeBERTaV3. For the English datasets, the maximum token length is set to 128, as the maximum average number of words per sample is only 25.78, as shown in Table 2. For zh-FoodBeverage, the maximum token length is 256. The batch size and learning rate for all experiments are 32 and 3e-5, respectively, as they perform well. We monitor the F1 score on the validation set and terminate training when the score does not improve for four epochs. Finally, we report the scores on the test set using the best model on the validation set.
Baselines We compare our One-ASQP with strong baselines: (1) pipeline-based methods consist of four methods, i.e., DP-ACOS, JET-ACOS, TAS-BERT-ACOS, and Extract-Classify-ACOS.

Table 3 reports the comparison results on the two existing ASQP datasets. Since all methods apply the same splits on these two datasets, we copy the results of the baselines from the corresponding references. The results show that: (1) Generation-based methods gain significant improvements over pipeline-based methods, as pipeline-based methods tend to propagate errors. (2) Among generation-based methods, OTG attains the best F1 score. Its exceptional performance may come from integrating various features, e.g., syntactic and semantic information, to form the opinion tree structure (Bao et al., 2022). (3) Our One-ASQP is competitive with generation-based methods. By checking the LM sizes, we know that the generation-based baselines except BARTABSA apply T5-base as the LM, which consists of 220M parameters. In comparison, our One-ASQP utilizes DeBERTaV3, which consists of only 86M and 304M backbone parameters for its base and large versions, respectively. The compact model size is a crucial advantage of our approach. However, on the Restaurant-ACOS and Laptop-ACOS datasets, One-ASQP falls slightly behind some generation-based methods, which can take advantage of the semantics of sentiment elements by generating natural language labels. In contrast, One-ASQP maps each label to a specific symbol, similar to the numerical indexing in classification models. Unfortunately, the limited size of these datasets prevents our One-ASQP model from achieving optimal performance.

We further conduct experiments on en-Phone and zh-FoodBeverage and compare our One-ASQP with three strong baselines: Extract-Classify-ACOS, Paraphrase, and GEN-SCL-NAT. We select them because Extract-Classify-ACOS is the best pipeline-based method, while Paraphrase and GEN-SCL-NAT are two strong generation-based baselines with released source code, which eases replication. Results in Table 4 are averaged over five runs with different random seeds and show that our One-ASQP, even when adopting the base LM version, outperforms the three strong baselines. We conjecture that the good performance comes from two reasons: (1) The newly released datasets contain higher quadruple density and more fine-grained sentiment quadruples. This increases the task difficulty and amplifies the order issue in generation-based methods (Mao et al., 2022), i.e., the orders between the generated quads do not naturally exist, and the generation of the current quads should not be conditioned on the previous ones. More evaluation tests are provided in Sec. 5.4. (2) The number of categories in the new datasets is much larger than in Restaurant-ACOS and Laptop-ACOS. This also increases the search space, which tends to yield generation bias, i.e., generated tokens that neither come from the original text nor from the pre-defined categories and sentiments. Overall, the results demonstrate the significance of our released datasets for further technical development.

Table 5 reports the time cost (in seconds) of training for one epoch and of inference on Restaurant-ACOS and en-Phone; more results are in Appendix B.1. The results show that our One-ASQP is much more efficient than the strong baselines, as Extract-Classify-ACOS needs to encode twice and Paraphrase can only decode tokens sequentially. To provide a fair comparison, we set the batch size to 1 and show the inference time in round brackets. Overall, our One-ASQP is more efficient than the baselines; it can infer the quadruples in parallel, which is favorable for real-world deployment.

Effect of Handling Implicit Aspects/Opinions
Table 6 reports the breakdown performance of the methods in addressing the implicit aspect/opinion problem. The results show that (1) the generation-based baseline, GEN-SCL-NAT, handles EA&IO better than our One-ASQP when the quadruple density is low. Conversely, One-ASQP performs much better than GEN-SCL-NAT on IA&EO in Restaurant-ACOS. GEN-SCL-NAT may perform worse on IA&EO because the decoding space for generating explicit opinions is huge compared to that for explicit aspects. (2) On en-Phone and zh-FoodBeverage, One-ASQP consistently outperforms all baselines on EA&EO and EA&IO. Our One-ASQP is superior in handling implicit opinions when the datasets are more fine-grained.

Ablation Study on ACD and AOSC
To demonstrate the beneficial effect of sharing the encoder for the ACD and AOSC tasks, we train the two tasks separately, i.e., setting (α, β) in Eq. (7) to (1.0, 0.0) and (0.0, 1.0). The results in Table 7 show that our One-ASQP absorbs deeper information between the two tasks and attains better performance. By sharing the encoder and conducting joint training, the connection between the category and the other sentiment elements becomes more tightly integrated, so the tasks contribute to each other.

Effect of Different Quadruple Densities
We conduct additional experiments to test the effect of different quadruple densities. Specifically, we keep only the samples with exactly one quadruple in en-Phone and zh-FoodBeverage, constructing two lower-density datasets, en-Phone (one) and zh-FoodBeverage (one). We then obtain 1,528 and 3,834 samples in these two datasets, respectively, which are around one-fifth and two-fifths of the original datasets.
We only report the results of our One-ASQP with the base versions of the corresponding LMs, and of Paraphrase. Results in Table 8 show some notable observations: (1) Paraphrase attains better performance on en-Phone (one) than our One-ASQP. It seems that generation-based methods are powerful in low-resource scenarios. However, their performance decays on the full datasets due to the generation order issue. (2) Our One-ASQP significantly outperforms Paraphrase on zh-FoodBeverage in both cases. The results show that our One-ASQP needs sufficient training samples to perform well. However, in zh-FoodBeverage (one), the number of labeled quadruples is 3,834; such an annotation effort is light for real-world applications.

Error Analysis and Case Study
To better understand the characteristics of our One-ASQP, especially when it fails, we conduct an error analysis and case study in this section. We check the incorrect quad predictions on all datasets and show one typical error example of each type from Laptop-ACOS in Fig. 2, where we report the percentage of errors for better illustration. The results show that (1) in general, extracting aspects and opinions tends to introduce larger errors than classifying categories and sentiments. Aspects and opinions have more complex semantic definitions than categories and sentiments, and extracting implicit cases further increases the difficulty of these tasks.
(2) There is a significant category error rate in Laptop-ACOS, likely due to an imbalance issue: there are 121 categories with relatively few samples per category. For example, 35 categories have fewer than two quadruples.
(3) The percentage of opinion errors is higher than that of aspect errors in all datasets, because opinions vary more than aspects and there are implicit opinions in the new datasets. This is reflected in the numbers of opinion errors in en-Phone and zh-FoodBeverage, which are 125 (37.31%) and 395 (49.94%), respectively, exceeding the corresponding aspect errors of 99 (29.55%) and 246 (31.10%).
Removing samples with implicit opinions reduces the opinion errors to 102 and 260 in en-Phone and zh-FoodBeverage, respectively, indicating that explicit opinion errors are slightly more numerous than explicit aspect errors.
(4) The percentage of sentiment errors is relatively small, demonstrating the effectiveness of our proposed sentiment-specific horns tagging schema.

Related Work
ABSA Benchmark Datasets are mainly provided by the SemEval'14-16 shared challenges (Pontiki et al., 2014, 2015, 2016). The initial task was only to identify opinions expressed about specific entities and their aspects. In order to investigate more tasks, such as AOPE, E2E-ABSA, ASTE, TASD, and ASQP, researchers have re-annotated the datasets and constructed some new ones (Fan et al., 2019; Li  (2) the data size is usually small, where the maximum is only around 4,000; (3) there is only around one labeled quadruple per sentence and many samples share a common aspect, which makes the task easier; (4) the available public datasets are all in English. The shortcomings of existing benchmark datasets motivate us to crawl and curate more data from more domains, covering more languages and with higher quadruple density.

ASQP aims to predict the four sentiment elements to provide a complete aspect-level sentiment structure (Cai et al., 2021; Zhang et al., 2021a). The task has been extended to several variants, e.g., capturing the quadruple of holder-target-expression-polarity (R et al., 2022; Lu et al., 2022) or the quadruple of target-aspect-opinion-sentiment in a dialogue (Li et al., 2022). Existing studies can be divided into the pipeline or generation paradigm. A typical pipeline-based work (Cai et al., 2021) has investigated different techniques to solve the subtasks accordingly. It consists of (1) first exploiting double propagation (DP) (Qiu et al., 2011) or JET (Xu et al., 2020) to extract the aspect-opinion-sentiment triplets and, after that, detecting the aspect category to output the final quadruples; or (2) first utilizing TAS-BERT (Wan et al., 2020) and the Extract-Classify scheme (Wang et al., 2017) to perform aspect-opinion co-extraction and predicting category-sentiment afterward. Most studies fall into the generation paradigm (Zhang et al., 2021a; Mao et al., 2022; Bao et al., 2022; Gao et al., 2022). Zhang et al.
(2021a) propose the first generation-based method to predict the sentiment quads in an end-to-end manner via a PARAPHRASE modeling paradigm. It has been extended and surpassed by Seq2Path (Mao et al., 2022) and tree-structure generation (Mao et al., 2022; Bao et al., 2022) to tackle the generation order issue or capture more information. Prompt-based generative methods have been proposed to assemble multiple tasks like LEGO bricks to attain task transfer (Gao et al., 2022) or to tackle few-shot learning (Varia et al., 2022). GEN-SCL-NAT (Peper and Wang, 2022) is introduced to exploit supervised contrastive learning and a new structured generation format to improve the naturalness of the output sequences for ASQP. However, existing methods either yield error propagation (pipeline-based methods) or slow computation (generation-based methods). These shortcomings motivate us to propose One-ASQP.

Conclusions
In this paper, we release two new datasets for ASQP, one of which is the first Chinese ASQP dataset, and propose One-ASQP, a method that predicts sentiment quadruples in one step. One-ASQP utilizes a token-pair-based 2D matrix with sentiment-specific horns tagging, which allows for deeper interactions between sentiment elements and enables efficient decoding of all aspect-opinion-sentiment triplets.

A More Details about Datasets Construction
This section provides more details about constructing the two datasets, en-Phone and zh-FoodBeverage.

A.1 Data Sources
The English ASQP dataset, en-Phone, is collected from reviews on Amazon UK, Amazon India, and Shopee in July and August of 2021, covering 12 cell phone brands, such as Samsung, Apple, Huawei, OPPO, and Xiaomi. The first Chinese ASQP dataset, zh-FoodBeverage, is collected from Chinese comments on forums, Weibo, news sites, and e-commerce platforms in the years 2019-2021 under the categories of Food and Beverage.

A.2 Annotation Guidelines
The following outlines the guidelines for annotating the four fundamental sentiment elements of ASQP and their outcomes. Note that our labeled ASQP quadruples are more fine-grained and more difficult than those in existing ASQP benchmark datasets.

A.2.1 Aspect Categories
The aspect category defines the type of the concerned aspect. Here, we apply a two-level category system, defined by our business experts for the sake of commercial value and more detailed information. For example, "Screen" is a first-level category. It can include second-level categories, such as "Clarity", "General", and "Size", forming the final second-level categories "Screen#Clarity", "Screen#General", and "Screen#Size". In the experiments, we only consider the second-level categories.
As reported in Table 2, the numbers of categories for en-Phone and zh-FoodBeverage are 88 and 138, respectively. The number of labeled quadruples per category is larger than 5. Though Laptop-ACOS consists of 121 categories, if we filter out the categories with fewer than 5 annotated quadruples, the number of categories is reduced to 75. Hence, we provide denser and richer datasets for ASQP.

A.2.2 Aspect Terms
The aspect term is usually a noun or a noun phrase indicating the opinion target in the text. It can be implicit in a quadruple (Cai et al., 2021). For the sake of commercial analysis, we exclude sentences without aspects. Moreover, to provide more fine-grained information, we include three additional rules:

• The aspect term can be an adjective or verb when it reveals the sentiment category. For example, in the en-Phone example in Table 10, "recommended" is labeled as an aspect in "Highly recommended" because it identifies the category "Buyer_Atitude#Willingness_Recommend". In Ex. 1 and Ex. 4 of Table 9, "clear" and "cheap" are labeled as the corresponding aspect terms because they specify the categories "Screen#Clarity" and "Price#General", respectively.

• A pronoun is not allowed to be an aspect term, as it cannot be identified from the quadruple alone. For example, in "pretttyyyy and affordable too!!! I love it!! Thankyouuu!!", "it" cannot be labeled as the aspect even though we know from context that it indicates a phone.

• Top priority is given to labeling fine-grained aspects.
For example, in "Don't purchase this product", "purchase" is more related to a customer's purchasing willingness, while "product" is more related to the overall comment, so we label "purchase" as the aspect.

A.2.3 Opinion Terms
The opinion term describes the sentiment towards the aspect. An opinion term is usually an adjective or a phrase with sentiment polarity. Here, we

B.2 Effect of Variants of Interactions
Though our One-ASQP separates the task into ACD and AOSC, there are other variants for resolving the ASQP task. Here, we consider two:

Variant 1: The ASQP task is separated into three subtasks: aspect category detection (ACD), aspect-opinion pair extraction (AOPC), and sentiment detection. More specifically, ACD and sentiment detection are fulfilled by classification models. For AOPC, we adopt the sentiment-specific horns tagging schema proposed in Sec. 3.2.2; that is, we only co-extract the aspect-opinion pairs. In the implementation, we collapse the AB-OE-*SENTIMENT tags to AB-OE and reduce the number of tags for AOSC to three, i.e., {AB-OB, AE-OE, AB-OE}.
Variant 2: We solve the ASQP task in a unified framework. Similarly, via the sentiment-specific horns tagging schema proposed in Sec. 3.2.2, we extend the tags AB-OE-*SENTIMENT to AB-OE-*SENTIMENT-*CATEGORY. Hence, the number of tags increases from 5 to 2 + |S| × |C|, where |S| is the number of sentiment polarities and |C| is the number of categories. This setting allows us to extract the aspect-opinion pairs via the 2D matrix while decoding the categories and sentiment polarities from the tags.
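The growth of the tag space in Variant 2 can be checked with a quick calculation, using the category counts reported for the datasets:

```python
def unified_tags(num_sentiments, num_categories):
    """Variant 2 tag count: {AB-OB, AE-OE} plus one
    AB-OE-*SENTIMENT-*CATEGORY tag per (sentiment, category) pair."""
    return 2 + num_sentiments * num_categories

laptop = unified_tags(3, 121)          # Laptop-ACOS, |C| = 121
zh_food = unified_tags(3, 138)         # zh-FoodBeverage, |C| = 138
# One-ASQP itself needs only 5 tags (AB-OB, AE-OE, and three
# AB-OE-*SENTIMENT tags), independent of |C|.
```
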
Table 12 reports the comparison results on the four datasets, where the base versions of the corresponding LMs are applied. The results show that (1) our One-ASQP performs best among the two proposed variants. We conjecture that the aspect-opinion-sentiment triplets lie in a suitable tag space and our One-ASQP can absorb their interactions effectively.
(2) Variant 2 performs the worst among all results. We conjecture that the search tag space is too large and the available datasets do not contain enough information to train the models.

Figure 1 :
Figure 1: The structure of our One-ASQP: solving ACD and AOSC simultaneously. ACD is implemented by a multi-class classifier. AOSC is fulfilled by a token-pair-based 2D matrix with sentiment-specific horns tagging. The results in the row of "[NULL]" indicate no aspect for the opinion "very speedy". In contrast, the results in the column of "[NULL]" imply no opinion for the aspect "express package".

Figure 2 :
Figure 2: Error analysis and case study. Though the predicted aspects and opinions differ from the golden ones in the above examples, they seem correct.


Table 2 :
Data statistics for the ASQP task. # denotes the number of corresponding elements. s, w, c, and q stand for samples, words, categories, and quadruples, respectively. EA, EO, IA, and IO denote explicit aspect, explicit opinion, implicit aspect, and implicit opinion, respectively. "-" means the item is not included.

Table 4 :
Results on en-Phone and zh-FoodBeverage. Scores are averaged over five runs with different seeds.


Table 5 :
Time cost (seconds) on Restaurant-ACOS and en-Phone. For a fair comparison with baselines, we record the inference time of our One-ASQP with a batch size of 1 and report it in round brackets.

Table 6 :
Breakdown performance (F1 scores) to depict the ability to handle implicit aspects or opinions.E and I stand for Explicit and Implicit, respectively, while A and O denote Aspect and Opinion, respectively.

Table 7 :
Ablation study of One-ASQP on two losses.

Table 8 :
Comparison results on different datasets with different quadruple densities.

Table 11 :
Time cost in seconds on all datasets. For a fair comparison with baselines, we record our One-ASQP inference time with the batch size set to 1 and report it in round brackets. The default batch size is 32.

Table 12 :
Comparison of One-ASQP with two other variants for ASQP.