DiaASQ : A Benchmark of Conversational Aspect-based Sentiment Quadruple Analysis

The rapid development of aspect-based sentiment analysis (ABSA) within recent decades shows great potential for real-world applications. Current ABSA works, however, are mostly limited to the scenario of a single text piece, leaving the study of dialogue contexts unexplored. To bridge the gap between fine-grained sentiment analysis and conversational opinion mining, in this work we introduce a novel task of conversational aspect-based sentiment quadruple analysis, namely DiaASQ, aiming to detect the quadruple of target-aspect-opinion-sentiment in a dialogue. We manually construct a large-scale, high-quality DiaASQ dataset in both Chinese and English. We develop a neural model to benchmark the task, which advances in effectively performing end-to-end quadruple prediction and manages to incorporate rich dialogue-specific and discourse feature representations for better cross-utterance quadruple extraction. We hope the new benchmark will spur more advancements in the sentiment analysis community.


Introduction
It is meaningful to empower machines to understand human opinions and sentiment, which motivates the study of sentiment analysis (Pang and Lee, 2007; McDonald et al., 2007; Ren et al., 2016; Cambria, 2016). ABSA is an important branch of sentiment analysis that aims to detect the sentiment tendencies towards fine-grained aspects of targets, and it has received consistent research attention within the last few years (Li et al., 2018; Fan et al., 2019; Chen et al., 2020; Wu et al., 2021; Chen et al., 2022a). The initial ABSA revolves around the study of aspect terms and sentiment polarities (Tang et al., 2016; Fan et al., 2018; Li et al., 2019). Later, the extraction of opinion terms is considered, resulting in a triplet analysis (i.e., aspect-opinion-sentiment) of ABSA (Peng et al., 2020; Chen et al., 2021). The latest trend of ABSA upgrades the task into the quadruple form by adding the category element to the triplet (Cai et al., 2021; Zhang et al., 2021a). The quadruple ABSA promisingly completes the ABSA definition and helps build a comprehensive understanding of the opinion picture.
Yet we notice that all current ABSA research is confined to the scenario of a single piece of text (i.e., a sentence or document). For example, the currently most popular ABSA benchmark, SemEval (Pontiki et al., 2014, 2015, 2016), comes with only sentence-level annotations. This may limit the application of ABSA. Essentially, in real-world environments ABSA has a broader application under dialogue contexts. For example, people are more likely to discuss certain products, services, or politics on social media (e.g., Twitter, Facebook, Weibo) in the form of multi-turn and multi-party conversations. Also, it is practically meaningful to develop sentiment-support dialog systems to facilitate clinical diagnosis and treatment (Liu et al., 2021a). Unfortunately, no effort has been dedicated to the research of holistic dialogue-level ABSA.
In this paper, we consider filling the gap of dialogue-level ABSA. We follow the line of recent quadruple ABSA and present a task of conversational aspect-based sentiment quadruple analysis, namely DiaASQ. DiaASQ sets the goal of detecting the fine-grained sentiment quadruple of target-aspect-opinion-sentiment given a conversation text, i.e., an opinion of some sentiment polarity has been expressed toward a target with respect to an aspect. As exemplified in Fig. 1, multiple users (speakers) on social media discuss different angles of a product (i.e., a 'Xiaomi' brand cellphone) in dialogue threads of multiple turns. The task aims to extract three quadruples over the dialog: ('Xiaomi 11', 'WiFi module', 'bad design', 'negative'), ('Xiaomi 11', 'battery life', 'not well', 'negative') and ('Xiaomi 6', 'screen quality', 'very nice', 'positive').
To benchmark the task, we manually annotate a large-scale DiaASQ dataset. We collect a corpus of millions of source comments and discussions closely related to electronic products from Chinese social media. We hire well-trained workers to explicitly label the DiaASQ data (i.e., the elements of quadruples: targets, aspects, opinions, and sentiments) based on a crowd-sourcing procedure, which ensures a high quality of annotations. Finally, we yield a dataset with 1,000 dialogue snippets in total, comprising 7,452 utterances. To facilitate the multilinguality of the benchmark, we further translate and project the annotations into English. Data statistics show that each dialog involves around 5 speakers, and 22.2% of the quadruples are in the cross-utterance format.
Compared with previous single-text-based ABSA, DiaASQ is challenging in two main aspects. First, DiaASQ includes four subtasks; directly applying the existing best-performing graph-based ABSA models to enumerate all possible target, aspect, and opinion terms could cause a combinatorial explosion. Second, the elements of a quadruple are scattered around the whole conversation due to the complex replying structure, which requires the model to perform cross-utterance extraction.

Table 1: A comparison between our DiaASQ dataset and existing popular ABSA datasets, including: ASTE (Peng et al., 2020), TOWE (Fan et al., 2019), MAMS (Jiang et al., 2019), and CASA (Song et al., 2022).
To solve these challenges, we present an end-to-end DiaASQ framework. Specifically, based on the grid-filling method (Wu et al., 2020), we redesign the tagging scheme to effectively fulfill the four subtasks in one shot. Moreover, during the dialogue text encoding, we additionally model the dialogue-specific representations for utterance interaction and meanwhile encode the relative distance as a cross-utterance feature. Experiments on the DiaASQ data indicate that our model shows significant superiority over several strong baselines.
To sum up, this work makes three contributions:
• We pioneer the research of dialogue-level aspect-based sentiment analysis. Specifically, we introduce a conversational aspect-based sentiment quadruple analysis (DiaASQ) task.
• We release a dataset for the DiaASQ task in both Chinese and English, which is of high quality and at a large scale.
• We introduce a model to benchmark the DiaASQ task. Our method solves the task end-to-end and meanwhile effectively learns the dialogue-specific features for better cross-utterance sentiment quadruple extraction.
2 Related Work

Fine-grained Sentiment Analysis
All the existing ABSA tasks and their derivations revolve around predicting several elements or their combinations: aspect term, sentiment polarity, opinion term, aspect category, and target. The initial ABSA task aims to classify the sentiment polarities of given aspects (Tang et al., 2016; Fan et al., 2018; Li et al., 2019). Later, a wide range of new compound ABSA-related tasks was proposed, such as aspect-opinion pair extraction (Zhao et al., 2020; Wu et al., 2021), aspect-category prediction (Wang et al., 2019; Jiang et al., 2019; Dai et al., 2020), triplet extraction (Peng et al., 2020; Chen et al., 2021, 2022b), and structured opinion extraction (Shi et al., 2022; Wu et al., 2022). The latest attention has been placed on quadruple or quintuple ABSA, where the aspect category element is added into the triplet extraction (Cai et al., 2021; Zhang et al., 2021a; Liu et al., 2021b; Fei et al., 2022a). Compared to all prior ABSA tasks, the sentiment quadruples provide much more complete opinion details that can better facilitate downstream applications. In this work, we follow this line, while our work differs in three aspects. First, we consider adding the element of target instead of category. Second, current quadruple and quintuple ABSA datasets are all incrementally annotated based on the existing SemEval data (Pontiki et al., 2014, 2015, 2016), while we newly craft our data from a real-world environment. Third, this work mainly focuses on conversation contexts instead of sentence pieces.

Dialogue Opinion Mining
In the NLP community, dialogue applications show increasing impact on real-world environments (Liao et al., 2021; Ni et al., 2022; Liao et al., 2022). Emotion and sentiment analysis in conversation scenarios is an essential branch of opinion mining. Previous dialogue-level opinion mining has been limited to coarse granularity, where the representative task is dialogue emotion detection (Li et al., 2020; Hu et al., 2021; Li et al., 2022). Yet as we indicated earlier, sentiment analysis in conversation at a fine-grained level has practical value. In this paper, we pioneer the research of dialogue-level ABSA, presenting the conversational aspect-based sentiment quadruple analysis task.
In Table 1 we compare our DiaASQ data with existing popular ABSA benchmarks. It is worth noticing that, although CASA (Song et al., 2022) is a dialogue-level sentiment analysis dataset, it may fail to provide a comprehensive understanding of the opinion status due to the absence of key elements (e.g., aspect). In contrast, our DiaASQ dataset covers target, aspect, opinion, and sentiment, which makes it by now the most comprehensive ABSA benchmark among all the other corpora. In addition, sentiment understanding in DiaASQ is more complex and thus more challenging; for example, one aspect term could correspond to multiple sentiments. Besides, DiaASQ contains both Chinese and English versions, which will help the research community study different languages.

Figure 2: The tree-like dialogue replying structure.

Data Construction
We construct a new dataset to facilitate the DiaASQ task. The raw corpus is collected from the largest Chinese social media platform, Weibo. We crawl nine million posts and comments from the tweet histories of 100 verified digital bloggers. Each conversation is derived from a root post, and multiple users (i.e., multiple speakers) reply to a predecessor post. The multi-thread and multi-turn dialogue forms a tree structure, as illustrated in Fig. 2. We preprocess the raw dialogues to make the contexts integrated. First, we filter topic-related conversations with a manually created keyword dictionary for the mobile phone field, which includes hundreds of hot words, such as phone brand names, aspect words describing a mobile phone, etc. Then, we normalize the tweet language expressions (e.g., abusive language, hate speech) by human examination or by consulting lexicons, and we prune away those meaningless replying branches that deviate too much from the main topic. We also limit the maximum number of utterances to ten for better controllable modeling. After this strict cleaning procedure, we obtain the final 1,000 dialogues.
During the annotation stage, all the conversation texts are labeled by a team of crowd-workers who are pre-trained under the SemEval ABSA (Pontiki et al., 2014) annotation guideline. Also, linguistic and computer science experts inspect the labeling schema. After annotating, annotators are required to cross-examine the labels, and some automatic rules are applied to verify the labeling consistency. Finally, the Cohen's Kappa score of quadruples is 0.86, which indicates that our annotated corpus has reached a high level of agreement.
Data Insights. We randomly split the conversation snippets into train/valid/test sets in the ratio of 8:1:1. The Chinese version of the dataset contains a total of 7,452 utterances and 5,742 sentiment quadruples, while the English version contains 5,514 quadruples, far larger numbers than the existing quadruple and quintuple ABSA datasets (Cai et al., 2021; Zhang et al., 2021a). Also, there is an average of one sentiment expression in each utterance. Such annotation density makes task prediction quite convenient. The data statistics are shown in Table 2. Each dialog has around five speakers on average, and the dataset contains 1,275 (22.2%, in Chinese) and 1,227 (22.3%, in English) cross-utterance quadruples, respectively. In Fig. 3, we show the ratio of quadruples in the dataset under different cross-utterance levels. More data statistics are shown in Appendix § B.

Figure 3: The ratio of cross-utterance quadruples. We define the max utterance-level distance between any two items in one quadruple as the cross-utterance level. For example, the first quadruple in Fig. 1 crosses two utterances.
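The cross-utterance level of a quadruple, as defined above, can be computed with a one-line helper; the utterance indices below are illustrative, not taken from the dataset:

```python
# Cross-utterance level: the maximum utterance-level distance between
# any two elements (target/aspect/opinion) of one quadruple.

def cross_utterance_level(quad_utterances):
    """quad_utterances: utterance indices of the quad's term spans."""
    return max(quad_utterances) - min(quad_utterances)

# A quad whose target sits in utterance 0 and whose opinion sits in
# utterance 2 crosses two utterances:
print(cross_utterance_level([0, 1, 2]))  # -> 2
print(cross_utterance_level([3, 3, 3]))  # -> 0 (intra-utterance)
```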

Grid-tagging Task Modeling with Renewed Label Scheme
The input of the DiaASQ task includes a dialogue D = {u_1, ..., u_N} with its replying list l = {l_1, ..., l_N}, where l_i denotes that the i-th utterance replies to the l_i-th utterance. Each i-th utterance text u_i = {w_1, ..., w_m} is a token sequence, where m is the length of utterance u_i. The replying record l reflects the hierarchical tree structure of D. Based on the input D and l, DiaASQ aims to extract all possible (target, aspect, opinion, sentiment) quadruples, denoted as Q = {(t_k, a_k, o_k, p_k)}_k. DiaASQ naturally includes four subtasks. Different popular end-to-end ABSA systems can be utilized to solve our DiaASQ, such as the graph-based (Zhou et al., 2021; Chen et al., 2022a), seq-to-seq (Zhang et al., 2021c; Mukherjee et al., 2021), and grid-tagging models (Wu et al., 2020). Yet enumerating all possible terms with graph-based methods is computationally expensive, while seq-to-seq methods suffer from exposure bias. The grid-tagging method advances in higher efficiency, i.e., O(n^2) complexity, where n denotes the sequence length. However, the labeling scheme in (Cai et al., 2021; Zhang et al., 2021a) only supports term-pair extraction (i.e., aspect and opinion terms), which fails to directly solve our DiaASQ that requires term-triple extraction (i.e., target, aspect, and opinion terms). Here we inherit the success of the grid-tagging method for an end-to-end solution and redesign the labeling scheme to fit our needs.
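As a rough illustration of the task input and output described above, a minimal Python sketch; the field names and span encoding are our own illustrative choices, not the official data format:

```python
# Hypothetical containers for a DiaASQ instance: a dialogue D with its
# replying list l and speakers, plus quadruple outputs whose term spans are
# (utterance index, start token, end token) triples.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Dialogue:
    utterances: List[List[str]]   # token lists u_1..u_N
    replies: List[int]            # l_i: index of the utterance u_i replies to (-1 for root)
    speakers: List[int]           # speaker id of each utterance

Span = Tuple[int, int, int]       # (utterance, start, end)
Quad = Tuple[Span, Span, Span, str]  # (target, aspect, opinion, sentiment)

dlg = Dialogue(
    utterances=[["Xiaomi", "6", "screen", "quality", "very", "nice"]],
    replies=[-1],
    speakers=[0],
)
quad: Quad = ((0, 0, 1), (0, 2, 3), (0, 4, 5), "positive")
```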
To reach the goal, we re-decompose the task into three joint jobs: detection of entity boundaries, entity pairs, and sentiment polarities. We renew the labeling scheme of grid-tagging in support of these jobs, as shown in Fig. 4.
• Entity Boundary Labels: We use tgt, asp, opi to denote the token-level relations between the head and tail of a target, aspect, and opinion term, respectively. For example, the tgt label between 'Xiaomi' and '6' denotes a target term 'Xiaomi 6'.
• Entity Pair Labels: We then need to link different types of terms together as a combination. To represent the relation between entities, we devise two labels: h2h and t2t, which align the head and tail tokens, respectively, between a pair of entities of two types. For example, the head words of 'Xiaomi' (target) and 'screen' (aspect) are connected with h2h, while the tail words of '6' (target) and 'quality' (aspect) are connected with t2t. By labeling a chain of term pairs of different types, we form a triplet (t_k, a_k, o_k).
• Sentiment Polarity Labels: By adding a sentiment category label p_k, we then form a quad q_k = (t_k, a_k, o_k, p_k). Since the target and opinion terms together determine a unique sentiment, we assign the category label between the heads and tails of these two terms, as shown in Fig. 4.
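The three label families above can be illustrated on a toy utterance; the token positions and grid encoding below are an illustrative sketch, not the exact implementation:

```python
# Toy grids for one triplet "Xiaomi 6 / screen quality / very nice".
# Each grid maps a (row token, column token) cell to a label.
tokens = ["Xiaomi", "6", "screen", "quality", "very", "nice"]

ENT, PAIR, POL = {}, {}, {}

# Entity boundary labels: head-tail token relations of each term.
ENT[(0, 1)] = "tgt"   # target  "Xiaomi 6"
ENT[(2, 3)] = "asp"   # aspect  "screen quality"
ENT[(4, 5)] = "opi"   # opinion "very nice"

# Entity pair labels: h2h links heads, t2t links tails, chaining
# target-aspect and aspect-opinion pairs into a (t, a, o) triplet.
PAIR[(0, 2)] = "h2h"; PAIR[(1, 3)] = "t2t"   # target-aspect
PAIR[(2, 4)] = "h2h"; PAIR[(3, 5)] = "t2t"   # aspect-opinion

# Sentiment polarity labels between the target and opinion heads/tails.
POL[(0, 4)] = "pos"; POL[(1, 5)] = "pos"
```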

DiaASQ Model
We present a DiaASQ model to accomplish the task based on the above grid-tagging label scheme. Fig. 5 shows the overall architecture.

Base Encoding
We adopt a pre-trained language model (PLM), e.g., BERT (Devlin et al., 2019), to encode the dialogue utterances. However, the length of a whole dialogue may far exceed the max length that BERT can accept. We thus encode each utterance with a separate PLM pass, one by one, using the [CLS] and [SEP] tokens to wrap each utterance u_i:

H_i = {h_1, ..., h_m} = PLM([CLS] u_i [SEP]), (2)

where h_m is the contextual representation of w_m.

Dialogue-specific Multi-view Interaction
To strengthen the awareness of the dialogue discourse, we then introduce a multi-view interaction layer to learn dialogue-specific features. This layer is built upon multi-head self-attention (Vaswani et al., 2017). Inspired by Shen et al. (2021) and Zhao et al. (2022), we use three types of features: dialogue threads, speakers, and replying. Specifically, we realize the idea by constructing attention masks M^c that carry the bias of such prior features, controlling the interactions between tokens, where c ∈ {Th, Sp, Rp} represents the different types of token interaction, i.e., thread, speaker, and replying, respectively:
v^c = softmax((Q K^T / √d) ⊙ M^c) V, (3)

where Q = K = V = H ∈ R^{N×d} is the representation of the whole dialogue sequence obtained by concatenating the token representations of each utterance (H_i in Eq. (2)), N is the token-level length of D, and ⊙ is element-wise product. The value of M^c ∈ R^{N×N} is defined as follows:
• Thread Mask: M^{Th}_{ij} = 1 if the i-th and j-th tokens belong to the same dialogue thread.
• Speaker Mask: M^{Sp}_{ij} = 1 if the i-th and j-th tokens are derived from the same speaker.
• Reply Mask: M^{Rp}_{ij} = 1 if the two utterances containing the i-th and j-th tokens, respectively, have a replying relation.
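The three masks above can be built directly from per-utterance metadata; the sketch below is a minimal numpy rendering under our own illustrative interface (utterance lengths, thread/speaker ids, and reply links), not the authors' code:

```python
import numpy as np

def build_masks(utt_lens, threads, speakers, replies):
    """Return the token-level Thread/Speaker/Reply masks M^Th, M^Sp, M^Rp."""
    utt_of = np.repeat(np.arange(len(utt_lens)), utt_lens)  # token -> utterance
    N = len(utt_of)
    th = np.asarray(threads)[utt_of]
    sp = np.asarray(speakers)[utt_of]
    M_th = (th[:, None] == th[None, :]).astype(int)   # same dialogue thread
    M_sp = (sp[:, None] == sp[None, :]).astype(int)   # same speaker
    # Reply mask: tokens of u_i and u_{l_i} attend to each other.
    M_rp = np.zeros((N, N), dtype=int)
    for i, li in enumerate(replies):
        if li < 0:  # root utterance
            continue
        a, b = utt_of == i, utt_of == li
        M_rp[np.ix_(a, b)] = 1
        M_rp[np.ix_(b, a)] = 1
    return M_th, M_sp, M_rp

# Two utterances of 2 tokens each; u_1 replies to u_0, same thread, two speakers.
M_th, M_sp, M_rp = build_masks([2, 2], threads=[0, 0], speakers=[0, 1], replies=[-1, 0])
```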
We then conduct Max-Pooling over the masked representations, followed by a tag-wise MLP layer to yield the final feature representation v^r_i:

v^r_i = MLP^r(MaxPool(v^{Th}_i, v^{Sp}_i, v^{Rp}_i)), (5)

where r indicates a specific label, and ε_ent denotes the non-relation label in the entity boundary matrix.

Integrating Dialogue Relative Distance
Limited by the PLM, we can only encode each utterance separately, potentially hurting the conversational discourse. To compensate for this, we consider fusing the Rotary Position Embedding (RoPE) (Su et al., 2021) into the tag-wise representations, where the relative dialogue distance information helps guide better discourse understanding:

ṽ^r_i = R(θ, i) v^r_i,

where R(θ, i) is a positioning matrix parameterized by θ and the absolute index i of v^r_i.

Figure 5: The overall framework of our DiaASQ model. First, the base encoder learns base contextual representations for the input dialogue texts. The multi-view interaction layer then aggregates dialogue-specific feature representations, such as the thread, speaker, and replying information. We further fuse the Rotary Position Embedding (RoPE), where the relative dialogue distance information helps guide better discourse understanding. Finally, the system decodes all the quadruples based on the grid-tagging labels.
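The key property of the rotation matrix R(θ, i) is that dot products between rotated vectors depend only on relative position. A toy single-frequency, 2-d numpy sketch (a simplification of full RoPE, which rotates many 2-d sub-blocks at different frequencies):

```python
import numpy as np

def rope_rotate(v, pos, theta=1.0):
    """Rotate a 2-d vector v by the angle pos * theta (toy 2-d RoPE)."""
    c, s = np.cos(pos * theta), np.sin(pos * theta)
    R = np.array([[c, -s], [s, c]])
    return R @ v

q = np.array([1.0, 0.0])
k = np.array([1.0, 0.0])
# The score between positions 5 and 3 equals the score between 4 and 2,
# since both pairs have relative distance 2:
s1 = rope_rotate(q, 5) @ rope_rotate(k, 3)
s2 = rope_rotate(q, 4) @ rope_rotate(k, 2)
print(np.isclose(s1, s2))  # -> True
```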

Quadruple Decoding
Based on each tag-wise representation v^r_i, we finally calculate the unary score between any token pair in terms of label r:

s^r_ij = (v^r_i)^T v^r_j,

where s^r_ij is the score that the relation label between w_i and w_j is r. Then we put a softmax layer over all elements in each matrix to determine the relation label r. For example, the probability in the entity boundary matrix can be obtained via:

p_ij = softmax([s^{tgt}_ij ; s^{asp}_ij ; s^{opi}_ij ; s^{ε_ent}_ij]). (8)

Obtaining all the labels in the grid, we decode all the quadruples based on the rules stated in § 4.
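The rule-based decoding can be sketched as follows: read term spans off the entity boundary grid, link them through the h2h/t2t pair labels, and attach the polarity found between the target and opinion heads. This is an illustrative reconstruction over toy grids, not the exact decoding rules:

```python
def decode(ent, pair, pol):
    """Decode (target, aspect, opinion, polarity) quads from label grids."""
    spans = {lab: [(i, j) for (i, j), l in ent.items() if l == lab]
             for lab in ("tgt", "asp", "opi")}

    def linked(a, b):  # both head-to-head and tail-to-tail links must exist
        return pair.get((a[0], b[0])) == "h2h" and pair.get((a[1], b[1])) == "t2t"

    quads = []
    for t in spans["tgt"]:
        for a in spans["asp"]:
            if not linked(t, a):
                continue
            for o in spans["opi"]:
                if linked(a, o) and (t[0], o[0]) in pol:
                    quads.append((t, a, o, pol[(t[0], o[0])]))
    return quads

# Toy grids for "Xiaomi 6 / screen quality / very nice / positive".
ent = {(0, 1): "tgt", (2, 3): "asp", (4, 5): "opi"}
pair = {(0, 2): "h2h", (1, 3): "t2t", (2, 4): "h2h", (3, 5): "t2t"}
pol = {(0, 4): "pos", (1, 5): "pos"}
print(decode(ent, pair, pol))  # -> [((0, 1), (2, 3), (4, 5), 'pos')]
```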

Learning
The training target is to minimize the cross-entropy loss of each subtask:

L_k = − (1/|G|) Σ_G Σ_{i,j}^N α_k · y^k_ij log p^k_ij,

where k ∈ {ent, pair, pol} indicates a subtask, N is the total token length of a dialogue, and G is the set of training instances. y^k_ij is the ground-truth label and p^k_ij is the prediction. The label types (stated in § 4) are imbalanced; thus we apply a tag-wise weighting vector α_k to counteract this. We then add up the three loss items as the final one:

L = L_ent + L_pair + L_pol.

6 Experiment

Settings
We conduct experiments on our DiaASQ dataset to evaluate the efficacy of the proposed model. We mainly measure the performance from three angles: 1) span match: the boundaries of the three types of term spans; 2) pair extraction: the detection of span pairs, i.e., Target-Aspect, Aspect-Opinion, and Target-Opinion; 3) quadruple extraction: recognizing the full quad of the DiaASQ task. We use exact F1 as the metric: for a span, a correct prediction should match both the left and right boundaries; for a pair, both spans and the relation; for a quad, all four elements exactly. The performance of quadruple extraction is our main focus.
We thus take the micro F1 and identification F1, respectively, for measurement, where the micro F1 measures the whole quad, including the sentiment polarity; in contrast, identification F1 (Barnes et al., 2021) does not distinguish the polarity. We take Chinese-Roberta-wwm-base (Cui et al., 2021) and Roberta-Large (Liu et al., 2019) as our base encoders for the Chinese and English datasets, respectively. We put a 0.2 dropout rate on the BERT output representations. The MLP in Eq. (5) has a 64-d hidden size. The testing results are given by the models tuned on the development set. All experiments take five different random seeds, and the final scores are averaged over the five runs.
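The distinction between the two quad-level metrics can be made concrete with a small sketch; quads are represented as tuples of strings for illustration:

```python
# Micro F1 requires an exact match of all four elements; identification F1
# drops the sentiment polarity before matching.

def quad_f1(gold, pred, ignore_polarity=False):
    strip = (lambda q: q[:3]) if ignore_polarity else (lambda q: q)
    g, p = {strip(q) for q in gold}, {strip(q) for q in pred}
    tp = len(g & p)
    if tp == 0:
        return 0.0
    prec, rec = tp / len(p), tp / len(g)
    return 2 * prec * rec / (prec + rec)

gold = [("Xiaomi 6", "screen quality", "very nice", "pos")]
pred = [("Xiaomi 6", "screen quality", "very nice", "neg")]
print(quad_f1(gold, pred))                        # -> 0.0  (polarity wrong)
print(quad_f1(gold, pred, ignore_polarity=True))  # -> 1.0  (spans all match)
```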
As no prior method is deliberately designed for DiaASQ, we re-implement several strong-performing systems closely related to the task as our baselines, including CRF-Extract-Classify (Cai et al., 2021), SpERT (Eberts and Ulges, 2020), Span-ASTE (Xu et al., 2021), and ParaPhrase (Zhang et al., 2021a). All baselines take the same PLM as used in our model, except that ParaPhrase uses mT5-base (Xue et al., 2021).

Main Comparisons
Table 3 compares the performance of different models on the DiaASQ task. We see that our proposed method achieves the overall best results under almost all measurements. Besides, we have the following observations. First, the performance divergences of different models on span detection are not significant, and all methods perform well on this subtask. We think this is mainly because, without considering the inter-relations between each type of term (T/A/O), recognizing the mentions is a relatively simple task.
Second, it is clear that our model starts surpassing the baselines on pair-wise detection. Our system outperforms the second-best models by around 9% F1 on average across almost all cases, i.e., T-A, T-O, and A-O. This result verifies that our model is more effective than the baselines for sentiment information extraction in the conversational scenario. One exception is that Span-ASTE slightly exceeds our model on A-O pair extraction in the English version of the dataset. The possible reason is that an aspect and opinion pair usually co-occur closely, and this has been a classical setting in which Span-ASTE achieves competitive results.
Finally and most importantly, our system shows substantial wins on quadruple extraction, with 7.52% micro F1 (= 34.94 − 27.42) and 6.66% identification F1 (= 37.51 − 30.85) improvements on the Chinese dataset, and 6.32% micro F1 (= 33.31 − 26.99) and 8.46% identification F1 (= 36.80 − 28.34) improvements on the English dataset, respectively. This result evidently shows our model's efficacy on the task. We also find that stripping off the PLMs hurts the task performance very prominently, even for the strong models.

Ablation Study
We now take a further step, examining the efficacy of several key designs in our method, including the dialogue-specific multi-view interaction, the relative distance embedding (RoPE), and the label-wise weighting mechanism. The ablation results are shown in Table 4.
First, we see that the different types of dialogue-specific interaction show varying influence. For example, thread features show the most negligible overall impact, improving the F1 score of Inter-Utt by no more than 1% on the two datasets. In contrast, the speaker-aware and reply-aware interactions are more important, improving the Inter-Utt score by more than 1%. Interestingly, some ablations increase the performance in the intra-utterance case but decrease it rapidly in the cross-utterance case.
Then, we witness the most significant performance drops when removing the RoPE feature. Notably, the F1 score of cross-utterance extraction drops by 2.99% and 3.54% on the Chinese and English datasets, respectively. This result demonstrates the importance of modeling dialogue-level discourse information. Finally, we see that the label-wise weighting mechanism used for task learning is also crucial. This finding is reasonable because the labels of different types in the grid over the whole dialogue are imbalanced and sparse, e.g., the positive tags are far fewer than the negative ones (i.e., ε_ent). Label-wise weighting helps effectively mitigate this label imbalance issue.
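As a rough illustration of how such a tag-wise weighting counteracts the imbalance toward the null label, a numpy sketch of a weighted cross-entropy over one grid; the shapes and weight values are illustrative, not the exact training objective:

```python
import numpy as np

def weighted_ce(p, y, alpha):
    """p: (N, N, R) label probabilities; y: (N, N) gold label ids; alpha: (R,)."""
    N = y.shape[0]
    # Probability assigned to the gold label in each grid cell.
    gold_p = np.take_along_axis(p, y[..., None], axis=-1).squeeze(-1)
    # Per-cell loss scaled by the weight of its gold label.
    return -(alpha[y] * np.log(gold_p)).sum() / (N * N)

p = np.full((2, 2, 2), 0.5)      # uniform distribution over 2 labels
y = np.array([[0, 1], [1, 0]])   # gold labels; 0 plays the role of the null tag
alpha = np.array([0.5, 1.0])     # down-weight the over-represented null label
loss = weighted_ce(p, y, alpha)
```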

Further Analysis
In this section, we dive into the model performance and carry out an in-depth analysis to better understand the strengths of our method.

Cross-utterance Quadruple Extraction. Earlier, in Table 3, we verified the superiority of our model. We mainly credit its capability to effectively model cross-utterance features. Here we directly examine this attribute by observing the performance under different levels of cross-utterance quad extraction. As plotted in Fig. 6, we observe the pattern that the more utterances a quadruple crosses, the lower the performance all models achieve. Especially when the cross-utterance level is ≥3, the baseline systems fail to recognize any single quad. Nevertheless, our system can still resolve the challenge well, even in the cross-≥3-utterance case. Also, by comparing two of our ablated models, we learn that the dialogue-specific interaction features are more beneficial for handling super-long-distance cross-utterance extraction, while the RoPE that carries discourse information contributes more to the short-range case (i.e., cross-1-utterance).
Impact of Dialogue-level Distance Encoding. We equip our framework with dialogue-level relative distance embeddings (i.e., RoPE, a dynamic positioning feature) so as to enhance conversational discourse understanding. Here we study the influence of using different dialogue-level distance embeddings. We consider two other alternative solutions: 1) Relative position encoding, a dense embedding of the relative distance between utterances; we directly add this embedding to the token relation probability vector in Eq. (8) to introduce the information. 2) Global position encoding, an absolute position embedding of the token; we utilize the global position by adding it to the token representation v^r_i in Eq. (5). We study the performance changes on quadruple extraction under these alternatives, as shown in Fig. 7. We see that the Global position strategy consistently shows the lowest helpfulness compared to the relative position methods. This finding suggests that relative distance is more helpful for modeling the conversation discourse. Moreover, RoPE gives the best results, especially in inter-utterance cases. Intuitively, such dynamic position information offers more flexible bridging knowledge for easing the long-range dependence issue of term pairing, where the entities are separated across distant utterances.

What To Do Next?
In this work, we propose an initial method to solve the DiaASQ task. Although it achieves stronger performance than the baselines, it could be further improved from many angles. To facilitate follow-up research in this direction, we try to shed light on several potential future directions.
▶ Making Better Use of the Dialogue Discourse Structure Information. The core challenge of the DiaASQ task lies in handling conversation contexts. Compared to the typical case of single sentences, dialogue utterances are syntactically disjoint. Thus, it is critical to carefully model the dialogue discourse structure information (Fei et al., 2022b), so as to better capture the dialogue semantics for better recognition of cross-utterance quadruples. Although we leverage the dialogue relative distance information (RoPE) in this work, without treating the dialogue utterances as a whole, our method may still lose some important discourse information. As seen in Fig. 6, our model's performance on quads crossing many utterances is still far from satisfactory, i.e., a zero F1 score in the cross->3-utterance case. Intuitively, constructing an explicit conversational discourse structure (i.e., a tree or graph structure) for the task is promising.
▶ Enhancing Coreference Resolution. In the conversation scenario, speaker and target coreference is one of the biggest issues. In the DiaASQ task, the bundled sentiment elements (e.g., target, aspect, and opinion term) of one quad may be produced by different users or may come from one individual. Besides, the sentiment terms may be coreferred to by pronouns, for example, 'the screen quality of it', where 'it' refers to the target term 'Xiaomi 6' mentioned in the previous context. Without correctly resolving the coreference, it is problematic for a system to precisely capture the context semantics, which leads to wrong pairings between sentiment elements and unexplainable predictions.
▶ Extracting Overlapped Quadruples. It is common in our DiaASQ dataset that one sentiment term of a quad overlaps with terms of another quad. For example, different electronic devices (targets) may share the same aspects, e.g., battery life, screen, or size. A sound DiaASQ system should also handle this quadruple overlap issue well. We note that overlapped quads can essentially share certain structural information, and thus it is favorable to use such shared knowledge effectively.
▶ Transferring Well-learned Sentiment Knowledge from Existing Systems. The sentiment analysis community has developed a great number of powerful ABSA systems well-trained on large-scale free texts or existing sentiment corpora (Xu et al., 2020; Tian et al., 2020; Li et al., 2021). Since this work still inherits the basic spirit of ABSA, it is a naturally promising idea to transfer an existing well-trained, sentiment-enriched ABSA model to enhance understanding for the DiaASQ task.
▶ Multi-/Cross-lingual Dialogue ABSA. One of the key challenges for more accurate multi-/cross-lingual ABSA is the lack of parallel annotations in different languages, which causes trouble for label alignment (Feng and Wan, 2019; Fei and Li, 2020; Zhang et al., 2021b). As we annotate the DiaASQ dataset in two languages (i.e., Chinese and English) with parallel sentences, this paves the way for research on more effective multi-lingual or cross-lingual dialogue-level ABSA.

Conclusion
This work introduces a new task of conversational aspect-based sentiment quadruple analysis, namely DiaASQ, which aims to detect the sentiment quadruple of target-aspect-opinion-sentiment structure in conversation texts. DiaASQ bridges the gap between conversational opinion mining and fine-grained sentiment analysis. We manually construct a large-scale, high-quality dataset with Chinese and English versions for the task, with 1,000 dialogue snippets comprising 7,452 utterances. We then benchmark the DiaASQ task with an end-to-end neural model, which effectively models the dialogue utterance interactions. Experiments demonstrate the advantages of our method in effectively learning the dialogue-specific features for better cross-utterance sentiment quadruple extraction.

Limitations
Our paper has the following potential limitations. First, our current DiaASQ dataset is limited to the domain of digital devices. We plan to further extend the DiaASQ texts to other domains, e.g., foods/restaurants, hotels/trips, etc. Second, our proposed model may suffer from insufficient modeling of the dialogue-level discourse structure information, which would somewhat prevent further task improvements. Third, in the DiaASQ task, it is more difficult to recognize opinion terms than to extract target and aspect terms. This may largely deteriorate the overall performance because opinion expressions are much more flexible and are sometimes subject to satirical expression.

A Model and Setup Specification
Algorithm 1 Calculating global indices of tokens in two threads. Require: P_t; two threads T_i, T_j, where i, j are thread ids. if i · j == 0 or i == j then ...

Distance Encoding Details. In our data, tokens may be distributed across different dialogue threads. Therefore, the relative distance between two tokens cannot be calculated by subtracting their absolute position ids. However, RoPE uses global indices to represent relative distance, i.e., R(θ, m)^T R(θ, n) = R(θ, n − m), where m and n are the absolute positions of two tokens; when m − n is not the true relative distance of the two tokens, RoPE cannot work as usual. Therefore, we develop a method to calculate the relative token distance for different thread pairs. In detail, for each token t, we define the distance between t and the root node as its local position id P_t. For each pair of threads T_i and T_j, the absolute positions that represent their relative distance can be calculated by Algorithm 1.
For example, as the differently colored blocks in Fig. 8 show, (P^{ij}_t - P^{ij}_{t'}) (t ∈ T_i, t' ∈ T_j) is the relative distance between t and t'. We then use the calculated P^{ij}_t as the absolute position when performing the RoPE operation.
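The cross-thread position assignment can be sketched as follows. This is a minimal, illustrative realization of the idea behind Algorithm 1, not necessarily the paper's exact procedure: negating one thread's local ids makes the index difference of two cross-thread tokens equal their tree-path distance through the root.

```python
def global_indices(pos_i, pos_j, i, j):
    """Assign global position indices to tokens in two threads so that
    index differences equal tree distances.

    pos_i / pos_j: local position ids P_t (distance to the root node)
    for tokens in threads T_i / T_j.
    i, j: thread ids (0 denotes the root thread).

    A hypothetical sketch; the paper's Algorithm 1 may differ in details.
    """
    if i * j == 0 or i == j:
        # Same thread, or one thread is the root thread: the tokens lie
        # on a single root-to-leaf path, so local ids already yield
        # correct relative distances.
        return pos_i, pos_j
    # Different threads: the path between two tokens goes through the
    # root, so their distance is P_t + P_t'.  Negating one thread's ids
    # makes the index difference equal that path length.
    return pos_i, [-p for p in pos_j]
```

These global indices can then be fed to RoPE in place of the raw absolute positions.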

Specification of Baselines.
As no prior method is deliberately designed for DiaASQ, we re-implement several strong-performing systems closely related to the task as our baselines. Here we give a complete description of these baseline systems.
• CRF-Extract-Classify is a three-stage system (extract, filter, and combine) proposed by Cai et al. (2021) for sentence-level quadruple ABSA. We retrofit the model to additionally support target term extraction.
• SpERT (Eberts and Ulges, 2020) is a span-based transformer for joint extraction of entities and relations. We slightly modify the model to support triple-term extraction and polarity classification.
• Span-ASTE is a span-based approach to triplet ABSA extraction (Xu et al., 2021). We likewise adapt it to the DiaASQ task by editing its last stage to enumerate triplets.
• ParaPhrase is a generative sequence-to-sequence model for quadruple ABSA extraction (Zhang et al., 2021a). We re-implement the model and modify its outputs to fit our task. In short, given the source dialogue, we expect the model to output a sentiment-aware string: "Target is great/bad/ok, because the Aspect of it is Opinion ...", where Target/Aspect/Opinion are term placeholders and great/bad/ok are opinionated expressions indicating the specific sentiment polarity, i.e., positive/negative/other. For the dialogue in Fig. 1, the model is expected to generate such a string.
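The ParaPhrase-style linearization can be illustrated with a small helper that renders one quadruple into the template above (the function name is ours; a sketch of the adaptation, not the original implementation):

```python
def quad_to_paraphrase(target, aspect, opinion, polarity):
    """Linearize a sentiment quadruple into the ParaPhrase-style target
    string used to supervise the seq-to-seq baseline.  The wording
    follows the template described in the text."""
    polarity_word = {"positive": "great", "negative": "bad", "other": "ok"}[polarity]
    return f"{target} is {polarity_word}, because the {aspect} of it is {opinion}"
```

For a quadruple such as (Xiaomi 6, screen quality, very nice, positive), this yields "Xiaomi 6 is great, because the screen quality of it is very nice".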

B Extended Data Specification
Polarity Distribution. We report statistics on the polarity of quadruples in both the Chinese and English datasets.
As illustrated in Fig. 9, most quadruples express a clear sentiment tendency, which is consistent with users' speaking habits on social media. The positive and negative sentiment rates are close, indicating that our data sampling is balanced. Furthermore, since the remaining three polarities are quite rare in our dataset, we merge them into a new category, others, for the convenience of extraction.
Cross-utterance Quadruples. We also analyze the categories and numbers of cross-utterance quadruples, as shown in Fig. 10. In addition, quadruples may overlap with each other in our DiaASQ dataset, which is not specially described in our main article due to space limitations; the overlap cases are illustrated in Fig. 11, with statistics in Table 6.

C Specification on Data Construction
This part details how we constructed the DiaASQ dataset, including data acquisition and annotation projection.

C.1 Data Acquisition
Fig. 12 illustrates the overall workflow by which we obtain a high-quality original corpus from social media. First, based on the official leaderboard, we collected the top 100 influential digital-domain bloggers on Weibo and crawled as many of their historical tweets as possible. Meanwhile, we also built a library of mobile-phone-related keywords and crawled tweets and their comments retrieved by these keywords. After this step, we obtained nearly 9 million tweets and comments, with the replying relations also recorded. Then, we conducted a preliminary screening to exclude posts with fewer than ten replies or no parent node; about 1.2 million posts were retained after this procedure. Next, according to the reply relations, we combined these posts into dialogue trees, whose root nodes are level-1 comments below each primary tweet, with a maximum depth of no more than 4. Based on the phone-related keywords and a collection of abusive words, a stricter filtering rule is then applied at the tree level: a thread is kept if it contains any two of the phone-related keywords and does not contain any abusive words, and a dialogue tree is selected as a candidate dialogue once it has three valid threads and a total number of nodes between 6 and 10. Around 6,000 dialogue trees are left after these steps. Finally, we manually checked the candidate dialogues, and those that are indeed phone-related and have no ethical issues were selected as the final corpus. After this rigorous processing, we obtained 1,000 pieces of high-quality tree-like dialogues.
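The thread- and tree-level filtering rules described above can be sketched as follows (a simplified illustration; the keyword and abusive-word lists are placeholders for the actual lexicons):

```python
def is_valid_thread(text, phone_keywords, abusive_words):
    """Thread-level filter: keep a thread if it mentions at least two
    distinct phone-related keywords and contains no abusive words."""
    if any(w in text for w in abusive_words):
        return False
    hits = {kw for kw in phone_keywords if kw in text}
    return len(hits) >= 2

def is_candidate_dialogue(thread_texts, n_nodes, phone_keywords, abusive_words):
    """Tree-level filter: a dialogue tree becomes a candidate dialogue
    if it has at least three valid threads and 6-10 nodes in total."""
    valid = sum(is_valid_thread(t, phone_keywords, abusive_words)
                for t in thread_texts)
    return valid >= 3 and 6 <= n_nodes <= 10
```

Candidate dialogues surviving these rules are then manually checked, as described above.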

C.2 Parallel-Language Data Construction
We also construct an English version of the dataset from the Chinese corpus via the annotation projection method. Following Fei et al. (2020) and Zhen et al. (2021), the entire process contains two steps: text translation and annotation projection. We manually revise the result of each step to ensure corpus quality.
Step 1: Text Translation. We first utilize the Google Translate API (https://cloud.google.com/translate) to translate the Chinese text into English. Despite the strong performance of neural machine translation (NMT), it still makes some mistakes during translation. The main reason is that our corpus is collected from social media and is full of ungrammatical sentences, which challenges the NMT system to generate correct and fluent translations. Therefore, we carefully revise the translations to eliminate errors and improve readability. Table 7 lists one such error and its revision.
Step 2: Annotation Projection. We then project the original Chinese annotations to obtain the English-version corpus. Specifically, we perform the projection with the help of awesome-align (Dou and Neubig, 2021), an excellent alignment tool based on large-scale multilingual language models. We found that the alignment tool is not good at aligning named entities; a representative error and its correction are shown in Fig. 13. After manually correcting all of the projection results, we obtained the final annotated corpus.
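The projection step can be sketched as follows: given token alignments such as those produced by awesome-align, a Chinese annotation span is mapped to the English side by taking the extremes of its aligned target indices. This is a simplified illustration of the projection logic, not the tool's API; unaligned spans are flagged for the manual correction described above.

```python
def project_span(src_span, alignments):
    """Project a (start, end) token span from the source sentence to the
    target sentence via word alignments, given as (src_idx, tgt_idx)
    pairs.  Returns the tightest target span covering all aligned
    tokens, or None if the span is entirely unaligned."""
    start, end = src_span
    tgt = [t for s, t in alignments if start <= s <= end]
    if not tgt:
        return None  # unaligned span: left for manual correction
    return min(tgt), max(tgt)
```

For instance, with alignments [(0, 0), (1, 2), (2, 3)], the source span (1, 2) projects to the target span (2, 3).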

C.3 Data Instances
In Table 8 we illustrate a full data instance (a conversation) with our annotation (the English version is shown).

Figure 1 :
Figure 1: Illustration of conversational aspect-based sentiment quadruple analysis (DiaASQ). The dialogue utterances produced by the corresponding speakers (marked at left) are organized into a replying structure.

Figure 8 :
Figure 8: Relative token distance calculation over the dialogue tree structure. To simplify the problem, we assume that each utterance has only one token.

Figure 10 :
Figure 10: The number of different types of cross-utterance quadruples, whose elements come from at least two different utterances. 'Replying' denotes that the two utterances have a replying relationship; 'Same speaker' indicates that the two utterances are spoken by the same person; 'Same thread' denotes that the two utterances belong to the same dialogue thread; 'Others' mainly covers very rare cases, e.g., a quadruple containing elements from different threads.

As the first case in Fig. 11 shows, two quadruples may contain the same target and opinion term. Such overlap information can actually provide valuable clues for better extraction. Fig. 11 shows all types of overlap cases in the Chinese-version dataset, and Table 6 reports their statistics.

Figure 13 :
Figure 13: An example of projection correction. The red dotted line denotes a manually added alignment relation.
Since some words are added, dropped, or merged during translation, the numbers of annotated items in the Chinese and English versions of the dataset differ slightly.
... into token representations. RoPE dynamically encodes the relative distance globally between utterances at the dialogue level. Introducing such distance information can help guide better ...

Table 3 :
Main results of the DiaASQ task. 'T/A/O' represent Target/Aspect/Opinion, respectively. All scores are averaged over five runs with different random seeds. Since ParaPhrase and Span-ASTE do not distinguish term types, we do not measure their span-match performance. Note that 'w/o PLM' indicates that we use randomly initialized word2vec embeddings to encode the text.
Figure 6: Results on different cross-utterance levels.

Table 5 :
Details of the hyper-parameter settings. Hyper-Parameters. Here we detail the experimental setups. The reported test results are obtained with the model tuned on the development set for the best development performance. Hyper-parameters are listed in Table 5. We adopt AdamW as the BERT optimizer. Our model is implemented with PyTorch and trained on Ubuntu 20.04 with an Intel i9 CPU and an NVIDIA RTX 3090 GPU.
Figure 11: Quadruple overlap in our DiaASQ dataset, including a total of six cases (e.g., "The screen of Xiaomi and the battery of Apple are their merits.").

Table 6 :
Statistics of overlapped quadruples. The second column of each row is the index of the subplot in Fig. 11.

Table 7 :
Two typical translation revision examples. The first is a token-level translation error correction; the second shows a more natural phrasing.