Seeking Common but Distinguishing Difference, A Joint Aspect-based Sentiment Analysis Model

The aspect-based sentiment analysis (ABSA) task consists of three typical subtasks: aspect term extraction, opinion term extraction, and sentiment polarity classification. These three subtasks are usually performed jointly to save resources and reduce error propagation in the pipeline. However, most existing joint models focus only on the benefits of encoder sharing between subtasks and ignore the differences between them. We therefore propose a joint ABSA model that not only enjoys the benefits of encoder sharing but also attends to these differences to improve effectiveness. In detail, we introduce a dual-encoder design, in which a pair encoder focuses on candidate aspect-opinion pair classification while the original encoder keeps its attention on sequence labeling. Empirical results show that our proposed model is robust and significantly outperforms the previous state of the art on four benchmark datasets.


Introduction
Sentiment analysis aims to retrieve sentiment polarity at three levels of granularity: document level, sentence level, and entity/aspect level (Liu, 2012), and is in urgent demand across several societal scenarios (Preethi et al., 2017; Cobos et al., 2019; Islam and Zibran, 2017; Novielli et al., 2018). Recently, the aspect-based sentiment analysis (ABSA) task (Pontiki et al., 2014), which focuses on extracting specific aspects from an annotated review, has attracted much attention from researchers; this paper mainly concerns the aspect/opinion term extraction and sentiment classification tasks. The latest benchmark, proposed by Peng et al. (2020), formulates the relevant information into a triplet: target aspect, opinion clue, and sentiment polarity. Thus, aspect term extraction becomes the task of Aspect Sentiment Triplet Extraction (ASTE). Similarly, when the relevant information is formulated into a pair of aspect term and sentiment polarity, the task is defined as Aspect Term Extraction and Sentiment Classification (AESC). Figure 1 shows an example of ASTE and AESC for the sentence "The view is spectacular, and the food is great."
Two early methods handle the triplet extraction task efficiently (Zhang et al., 2020a; Huang et al., 2021). Both are typically composed of a sequence representation layer that predicts the aspect/opinion term mentions and a classification layer that infers the sentiment polarity of the mention pairs predicted by the previous layer. However, as is often the case, such a design easily lets errors from the upstream prediction layer hurt the accuracy of the downstream layer during training.
To tackle this error-cascading phenomenon in pipeline models, a growing trend of jointly modeling these subtasks in one shot has appeared. One line of work proposed a joint model using a sequence tagging method based on bidirectional Long Short-Term Memory (LSTM) and Conditional Random Fields (CRF). However, this method becomes ineffective when a tagged mention belongs to more than one triplet. Zhang et al. (2020a) proposed a multi-task learning approach with the aid of dependency parsing on the tail-word pair of the corresponding aspect-opinion pair. However, this non-strict dependency parsing may fail to capture the structural information of term spans. Meanwhile, the many-target-to-one-opinion issue is not effectively handled.
The promising results achieved by machine reading comprehension (MRC) frameworks on many other NLP tasks (Li et al., 2019a) have also inspired the ASTE task. Mao et al. (2021) and Chen et al. (2021) attempted to design question-answer pairs in the MRC style to formulate triplet extraction. Nevertheless, both need to manually construct questions that correspond one-to-one with the extraction targets, which increases computational complexity.
Among these joint models, one approach transformed the sequence representation into a two-dimensional space and argued that word pairs could represent aspect-opinion pairs as the input of different encoders. Although this model showed significant improvement, it treated word pairs without taking the span boundaries of aspect and opinion terms into consideration and incorporated nonexistent pre-defined aspect-opinion pairs.
Considering the problems mentioned above, we propose a dual-encoder model based on a pre-trained language model, jointly fine-tuning multiple encoders on the ABSA task. Similar to prior work, our framework uses a shared sequence encoder to represent the aspect terms and opinion terms in the same embedding space. Moreover, we introduce a pair encoder to represent the aspect-opinion pair at the span level. Thus, our dual-encoder model can learn the ABSA subtasks individually while letting them benefit from each other in an end-to-end manner.
Experiments on benchmark datasets show that our model significantly outperforms previous approaches at the aspect level. We also conduct a series of experiments to analyze the gain of the additional representation from the proposed dual-encoder structure.
The contributions of our work are as follows: • We propose a jointly optimized dual-encoder model for ABSA that boosts performance on its subtasks.
• We apply an attention mechanism that allows information transfer between words, promoting the model's awareness of word pairs before inference.
• We achieve state-of-the-art performance on benchmark datasets at the time of submission.

Problem Formulation
In this paper, we split the ABSA task into two stages: aspect/opinion term extraction and sentiment classification (SC), as shown in Figure 1. The aspect/opinion term extraction subtask extracts the aspect terms (AT) and opinion terms (OT) in a sentence without considering sentiment polarities (SP). Furthermore, according to the sentiment polarity tagging style of the dataset, the SC subtask is divided into two categories: ASTE, which tags SP on AT and OT, and AESC, which tags SP only on AT.
In particular, we denote the sets of predefined aspect terms, opinion terms, and sentiment polarities as 𝒜𝒯, 𝒪𝒯, and 𝒮𝒫, respectively, where AT ∈ 𝒜𝒯, OT ∈ 𝒪𝒯, and SP ∈ 𝒮𝒫 = {POS, NEU, NEG}. Given a sentence s consisting of n tokens ω_1, ω_2, ..., ω_n, we denote T as the output of our model for the sentence. Specifically, for the ASTE task, T = {(AT, OT, SP)}, and for the AESC task, T = {(AT, SP)}.
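As a concrete rendering of these two output formats for the Figure 1 sentence (the spans and labels below are hand-written illustrations, not drawn from the datasets):

```python
# ASTE output: a set of (aspect term, opinion term, sentiment polarity) triplets.
aste_output = {("view", "spectacular", "POS"), ("food", "great", "POS")}

# AESC output: a set of (aspect term, sentiment polarity) pairs, i.e. the
# ASTE triplets with the opinion term dropped.
aesc_output = {(at, sp) for (at, _, sp) in aste_output}

assert aesc_output == {("view", "POS"), ("food", "POS")}
```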

Model Overview
Inspired by the work of Wang and Lu (2020), which utilizes a dual-encoder structure, our approach to the ABSA task is designed to subtly model the high affinity between aspect/opinion pairs and the ground truth by effectively leveraging the pair representation. As shown in Figure 2, our dual-encoder comprises two modules: (1) a sequence encoder, a Transformer network initialized with a pre-trained language model, which represents AT and OT with the corresponding context; and (2) a pair encoder, which encodes the aspect-opinion pair (for ASTE) or aspect-aspect pair (for AESC) for each sentiment polarity.

Token Representation
For a length-n input sentence s = ω_1, ..., ω_n, besides the word-level representation x^word, we also feed the characters of each word into an LSTM to generate the character-level representation x^char. Additionally, the pre-trained language model provides the contextualized representation x^plm. Finally, we concatenate these three representations of each word to feed into the dual-encoder: x_i = [x_i^word ; x_i^char ; x_i^plm].
In our proposed dual-encoder architecture, we still treat the ASTE/AESC task as a unified sequence tagging task, as in previous work: for a given sentence s, AT and OT on the main diagonal are annotated with B/I/O (Begin, Inside, Outside) tags, and each entry E_{i,j} of the upper triangular matrix denotes the pair (ω_i, ω_j) from the input sentence. Our work is partially motivated by this grid-tagging scheme but differs significantly.
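The input construction can be sketched as follows (the dimensions are toy values chosen for illustration; the paper's actual sizes appear in Appendix A.2):

```python
import numpy as np

def token_representation(x_word, x_char, x_plm):
    """Concatenate the word-, character-, and PLM-level vectors of one token,
    i.e. x_i = [x_i^word ; x_i^char ; x_i^plm]."""
    return np.concatenate([x_word, x_char, x_plm])

# Hypothetical dimensions: 100-d word vector, 50-d char-LSTM state, 768-d PLM state.
x_i = token_representation(np.ones(100), np.ones(50), np.ones(768))
assert x_i.shape == (918,)
```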
First, we improve the word-level pair representation to a span-level pair representation, feeding more accurate boundary information into our model. The tagging scheme of our model is illustrated in Figure 3, in which the main diagonal is filled with AT and OT tags, while the entries to the right of the main diagonal hold span pairs. Compared with word-level grid tagging, our method greatly reduces the redundancy caused by AT and OT tags to the right of the main diagonal.
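A hypothetical rendering of the tagging grid of Figure 3 for a short sentence (labels are illustrative): the main diagonal carries B/I/O tags for AT and OT, and only entries above the diagonal carry pair labels.

```python
tokens = ["The", "food", "is", "great"]
n = len(tokens)
grid = [["-"] * n for _ in range(n)]

# Main diagonal: "food" is an aspect term, "great" an opinion term.
grid[1][1] = "B-AT"
grid[3][3] = "B-OT"

# Upper triangle: the (food, great) span pair carries a positive polarity.
grid[1][3] = "POS"

assert grid[1][3] == "POS" and grid[3][1] == "-"  # lower triangle unused
```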
Second, we consider the context information in both dimensions of the two-dimensional space, as well as the historical information, with a recurrent neural network (RNN). In contrast, previous work merely adopted a single encoder based on DE-CNN (Xu et al., 2018)/BiLSTM/BERT (Devlin et al., 2019) to establish token representations, from which the final word-pair representation was formed. Thus, our dual-encoder can jointly encode AT, OT (with the corresponding context in both dimensions), and AT-OT pairs while sharing representation information.

Sequence Encoder
Following the previous work of Vaswani et al. (2017), we construct the sequence encoder as a Transformer network.
Here we apply a stack of m self-attention layers, as shown in Figure 2. Each layer consists of two sublayers, a multi-head attention sublayer and a feed-forward sublayer, each followed by a residual connection and layer normalization.

Multi-head Attention Sublayer
The token representation x_i is first fed into the multi-head attention sublayer.
At the first layer of our sequence encoder, the token representation x_i is mapped into query, key, and value vectors:

Q_i = W^Q x_i,  K_i = W^K x_i,  V_i = W^V x_i.

The value vectors of all positions are then aggregated according to the normalized attention weights to obtain the single-head representation:

head_i = Σ_j softmax_j((Q_i · K_j) / √d_k) V_j.

With multi-head attention, our model builds up the representation of the input sequence:

r_i = [head_i^1 ; ... ; head_i^h] W^O.

We adopt a residual connection and layer normalization (Ba et al., 2016) on r_i and x_i:

h_i = LayerNorm(x_i + r_i).
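The single-head computation can be sketched with NumPy as below (a minimal sketch under toy dimensions; the full model runs h heads in parallel with learned projection matrices):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def single_head_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a length-n sequence X (n x d).

    Mirrors the single-head equations above; multi-head attention would
    concatenate several such outputs and apply an output projection.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # (n, n) attention matrix
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
W = [rng.normal(size=(8, 4)) for _ in range(3)]
out = single_head_attention(X, *W)
assert out.shape == (5, 4)
```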

Feed-Forward Sublayer
The outputs of the multi-head attention sublayer are fed into a feed-forward network:

FFN(h_i) = W_2 max(0, W_1 h_i + b_1) + b_2,

where W_1, W_2 ∈ R^{d×d/m} and b_1, b_2 ∈ R^d. Finally, the sequence representation is obtained by layer normalization with a residual connection:

z_i = LayerNorm(h_i + FFN(h_i)).
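The feed-forward sublayer with its residual connection and layer normalization can be sketched as follows (toy shapes, with biases set to zero for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learned gain/bias)."""
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def ffn_sublayer(h, W1, b1, W2, b2):
    """Position-wise feed-forward sublayer with residual + layer norm,
    mirroring the equations above (a sketch; shapes are toy values)."""
    inner = np.maximum(0.0, h @ W1 + b1)    # ReLU(W1 h + b1)
    return layer_norm(h + inner @ W2 + b2)  # residual connection, then LN

rng = np.random.default_rng(1)
h = rng.normal(size=(5, 8))
out = ffn_sublayer(h, rng.normal(size=(8, 16)), np.zeros(16),
                   rng.normal(size=(16, 8)), np.zeros(8))
assert out.shape == (5, 8)
```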

Pair Encoder
As shown in Eq. (3), our task-specific pair representation is an n × n matrix of vectors, where the vector at row i and column j represents the pair of the i-th and j-th words of the input sentence. For the l-th layer of our network, we first apply a Multi-Layer Perceptron (MLP) with ReLU (Nair and Hinton, 2010) to contextualize the concatenation of representations from the sequence encoder:

S_{l,i,j} = MLP([h_{l,i} ; h_{l,j}]).

Then we utilize a multi-dimensional recurrent neural network (MDRNN) (Graves et al., 2007) with gated recurrent units (GRU) (Cho et al., 2014) to contextualize S_{l,i,j}; the contextualized pair representation P_{l,i,j} is computed iteratively from the hidden states of the neighboring cells. The pair encoder thus considers not only the word pairs at neighboring rows and columns but also those of the previous layer.
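The MLP step that builds the pair matrix can be sketched as below (a simplification: a single linear layer with ReLU over the concatenated token states; the MDRNN/GRU contextualization that follows is omitted):

```python
import numpy as np

def pair_matrix(H, W, b):
    """Build the n x n span-pair representation S from sequence states H (n x d),
    with S[i, j] = ReLU(W^T [h_i ; h_j] + b)  (a one-layer sketch of the MLP)."""
    n, _ = H.shape
    S = np.empty((n, n, W.shape[1]))
    for i in range(n):
        for j in range(n):
            S[i, j] = np.maximum(0.0, np.concatenate([H[i], H[j]]) @ W + b)
    return S

rng = np.random.default_rng(2)
H = rng.normal(size=(4, 6))             # 4 tokens, 6-d hidden states (toy)
S = pair_matrix(H, rng.normal(size=(12, 5)), np.zeros(5))
assert S.shape == (4, 4, 5)
```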

Training
Given a sentence s with gold tags AT, OT, and SP ∈ {POS, NEU, NEG}, we denote the AT or OT tag of token ω_i as a_i and the SP tag between tokens ω_i and ω_j as t_{ij}. To predict the posterior distribution over aspect/opinion term labels ŷ, we apply a softmax layer to the sequence embedding S_l; similarly, to obtain the distribution over sentiment polarity labels v̂, we apply a softmax to the pair representation P_l:

ŷ_i = softmax(W_term S_{l,i}),  v̂_{ij} = softmax(W_pola P_{l,i,j}),

where W_term and W_pola are learnable parameters. During training, we adopt cross-entropy as our loss function. For the gold aspect/opinion terms a_i ∈ 𝒜𝒯 ∪ 𝒪𝒯 and gold polarities t_{ij} ∈ 𝒮𝒫, the training losses are, respectively:

L_term = −Σ_i log ŷ_i[a_i],  L_pola = −Σ_{i,j} log v̂_{ij}[t_{ij}],

where y and v are the gold annotations of the corresponding terms.
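A toy sketch of the two cross-entropy losses and their sum (the shapes and random inputs are hypothetical):

```python
import numpy as np

def cross_entropy(logits, gold):
    """Mean negative log-likelihood of the gold labels under softmax(logits)."""
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(gold)), gold].mean()

# Hypothetical shapes: 5 tokens with 3 term labels, 4 pairs with 3 polarities.
rng = np.random.default_rng(3)
loss_term = cross_entropy(rng.normal(size=(5, 3)), np.array([0, 1, 2, 0, 1]))
loss_pola = cross_entropy(rng.normal(size=(4, 3)), np.array([2, 0, 1, 2]))
joint_loss = loss_term + loss_pola  # summed training objective
assert joint_loss > 0
```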
To jointly train the model, we use the summation of these two loss functions as our training objective.

Experiments

Data
To make a fair comparison with previous methods, we adopt two versions of datasets for the ASTE task: (1) ASTE-Data-V1, the original version, and (2) ASTE-Data-V2, a refined version with additional annotation of implicitly overlapping triplets. The name of each dataset is composed of two parts: the former denotes the year the corresponding SemEval data was proposed, and the latter is the domain of the reviews, restaurant service or laptop sales. Their data statistics are shown in Table 9. For the AESC task, we adopt the dataset annotated by Wang et al. (2017), which comprises three datasets; their statistics are shown in Table 10. The implementation details of our dual-encoder model are given in Appendix A.2 to keep the main text focused on our argument. Our code will be available at https://github.com/Betahj/PairABSA.

Results on the ASTE Task
On the ASTE task, we compare our model with the following baselines: RINANTE+ (Peng et al., 2020), CMLA+, Li-unified-R, Peng et al. (2020), OTE-MTL, JET, GTS, and Huang et al. (2021); details of these baseline models are listed in Appendix A.3.
The main results of all models on the ASTE task are shown in Table 1. Compared with the best baseline model (Huang et al., 2021), our BERT-based dual-encoder model achieves improvements of 1.39, 0.53, 0.68, and 2.92 absolute F1 points on the benchmark datasets. This signifies that our dual-encoder model captures the difference between the AT/OT extraction subtask and the SC subtask with the help of the additional pair encoder. Besides, our ALBERT-based model significantly outperforms all other competitive methods on most metrics of the four datasets (14Rest, 14Lap, 15Rest, and 16Rest), except for the precision score on 15Rest. Most notably, our ALBERT-based model improves over all baseline models by 6.66, 4.72, 9.08, and 4.49 absolute F1 points on the four benchmark datasets, respectively, demonstrating the superiority of our dual-encoder model. However, we notice that our precision score on 15Rest is only comparable to the best baseline, which might be because our model is more biased towards positive predictions; the F1 score nevertheless suggests an overall improvement.
A similar phenomenon, that our BERT-based dual-encoder model shows larger F1 improvements on 14Rest (1.39) and 16Rest (2.92) than on 14Lap (0.53) and 15Rest (0.68), is consistent with the previously reported explanation based on the large distribution differences of 14Rest and 15Rest. Nevertheless, we also observe a different phenomenon: our ALBERT-based dual-encoder model achieves larger F1 improvements on 14Rest (6.66) and 15Rest (9.08) than on 14Lap (4.72) and 16Rest (4.49), which challenges that explanation. From our perspective, this might be due to the different degrees of fit between the distribution of the ASTE-Data-V2 datasets and the corresponding pre-trained language models. Additionally, we evaluate our model on the ASTE-Data-V1 datasets, and the experimental results further demonstrate the effectiveness of our dual-encoder model; these results are shown in Table 8 of the Appendix.

Results on the AESC Task
For the AESC task, we compare our model with the following baselines: SPAN-BERT, IMN-BERT, RACL-BERT (Chen and Qian, 2020), and Mao et al. (2021); details are listed in Appendix A.3.
To investigate whether our model maintains on the AESC task the same efficiency as on the ASTE task, we conduct a series of experiments on the AESC datasets. Results of all models on the AESC task are shown in Table 2. Compared with the best baseline model of Mao et al. (2021), our model is not competitive except for the absolute F1 score on AE and OE of the 15Rest dataset. To isolate the contribution of our dual-encoder structure on the AESC task, we also evaluate a variant of our model without the pair encoder. From Table 2 we can see that on the AESC task our dual-encoder model performs only comparably to the single-encoder structure. The AESC task is a simplified version of the ASTE task that, conversely, omits AT/OT pairing and pairwise sentiment polarity classification, which is precisely the training objective served by the task-specific structure of our joint model. Consequently, our model does not function as well on the AESC task.

Different Pre-trained Language Models
We conduct experiments on the 14Lap subset of the ASTE-Data-V2 datasets to examine the performance of three frequently used pre-trained language models (PLMs): XLNet (Yang et al., 2019), RoBERTa (Liu et al., 2019), and ALBERT (Lan et al., 2020). Table 3 shows that ALBERT achieves the best result among these four PLMs. However, even with BERT, the PLM of the strongest baselines (Huang et al., 2021), our model still performs better. We also notice that, unlike most models, our model is sensitive to the choice of PLM. Specifically, the absolute F1 gaps between BERT and RoBERTa/ALBERT are 3.90 and 7.05, respectively. This demonstrates that our model's performance can be effectively boosted by the choice of PLM, and we thus choose ALBERT as our base encoder.
Table 3: Comparison of our model with different pre-trained language models on the 14Lap test set of ASTE-Data-V2.

Dual-encoder Structure
Therefore, a joint modeling method must take into consideration not only the degree of fit between individual modules and subtasks but also the differences between the modules.

Number of Encoder Layers
The results with different numbers of encoder layers are shown in Figure 4. Generally, the performance of triplet extraction increases with the number of encoder layers for both dataset distributions. Nevertheless, when the number of encoder layers exceeds 3, the performance shows a continuously decreasing trend, except that on 16Rest the performance increases by nearly 2.5 absolute F1 points when the number of layers reaches 7. Despite this inconsistency, and mainly considering computational and time complexity, we adopt 3 encoder layers.
Combining forward and backward GRUs along both dimensions of the pair representation results in four quadrants of the same dimension space. We observe that the quad-directional setting significantly outperforms the other two settings. It is also noteworthy that the performance gap between the bi-directional and uni-directional settings is much smaller than the gap between the quad-directional and bi-directional settings, which might explain why most previous work using bidirectional modeling cannot perform well. Thus, we choose the quad-directional setting for our multi-dimensional RNNs.

The Effect of Character-level Representation
To investigate the contribution of the character-level representation to our input sequence, we remove the character-level representation generated by the LSTM. The experimental results show that performance decreases by 0.44 absolute F1 points.

Case Study
To investigate why our model far exceeds the baseline models, we conduct a case study of typical cases from the 14Lap test set of ASTE-Data-V1, as shown in Table 6. From Example-1 ("Also stunning colors and speedy."), we observe that our model is able to handle the one-to-one case, although our dual-encoder structure is more biased towards the coordinative relation between "colors" and "speedy". Further cases we investigated demonstrate that our model performs slightly worse on one-to-one than on one-to-many and many-to-many relation types. From Example-2, we see that our model can tackle the one-opinion-to-many-target problem, whereas most previous works are unable to tackle even one-opinion-to-two-target cases. From Example-3, we observe that our model handles the one-target-to-many-opinion problem well; this case is neglected by most existing work but is important for triplet extraction, because many sentences express conflicting sentiments on a target, and a model will fail to recognize the opposite polarities of the same AT when an incorrect AT extraction happens. We also observe that our model accurately infers the boundary of the "OSX Lion" span, which demonstrates the usefulness of replacing words with spans. From Example-4, we notice that our model efficiently handles the complex many-opinion-to-many-target situation with long-range dependency, to which Zhang et al. (2020a) paid particular attention but did not solve well. This is due to incorporating the self-attention mechanism and the GRU in two dimensions, which makes our model sensitive to the differences captured by the proposed dual-encoder architecture. Collectively, these cases demonstrate the robustness of our dual-encoder model.

Related Work
Recently, NLP has developed rapidly (Li et al., 2019b; Jiang et al., 2020), and this progress has been furthered by deep neural networks (Parnow et al., 2021; Li et al., 2021a) and pre-trained language models (Li et al., 2021b; Zhang et al., 2020b). Aspect-based sentiment analysis was proposed by Pontiki et al. (2014) and has received much attention in recent years.

ASTE Task
The ASTE task, introduced by Peng et al. (2020), aims at triplet extraction of aspect terms, opinion terms, and sentiment polarity. In their work, they leveraged a sequence labeling method to extract aspect terms and target sentiment, and utilized graph neural networks to detect candidate opinion terms. Zhang et al. (2020a) proposed a multi-task framework that decomposes the original ASTE task into two subtasks: sequence tagging of AT/OT and word-pair dependency parsing. For joint learning, one line of work proposed a sequence tagging framework based on LSTM-CRF, while another constructed an encoder-decoder model with a grid representation of aspect-opinion pairs. Then, with the incorporation of more specific semantic information to guide the model, ASTE was transformed into an MRC task (Chen et al., 2021; Mao et al., 2021). Recently, Huang et al. (2021) proposed a sequence tagging-based model to perform representation learning on the ASTE task.

AESC Task
The AESC task is to perform aspect term extraction and sentiment classification simultaneously. Earlier work used a span-level sequence tagging method to tackle the huge-search-space and sentiment-inconsistency problems. Although the huge-search-space issue was thereby solved, a low-performance problem remained. Addressing this issue, Lin and Yang (2020) utilized a BERT encoder to contextualize the shared information of the target extraction and target classification subtasks; meanwhile, they used two BiLSTM networks to encode the private information of each subtask, which greatly boosted model performance.

Dual-encoder Structure
Productive efforts have been put into research on dual-encoder structures for natural language processing tasks in the last few years, owing to their natural ability to model tasks involving representational similarity maximization (Chidambaram et al., 2019; Yu et al., 2020; Bhowmik et al., 2021). Generally, these approaches encode a single component of the corresponding task separately for processing in the next phase. Recently, Wang and Lu (2020) proposed a sequence-table representation learning architecture for a typical triplet extraction task, relation extraction, establishing an example of tackling triplet extraction with a dual-encoder-based architecture.

Conclusion
In this paper, we observe significant differences between the AT/OT extraction subtask and the SC subtask of ABSA in joint models. To distinguish these differences while simultaneously keeping the shared parts between modules, we construct a dual-encoder framework with representation learning and a self-attention mechanism. In addition to the encoder-sharing approach, our dual-encoder framework captures the differences between the subtasks by interconnecting the encoders at each layer to share the critical information. Results on 8 benchmark datasets, with significant improvements over state-of-the-art baselines, verify the effectiveness of the proposed model.

Acknowledgement
We appreciate Wang and Lu for their open-sourced resources, on which we built our work on ABSA. We also appreciate the help of the reviewers and program chairs.

A.1 Evaluation Metric
We adopt the F1 score as our evaluation metric, as do the other baseline models. Precisely, we measure the F1 score of exact matches between predictions (AT/OT spans, AT/OT types, and the corresponding polarity) and gold triplets.
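A sketch of this exact-match evaluation over triplet sets (the example triplets below are hypothetical):

```python
def triplet_f1(pred, gold):
    """Exact-match precision, recall, and F1 between predicted and gold triplet
    sets; a triplet counts as correct only if span, type, and polarity all match."""
    pred, gold = set(pred), set(gold)
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = {("view", "spectacular", "POS"), ("food", "great", "POS")}
pred = {("view", "spectacular", "POS"), ("food", "great", "NEG")}
p, r, f1 = triplet_f1(pred, gold)
assert (p, r, f1) == (0.5, 0.5, 0.5)
```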

A.2 Implementation Details
For the token representation, we utilize 100-dimensional GloVe (Pennington et al., 2014) embeddings as initialization and restrict updates to the word embeddings. The hidden size is 200. The decay rate is 0.05, and the decay steps are 1000. Besides, to further boost the performance of our proposed model, we utilize ALBERT-xxlarge-v1 (Lan et al., 2020) as our pre-trained language model. We use Adam with a learning rate of 0.001 and update parameters with a batch size of 24. Training is limited to a preset maximum number of steps. All models are run on a TITAN RTX. More implementation details are listed in Table 7.

A.3 Baselines
We compare our model with the following baselines on the ASTE task. 1) RINANTE+ (Peng et al., 2020). RINANTE is modified from the model of Ma et al. (2018); RINANTE+ is an LSTM-CRF model that first uses dependency relations between words to extract opinions and aspects with their sentiment. All candidate aspect-opinion pairs, with position embeddings, are then fed into a Bi-LSTM encoder for the final classification.
2) CMLA+ (Peng et al., 2020). The model is adjusted from the one by Wang et al. (2017), which is an attention-based model, following the same two-stage processing with dependency relations as RINANTE+.
3) Li-unified-R (Peng et al., 2020). Li-unified-R utilizes the modulated multi-layer LSTM encoder of Li and Lu (2019) and adopts the same aspect-opinion pair classification as RINANTE+. 4) Peng et al. (Peng et al., 2020). This model adopts a GCN to capture dependency information and, in the second stage, uses the same strategy as RINANTE+ to fulfill triplet extraction. 5) OTE-MTL (Zhang et al., 2020a). A multi-task learning approach that incorporates word dependency parsing to boost the performance of triplet extraction. 6) JET. This model jointly extracts all the subtasks through a unified sequence labeling method; JET_t and JET_o denote its two different tagging forms. 7) GTS. A sequence tagging model that leverages an upper triangular matrix of property elements to model the extraction of aspect and opinion terms. 8) Huang et al. (Huang et al., 2021). The latest sequence labeling model, which utilizes a restricted attention field mechanism and represents word-word perceivable pairs for the final classification.
For the AESC task, we compare our model with the following baselines: 1) SPAN-BERT. A BERT-based model that utilizes span representations to perform the AESC task.
2) IMN-BERT. A multi-task learning model, modified from IMN, that utilizes BERT as the encoder to perform aspect term extraction and sentiment classification.
3) RACL-BERT (Chen and Qian, 2020). A multi-layer multi-task learning model with mutual information propagation to boost performance on the AESC task. 4) Mao et al. (Mao et al., 2021). A dual-MRC architecture that detects AT/OT and the corresponding sentiment polarity by means of a two-round query-answering approach.

A.4 Results on ASTE-Data-V1 for ASTE
Results on the ASTE-Data-V1 datasets also show the effectiveness of our model. Interestingly, on the 16Rest test set, the result of the ALBERT-based model is lower than that of the BERT-based model, which may be due to a domain mismatch between the test set and the pre-trained language model. Table 9 and Table 10 show the statistics of the datasets we used.