Making Flexible Use of Subtasks: A Multiplex Interaction Network for Unified Aspect-based Sentiment Analysis



Introduction
Aspect-based sentiment analysis (ABSA), a fine-grained task of text sentiment analysis (Liu, 2012), aims at summarizing opinions in user comments towards different targets (also known as aspects). ABSA generally consists of three specific sub-tasks, namely aspect term extraction (ATE), opinion term extraction (OTE) and aspect sentiment classification (ASC). For example, consider the sentence "The price is reasonable although the service is inferior.": ATE aims to extract the set of aspect terms from the sentence, i.e., {price, service}; OTE extracts the opinion words, i.e., {reasonable, inferior}; and ASC predicts the sentiment polarity for each aspect, which is positive for the first aspect price and negative for the second aspect service, respectively.
Prevailing solutions of ABSA treated ATE (Liu et al., 2015; Li and Lam, 2017; Angelidis and Lapata, 2018; Luo et al., 2019b; Ma et al., 2019), OTE (Wang and Pan, 2019) and ASC (Wang et al., 2016b; Chen et al., 2017; He et al., 2018; Li et al., 2018b; Xu et al., 2021) as separate tasks that were individually studied for decades. These separate tasks need to be integrated into a pipeline for practical use (Hu et al., 2019; Phan and Ogunbona, 2020). The key problem with pipeline approaches is that errors can accumulate and that the pipeline model fails to fully exploit the interactive relations among different sub-tasks (He et al., 2019). Some recent efforts remedy these issues by using joint learning to enhance the interactions among sub-tasks (Wang et al., 2018; He et al., 2019; Luo et al., 2019a; Chen and Qian, 2020; Peng et al., 2020) and achieved better performance than pipeline solutions. To name some, one line of work incorporated ATE and ASC and formulated the problem as a single sequence labeling task with a collapsed tagging scheme. (Luo et al., 2019a) treated ATE and ASC as two sequence labeling problems to mitigate the confused representation brought by the collapsed tags. More recently, (He et al., 2019; Chen and Qian, 2020; Peng et al., 2020) proposed to utilize OTE as an auxiliary task and further improved performance. Despite their success, existing solutions severely suffer from the following disadvantages: firstly, the approaches fusing ATE and ASC cannot benefit from OTE even when opinion term annotations exist; secondly, a joint learning model might not work when any of its sub-tasks is absent; thirdly, the interactive relations among sub-tasks are not appropriately explored. For instance, (Luo et al., 2019a) would fail to exploit OTE even if there exists labeled data for training.
(Chen and Qian, 2020; Peng et al., 2020) might become fragile when the sub-task of OTE or ATE is absent, due to their specially designed, complex interactions among the sub-tasks of ATE, OTE and ASC. In a nutshell, the key reason for these disadvantages is that existing studies only consider specific interactive relations among a fixed number of sub-tasks, but are inadequate to use the available sub-tasks flexibly.
In this paper, we propose a simple yet effective neural method named MIN (Multiplex Interaction Network) that can make flexible use of sub-tasks for the unified ABSA task. The flexibility is twofold: first, the number of sub-tasks is not restricted and any related sub-task can be integrated into the unified framework; second, the interactive relations among distinct sub-tasks are flexibly modeled in an explicit manner.
To tackle the major challenge of how to manage ABSA with an arbitrary number of sub-tasks and allow sub-tasks to share interactive information in a unified learning manner, we divide the sub-tasks of ABSA into two categories, namely extractive sub-tasks (e.g. ATE and OTE) and classification sub-tasks (e.g. ASC), and formulate each sub-task as a sequence labeling problem. MIN is built upon a multi-task learning framework with a shared-private scheme (Collobert and Weston, 2008). Sub-tasks first jointly extract low-level shared features using a multi-layer CNN encoder or pre-trained models. Then the private features of each extractive sub-task are independently learned by a multi-layer CNN network with a pairwise attention mechanism. Such pair-attention can capture bidirectional interactions between any two extractive sub-tasks. Moreover, we devise a consistency-weighting mechanism to exploit unidirectional interactions from extractive sub-tasks to classification sub-tasks and assist the private feature extraction of classification sub-tasks. The proposed interaction mechanisms (including pair-attention and consistency-weighting) give the model the flexibility to handle different ABSA situations with different numbers of sub-tasks of different types. In addition, MIN adopts an information feedback mechanism that first aggregates information from all available tasks and then propagates useful information back to individual tasks, allowing different sub-tasks to positively influence each other. Experiments on two widely used ABSA benchmarks with different numbers of sub-tasks demonstrate that the proposed MIN outperforms both pipeline and joint learning SOTA baselines in various settings.

Related Work
Separate learning mainly focuses on one of the sub-tasks in ABSA, e.g. aspect term extraction (ATE), opinion term extraction (OTE) or aspect sentiment classification (ASC). For the ATE task, previous methods can be divided into unsupervised and supervised models. In unsupervised methods, (Liu et al., 2015) utilized rules about grammatical dependency relations between opinion words and aspects for aspect term extraction. (Luo et al., 2019b) presented a neural framework that leverages sememes to enhance lexical semantics for long-tailed aspect extraction. Another study utilized the capability of coupling global and local representations to discover aspect terms. Supervised methods (Wang et al., 2016b; Li and Lam, 2017; Angelidis and Lapata, 2018; Ma et al., 2019) usually formulated the task as a sequence tagging problem, and various neural networks with attention mechanisms were proposed to solve it. For ASC, (Liu and Zhang, 2017; Cheng et al., 2017; He et al., 2018; Tang et al., 2019; Liang et al., 2019; Lei et al., 2019) attempted to exploit the contextual and positional proximity of aspect terms for prediction with attentional neural networks. (Tian et al., 2020) proposed to learn a unified sentiment representation for different sentiment analysis tasks. Recently, capsule networks (Chen and Qian, 2019) and graph convolutional networks were also utilized in ASC and achieved SOTA performance. These separate learning approaches may have disadvantages in practical applications, as they need to be pipelined and the interactions between different sub-tasks are totally neglected.
Joint learning strives to combine sub-tasks of ABSA into a unified learning process. For example, some studies proposed to handle ATE and ASC in a pipeline or an integrated model. The pipeline models (Hu et al., 2019; Phan and Ogunbona, 2020) are extract-then-classify processes that solve the two tasks successively. However, they still suffer from error accumulation. Among integrated models, (Wang et al., 2018) solved ATE and ASC with collapsed tagging, a unified tagging scheme that links the two tasks. (Luo et al., 2019a) considered the relationship between the two tasks and attempted to pass useful information from one task to the other. Some works (Dai and Song, 2019; Zhao et al., 2020) integrated ATE and OTE in the same framework, showing that these two tasks can benefit from each other. More recent methods (He et al., 2019; Chen and Qian, 2020; Peng et al., 2020) proposed to inject OTE as an auxiliary task to further improve the performance of ABSA. However, the number of sub-tasks and the interactions among them in existing integrated methods are fixed, which can be restrictive when the sub-tasks vary in practice.

Task Definition
All the sub-tasks related to ABSA are categorized into extractive and classification sub-tasks in our framework. The extractive sub-tasks extract meaningful subsequences of sentences, such as aspect terms and opinion terms. The classification sub-tasks classify parts of sentences into different classes, e.g. sentiment polarities, genres, etc. We formulate all sub-tasks as sequence labeling problems in our joint learning framework. Given a sequence of tokens X = {x_1, x_2, ..., x_n}:
• An extractive sub-task T is to predict a tag sequence Y^T = {y^T_1, y^T_2, ..., y^T_n} for the sentence X, where y^T_i ∈ {B, I, O}. Specifically, {B, I, O} denote the beginning of, inside of and outside of the extracted targets. For example, the first and the second rows of Table 1 demonstrate the sequence tags of aspect term extraction (ATE) and opinion term extraction (OTE), respectively. In the example, "food" and "orange juice" are annotated as aspect terms, and "good" and "horrific" are labeled as opinion terms.
• A classification sub-task C is to predict a label sequence Y^C = {y^C_1, y^C_2, ..., y^C_n}, where y^C_i is one of the class labels. y^C_i will be marked as "-" if the token x_i does not belong to any category. For example, the last row of Table 1 demonstrates the labels of the aspect sentiment classification (ASC) task for the example sentence. In this example, "food" is labeled as "POS", and "orange" as well as "juice" is labeled as "NEG", according to the ground-truth sentiment labels of the two aspect terms. Other tokens are labeled as "-" as they do not have gold labels in this task.
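To make the tagging scheme concrete, here is a minimal Python sketch (illustrative only, not the paper's code) that encodes the Table 1 example and recovers term spans from BIO tags:

```python
# Sequence-labeling view of the three sub-tasks for the Table 1 sentence.
tokens = ["Food", "is", "pretty", "good", "but", "the",
          "orange", "juice", "is", "horrific", "."]
ate = ["B", "O", "O", "O", "O", "O", "B", "I", "O", "O", "O"]   # aspect terms
ote = ["O", "O", "O", "B", "O", "O", "O", "O", "O", "B", "O"]   # opinion terms
asc = ["POS", "-", "-", "-", "-", "-", "NEG", "NEG", "-", "-", "-"]

def spans(tags, toks):
    """Recover text spans from a BIO tag sequence."""
    out, cur = [], []
    for tag, tok in zip(tags, toks):
        if tag == "B":
            if cur:
                out.append(" ".join(cur))
            cur = [tok]
        elif tag == "I" and cur:
            cur.append(tok)
        else:
            if cur:
                out.append(" ".join(cur))
            cur = []
    if cur:
        out.append(" ".join(cur))
    return out

print(spans(ate, tokens))  # ['Food', 'orange juice']
print(spans(ote, tokens))  # ['good', 'horrific']
```

Note how the ASC labels are non-"-" exactly on the aspect-term positions, which is the label consistency that the model later exploits.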

Model Overview
For convenience of explanation, we hereafter take two extractive sub-tasks, namely ATE and OTE, and a classification sub-task, ASC, to describe our MIN approach. The overall architecture of MIN is shown in Figure 1. MIN consists of a shared encoding layer, several feature extraction modules for the different sub-tasks, and an information feedback mechanism.
The private feature extraction modules accept the same representation from the shared encoding layer and then extract their private task-oriented features with task-specific multi-layer CNNs. In order to enhance flexibility, MIN conducts task interactions in two facets. Firstly, we exploit the bidirectional interaction between two different extractive sub-tasks with the proposed pair-attention. Using pair-attention, extractive sub-tasks can exchange helpful clues with each other during feature extraction. Secondly, we use a consistency-weighting mechanism to perform the unidirectional interaction from an extractive sub-task to a classification sub-task. In this way, features highlighting information from extractive sub-tasks are passed to classification sub-tasks and assist their feature extraction. Then every sub-task predicts the corresponding sequence labels, i.e., Ŷ^A, Ŷ^O and Ŷ^S, with its decoding layer. The model also adopts an information feedback mechanism that concatenates the representations of all sub-tasks to fine-tune the shared representations. In the following, we first describe the MIN model in more detail and then illustrate the learning process.

Shared Representation Generation
For a sequence of tokens {x_1, x_2, ..., x_n}, we map the word sequence with either pre-trained double word embeddings (Xu et al., 2018) or a pre-trained BERT encoder to obtain the shared representation H^shared ∈ R^{n×d_e}.

Feature Extraction for Extractive Sub-tasks

Multi-layer CNN
For the extractive sub-tasks, we use the multi-layer CNN structure proposed by (Xu et al., 2018) to learn the private features of each task separately. Specifically, there are a number of 1D-convolution filters in each CNN layer, and each filter has a fixed kernel size k = 2c + 1. As a result, each filter performs a convolution operation on a window of k word representations and computes the representation of the i-th word along with the 2c nearby words in its context. We extract the private features H^A of ATE and H^O of OTE by H^A = MC(H^shared) ∈ R^{n×d_p} and H^O = MC(H^shared) ∈ R^{n×d_p}, where MC indicates the multi-layer CNN operator (with separate parameters per task) and d_p denotes the dimension of the private features for extractive sub-tasks.
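A minimal NumPy sketch of the MC(.) operator described above; the toy dimensions and random weights are ours for illustration, and a real implementation would use a deep learning framework's 1D-convolution layers:

```python
import numpy as np

def conv1d_layer(H, W, b):
    """One 1D-convolution layer over token representations.
    H: (n, d_in); W: (k, d_in, d_out) with odd kernel size k = 2c + 1."""
    k, d_in, d_out = W.shape
    c = k // 2
    Hp = np.pad(H, ((c, c), (0, 0)))        # zero-pad so output keeps length n
    out = np.zeros((H.shape[0], d_out))
    for i in range(H.shape[0]):
        window = Hp[i:i + k]                # i-th word plus 2c neighbours
        out[i] = np.maximum(0, np.einsum("kd,kdo->o", window, W) + b)  # ReLU
    return out

def multi_layer_cnn(H, layers):
    """Stack of conv layers: the MC(.) operator."""
    for W, b in layers:
        H = conv1d_layer(H, W, b)
    return H

rng = np.random.default_rng(0)
n, d_e, d_p, k = 7, 8, 6, 5                 # toy sizes; d_p = private dim
layers = [(rng.standard_normal((k, d_e, d_p)) * 0.1, np.zeros(d_p)),
          (rng.standard_normal((k, d_p, d_p)) * 0.1, np.zeros(d_p))]
H_shared = rng.standard_normal((n, d_e))
H_A = multi_layer_cnn(H_shared, layers)     # private ATE features, (n, d_p)
```

Each extractive sub-task would hold its own `layers` parameters while consuming the same `H_shared`.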

Pair-attention
In order to exploit potential interactions, we develop a pair-attention mechanism to exchange helpful clues between ATE and OTE based on their semantic correlations. For ATE, the pair-attention matrix A^{A←O} ∈ R^{n×n} is computed as A^{A←O} = softmax(H^A W^{A←O} (H^O)^T). (2) By applying a weighted sum of the pair-attention to the private features H^O of OTE, we get an interactive feature matrix H^{A←O} for ATE: H^{A←O} = A^{A←O} H^O. (3) Similarly, according to Eq. 2, we exchange H^A and H^O to compute the pair-attention matrix A^{O←A} ∈ R^{n×n}, then use A^{O←A} and H^A to get the interactive feature matrix H^{O←A} for OTE by Eq. 3. In this way, the model exploits bidirectional interactions between the two sub-tasks. After pair-attention, we concatenate the private and the interactive features of the same sub-task, and then feed them to a fully-connected layer to predict the sequence tags, e.g., Ŷ^A = softmax(FC([H^A; H^{A←O}])). Note that for other extractive sub-tasks, we can also compute the pair-attention with related extractive sub-tasks in a similar manner.
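The bidirectional pair-attention can be sketched in NumPy as follows; the bilinear scoring form H^A W (H^O)^T is our assumption about how the semantic correlations are computed, and the toy sizes are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pair_attention(H_tgt, H_src, W):
    """Row-wise attention A = softmax(H_tgt W H_src^T); the interactive
    features for the target task are the weighted sum A @ H_src."""
    A = softmax(H_tgt @ W @ H_src.T, axis=-1)   # (n, n), rows sum to 1
    return A @ H_src                            # (n, d_p)

rng = np.random.default_rng(1)
n, d_p = 7, 6
H_A = rng.standard_normal((n, d_p))             # private ATE features
H_O = rng.standard_normal((n, d_p))             # private OTE features
W = rng.standard_normal((d_p, d_p)) * 0.1
H_A_from_O = pair_attention(H_A, H_O, W)        # clues from OTE into ATE
H_O_from_A = pair_attention(H_O, H_A, W.T)      # and the reverse direction
fused_A = np.concatenate([H_A, H_A_from_O], axis=-1)  # fed to the FC decoder
```

Because the mechanism only needs two feature matrices of the same shape, any pair of extractive sub-tasks can interact this way, which is the source of the claimed flexibility.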

Feature Extraction for Classification Sub-tasks
Recall that a classification sub-task may benefit from extractive sub-tasks. For instance, knowing the locations of aspect terms will enhance the matching between aspect terms and their corresponding sentiment polarities. For example, the second row and the last row of Table 1 have consistent labels, since we expect to perform sentiment prediction for aspect terms rather than for other words. Similarly, the sentiment polarities of aspect terms are mainly influenced by surrounding opinion words. For example, we can find "NEG" labels near the opinion word "horrific" in the example sentence shown in Table 1.

Consistency-weighting
Inspired by the above observations, we devise a consistency-weighting mechanism to learn private features of ASC task.
First, we compute a set of weights to pass information about salient aspect and opinion terms to ASC. Specifically, the weights of a token i w.r.t. the aspect and opinion terms are computed from the predicted term positions, where k denotes the first k tokens most likely to be aspect or opinion terms, a_j and o_j are the indexes of the j-th aspect and opinion term in the sentence, and α, α′, β and β′ are pre-specified constants. We multiply the weights w^A and w^O with the interactive features of ATE and OTE to get a distilled representation of every token. The shared features are then concatenated with this distilled representation, and the private features of ASC are extracted by the same multi-layer CNN algorithm as in the ATE and OTE sub-tasks.
By consistency-weighting, the information of aspect and opinion terms is highlighted during the feature extraction for the classification sub-task.
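A toy illustration of the consistency-weighting idea. The paper's exact weighting formula (involving the constants α, α′, β, β′) is not reproduced here; this sketch simply assigns a high weight to the top-k likely term positions and a low weight elsewhere, which captures the "highlight the terms" intent:

```python
import numpy as np

def consistency_weights(scores, k, high, low):
    """Simplified stand-in for the paper's weighting: tokens among the
    top-k most likely aspect/opinion positions get weight `high`,
    all other tokens get weight `low`."""
    top = np.argsort(scores)[-k:]           # indexes of the k best scores
    w = np.full_like(scores, low, dtype=float)
    w[top] = high
    return w

# Toy per-token "is an aspect term" scores from the ATE decoder
# (11 tokens, mirroring the Table 1 sentence).
ate_scores = np.array([0.9, 0.1, 0.1, 0.2, 0.1, 0.1, 0.8, 0.7, 0.1, 0.2, 0.1])
w_A = consistency_weights(ate_scores, k=3, high=1.0, low=0.2)

# The weights scale the interactive ATE features token by token,
# producing a distilled representation with term positions highlighted.
H_A_int = np.ones((11, 4))                  # dummy interactive features
distilled = w_A[:, None] * H_A_int          # (n, d)
```

The distilled representation is then concatenated with the shared features before the ASC multi-layer CNN.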

Self-Attention
We adopt a self-attention mechanism to learn long-distance dependency information in a sentence for the classification sub-task. Specifically, the self-attention matrix A^{S↔S} is computed as A^{S↔S}_{ij} = softmax_j(H^S_i W^{S↔S} (H^S_j)^T · 1/|i−j|), where H^S_i W^{S↔S} (H^S_j)^T and 1/|i−j| denote the semantic relevance and the distance weighting between H^S_i and H^S_j, respectively. After getting the outputs of the self-attention layer, we exploit a fully-connected layer with softmax activation as the decoder, Ŷ^S = softmax(FC(A^{S↔S} H^S)), where Ŷ^S is the predicted label sequence for ASC.
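A sketch of the distance-aware self-attention; combining the semantic relevance and the 1/|i−j| term by multiplication (and treating the diagonal distance as 1) is our assumption:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def distance_aware_self_attention(H, W):
    """Scores combine a bilinear relevance term H_i W H_j^T with a
    1/|i-j| distance discount, so far-away token pairs attend less."""
    n = H.shape[0]
    idx = np.arange(n)
    dist = np.abs(idx[:, None] - idx[None, :]).astype(float)
    dist[dist == 0] = 1.0                   # avoid division by zero on the diagonal
    A = softmax((H @ W @ H.T) / dist, axis=-1)
    return A @ H                            # attended features, (n, d)

rng = np.random.default_rng(2)
H_S = rng.standard_normal((7, 6))           # private ASC features
W = rng.standard_normal((6, 6)) * 0.1
out = distance_aware_self_attention(H_S, W)
```

The attended features `out` would then go through the fully-connected softmax decoder to produce the ASC label sequence.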

Information Feedback Mechanism
MIN exploits an information feedback mechanism to interactively fine-tune the shared feature representations. We denote by t the number of feedback rounds. In each round, we utilize the hidden representation before the decoder of every sub-task to update the shared features. Specifically, we concatenate these hidden representations with the shared feature matrix, and then use a fully-connected layer with softmax activation to squeeze the vectors back to a fixed dimension: H^shared_{(t+1)} = softmax(FC([H^shared_{(t)}; H^A; H^O; H^S])), (10) where t denotes the t-th feedback update.
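One feedback round can be sketched as follows; the weight shapes and the exact concatenation order are our assumptions, and the toy sizes are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def feedback_update(H_shared, H_A, H_O, H_S, W, b):
    """One feedback round: concatenate the pre-decoder representations of
    all sub-tasks with the current shared features, then squeeze back to
    the shared dimension with a fully-connected layer (softmax activation,
    as described in the text)."""
    cat = np.concatenate([H_shared, H_A, H_O, H_S], axis=-1)  # (n, d_e + 3*d_p)
    return softmax(cat @ W + b, axis=-1)                      # (n, d_e)

rng = np.random.default_rng(3)
n, d_e, d_p, T = 7, 8, 6, 2
H_shared = rng.standard_normal((n, d_e))
H_A = rng.standard_normal((n, d_p))         # pre-decoder ATE representation
H_O = rng.standard_normal((n, d_p))         # pre-decoder OTE representation
H_S = rng.standard_normal((n, d_p))         # pre-decoder ASC representation
W = rng.standard_normal((d_e + 3 * d_p, d_e)) * 0.1
b = np.zeros(d_e)
for _ in range(T):                          # T feedback rounds (t tuned by grid search)
    H_shared = feedback_update(H_shared, H_A, H_O, H_S, W, b)
```

In the full model, the sub-task representations would themselves be recomputed from the updated shared features in each round.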

Training Procedure
The overall loss L consists of cross-entropy losses for all sub-tasks according to final predicted sequence labels.
The loss of each sub-task T is the token-level cross-entropy L^T = −(1/N) Σ_{i=1}^{N} Σ_{j=1}^{n_i} y^T_{ij} log ŷ^T_{ij}, where N denotes the number of training instances and n_i denotes the number of tokens in the i-th instance. The overall loss is the weighted sum of the sub-tasks' losses, L = a L^A + b L^O + c L^S, where a, b and c are task coefficients, and we set all of them to 1. Following (He et al., 2019), during model training we only compute the ASC loss on tokens related to aspect terms, i.e., if a token is not part of an aspect term, we ignore the predicted sentiment on it. We minimize L and determine a suitable t for the information feedback mechanism by grid search during the experiments.
In Table 3, all methods are divided into pipeline and unified models. The best scores are in bold face and the second best ones are underlined. "*" denotes that the results are taken from the original papers or (Chen and Qian, 2020).
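The masked ASC loss described above (cross-entropy computed only on aspect tokens) can be sketched as:

```python
import numpy as np

def token_ce(probs, gold, mask):
    """Masked token-level cross-entropy: positions with mask 0
    (e.g. non-aspect tokens in ASC) are ignored."""
    eps = 1e-12
    losses = -np.log(probs[np.arange(len(gold)), gold] + eps)
    return (losses * mask).sum() / max(mask.sum(), 1)

# Toy ASC example: 4 tokens, 3 sentiment classes;
# only tokens 0 and 2 belong to aspect terms.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.2, 0.2, 0.6],
                  [0.3, 0.3, 0.4]])
gold = np.array([0, 1, 2, 0])
asc_mask = np.array([1.0, 0.0, 1.0, 0.0])   # ASC loss only on aspect tokens
L_S = token_ce(probs, gold, asc_mask)       # averages -log(0.7) and -log(0.6)
# Overall loss: weighted sum with task coefficients a = b = c = 1,
# L = a * L_A + b * L_O + c * L_S
```

The extractive losses L_A and L_O use the same function with an all-ones mask, since every token carries a {B, I, O} tag.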

Datasets
We adopt two widely used datasets from SemEval-2014 Task 4 (Pontiki et al., 2014) in our experiments, which contain reviews about restaurants and laptops; the statistics are shown in Table 2.
Note that the original datasets have ground-truth labels for aspect terms and the corresponding sentiment polarities, while the labels for opinion terms are annotated by (Wang et al., 2016a).

Compared Methods
We compare our method with recent ABSA methods. To demonstrate the flexibility of our method, we divide the baselines into two categories, namely double-task (ATE and ASC) and triple-task (ATE, OTE and ASC) methods, considering that existing methods are usually designed for fixed sub-tasks. Double-task methods contain (1) pipeline methods: we take the top-performing method DECNN (Xu et al., 2018) for ATE, and TNet (Li et al., 2018a) and TCaps (Chen and Qian, 2019) for ASC, to construct two pipeline models following (Chen and Qian, 2020). For fair comparison, we also construct a pipeline setting of our proposed MIN, denoted PIPELINE-MIN, which trains ATE and ASC separately.
(2) unified methods: MNN (Wang et al., 2018) and E2E-ABSA jointly solve ATE and ASC using collapsed tagging schemes, while DOER (Luo et al., 2019a) treats the two tasks as two sequence labeling problems and utilizes a dual cross-shared RNN.
Triple-task methods aim to solve ATE, OTE and ASC simultaneously, including (1) pipeline methods: following (Chen and Qian, 2020), we combine CMLA (Xu et al., 2018) for ATE with TNet and TCaps for ASC to construct two pipeline baselines, where OTE is integrated into ATE. (2) unified methods: IMN (He et al., 2019) is an interactive multi-task model jointly trained on ATE and ASC, where OTE is also integrated into ATE; RACL (Chen and Qian, 2020) is trained on ATE, OTE and ASC in parallel and considers four relations among the three sub-tasks.

Settings
Following (He et al., 2019), we adopt double embedding in the word embedding layer of MIN, where each word embedding is the concatenation of a general embedding (Pennington et al., 2014) with 300 dimensions and a domain embedding (Xu et al., 2018) with 100 dimensions. We set the hidden sizes d_e = 400, d_h = 300 and d_p = 300, and the kernel size and the number of shared CNN layers to 5 and 2, respectively. The number of information feedback rounds is set to 2 on Res14 and 3 on Lap14. The numbers of multi-layer CNN layers for ATE, OTE and ASC are set to {2, 2, 1}, and the kernel size is 5 on both datasets. The Adam optimizer (Kingma and Ba, 2015) with a learning rate of 1e-4 and a batch size of 8 is used for all datasets.
We also combine MIN with BERT-Large to get MIN-BERT, where {d_e, d_h, d_p} is set to {1024, 300, 300} with a learning rate of 1e-5 chosen by grid search. The numbers of multi-layer CNN layers for ATE, OTE, ASC and the number of information feedback rounds are set to {1, 1, 1, 2} on both datasets. The other parameters are the same as the MIN model with double embedding. We also equip IMN and RACL with BERT-Large for further comparison. We use the same metrics as (He et al., 2019), i.e., AE-F1, OE-F1, AS-F1 and O-F1, representing macro F1 scores for ATE, OTE, ASC and the overall performance of complete ABSA. We select the model with the best O-F1 on the development set for testing.

Main Results

Tables 3 and 4 demonstrate the performance of our method against the compared double-task (i.e., ATE and ASC) and triple-task (i.e., ATE, OTE and ASC) methods, respectively. In Table 4, methods are divided into pipeline models, unified models and models based on BERT-Large; "*", bold face and underlines have the same meanings as in Table 3. We have several consistent observations from the two tables. Firstly, most unified models perform better than pipeline models, which confirms the advantage of exploiting interactions between sub-tasks. Secondly, our MIN can be flexibly applied to either the double-task or the triple-task scenario and achieves better performance than all the compared methods on O-F1. We conjecture the reason is that the proposed generic interaction mechanisms among sub-task types are independent of any specific sub-task, so MIN works well regardless of the number or type of sub-tasks. Thirdly, we enhance some compared models with word embeddings provided by BERT-Large (Devlin et al., 2018) and report the corresponding results in Table 4. Our MIN-BERT also shows superiority, especially on O-F1.

Flexibility
To further investigate the flexibility of MIN, we compare our MIN with variants of strong competitors among the triple-task methods by varying the combination of sub-tasks. We remove one of the three sub-tasks and get three different combinations, namely "ATE+OTE", "ATE+ASC" and "OTE+ASC". Note that the double-task methods in Table 3 cannot be easily adapted for this comparison due to their special design for specific sub-task combinations. Our MIN outperforms both IMN and RACL in most of the cases, while the two baselines cannot work well when the sub-task combination varies, since both of them designed specialized interaction mechanisms for a fixed number of sub-tasks. For example, IMN adopts OTE as an auxiliary task for ATE and thus cannot provide results on "OTE+ASC". RACL, which designs four kinds of collaborative relations among ATE, OTE and ASC, degrades when any one sub-task is unavailable. In contrast, our MIN separates the sub-tasks into different categories and leverages two category-level interaction mechanisms. This yields a trade-off in which the interactions among the remaining sub-tasks are still functional even though one sub-task is removed. That is where the flexibility of our model comes from.

Ablation Test
In order to verify the effectiveness of the different interaction mechanisms in our model, we conduct ablation tests on the "ATE+OTE+ASC" unified ABSA task. We remove the interaction mechanisms between sub-tasks in turn to observe their effectiveness. As the O-F1 results in Table 6 show, all interaction mechanisms are effective in MIN, as expected. For example, on Res14, the scores of the model without consistency-weighting and of the model without self-attention drop considerably. This supports the belief that sentiment polarities are mainly influenced by the corresponding aspect and opinion words.

Case Study
Next, we illustrate two cases generated by IMN, RACL and our MIN for further investigation in Table 7. In Case (1), IMN extracts "choice" as an extra aspect without considering the relations between ATE and OTE. In addition, RACL incorrectly identifies "choice" and "craving" as aspect and opinion terms. We conjecture the possible reason is that "choice" is paired with "exquisite" as an aspect and "craving" is recognized as a term expressing opinions by the specifically designed interaction relations in RACL. However, our proposed pair-attention in MIN is more general, and correctly extracts all the aspect terms and opinion terms. Case (2) explores the effectiveness of our proposed consistency-weighting. Specifically, although IMN extracts the correct opinion word "too sweet", the final sentiment of "mole sauce" is still predicted wrongly because the opinion terms provide only weak help. Meanwhile, although RACL exploits the relation between OTE and ASC, it recognizes "sweet" incompletely, which might be because RACL cannot update representations without an information feedback mechanism. As a result, "sweet" interferes with the sentiment prediction and leads to a wrong polarity for "mole sauce". In contrast, MIN correctly predicts the sentiment polarities by using consistency-weighting based on the complete aspect and opinion terms.

Times of Information Feedback
Finally, we investigate the sensitivity of the parameter t in the information feedback mechanism by examining the O-F1 scores on Res14 and Lap14 as t increases. We observe that MIN achieves better performance when t = 2 or t = 3, and there is no considerable improvement with a further increase of t, since redundant iterations might lead to overfitting.

Conclusion
In this paper, we propose a novel neural model, MIN, to make flexible use of sub-tasks for the unified ABSA task. We introduce two sub-task-agnostic interaction mechanisms, i.e., pair-attention and consistency-weighting, to exploit interactions among different sub-tasks, and an information feedback mechanism to fine-tune the shared feature representations. Experiments with different numbers of sub-tasks show the flexibility of MIN. The proposed framework can potentially be applied to similar tasks such as named entity recognition and semantic relation extraction.

(Caption of Table 7: incorrect predictions are marked; the ground-truth aspect and opinion terms are shown in red and blue in brackets, and the sentiment polarity labels of the aspect terms appear as corresponding subscripts.)