Cross-Domain Review Generation for Aspect-Based Sentiment Analysis

Supervised learning methods have proven to be effective for Aspect-Based Sentiment Analysis (ABSA). However, the lack of fine-grained labeled data hinders their effectiveness in many domains. To address this issue, unsupervised domain adaptation methods are desired to transfer knowledge from a labeled source domain to any unlabeled target domain. In this paper, we propose a new domain adaptation paradigm called cross-domain review generation (CDRG), which aims to generate target-domain reviews with fine-grained annotation based on the source-domain labeled reviews. To achieve this goal, we propose a two-step approach as a concrete realization of CDRG. It first converts a source-domain review to a domain-independent review by masking its source-specific attributes, and then converts the domain-independent review to a target-domain review with a masked language model pre-trained in the target domain. We further propose two ways to leverage the generated target-domain reviews for two cross-domain ABSA tasks. Extensive experiments demonstrate the superiority of our CDRG-based approaches over the state-of-the-art domain adaptation methods.


Introduction
Aspect-Based Sentiment Analysis (ABSA) has received considerable attention in recent years (Liu, 2020). The goal of ABSA is to extract the aspect terms mentioned in review sentences and predict the sentiments over them (Pontiki et al., 2016). For example, given the review "The fish soup is delicious", the aspect term and its sentiment are fish soup and Positive, respectively. With the recent advances in deep learning, many supervised neural models have been proposed for several ABSA tasks, e.g., aspect extraction (Liu et al., 2015; Xu et al., 2018), aspect-level sentiment classification (Wang et al., 2016b; Tang et al., 2016), and End-to-End ABSA (Zhang et al., 2015; Li et al., 2019a).
Although these neural models have obtained promising results in several product domains such as Laptop and Restaurant (Pontiki et al., 2014), the major obstacle to them is the lack of rich annotated resources in many new domains. Since ABSA requires fine-grained annotation of aspect terms and their sentiments in each review, it is extremely time-consuming to develop such resources for each domain, and the annotation process can be prohibitively expensive. To alleviate the annotation efforts, unsupervised domain adaptation methods are desired to transfer knowledge from a source domain with rich labeled data to a target domain with only unlabeled data.
The key challenge of domain adaptation is that the data distribution of the source domain usually differs from that of the target domain. To alleviate the domain discrepancy, many approaches have been proposed in coarse-grained sentiment classification based on the following two paradigms:
• feature-based adaptation, which aims to learn a domain-invariant feature representation across domains (Blitzer et al., 2007; Glorot et al., 2011; Ziser and Reichart, 2018; Ghosal et al., 2020);
• instance-based adaptation, which focuses on re-weighting labeled instances in the source domain for use in the target domain (Mansour et al., 2008; Dredze et al., 2010).

Figure 1: Cross-Domain Review Generation with applications to the End-to-End ABSA task, in which the goal is to generate a target-domain review with fine-grained annotation given a labeled review in the source domain.
However, both paradigms suffer from two limitations: (1) although they can reduce the domain discrepancy by learning shared representations or re-weighting source instances, the supervision signals for their main task solely come from the labeled source domain; (2) both of them lack interpretability, as the shared representations or re-weighted instances offer little transparency regarding the knowledge transferred to the target domain.
To address the two limitations, we propose a new domain adaptation paradigm named Cross-Domain Review Generation (CDRG) with applications to the ABSA task. Given a labeled review in the source domain, the goal is to generate a target-domain review with fine-grained annotation, which converts the domain-specific attributes (e.g., aspects, opinions, and collocations) to the target domain while preserving its annotation and remaining contents. For example, in Fig. 1, a labeled review from the Laptop domain is transferred to the Restaurant domain by converting its source-specific attributes to target-specific attributes (e.g., Macbook to fish soup, lightweight to delicious, etc.) while keeping the other words and the labels unchanged. Different from existing text generation tasks, CDRG is challenging for several reasons. First, there is no parallel corpus that aligns labeled source-domain reviews with unlabeled target-domain reviews. Second, given only non-parallel corpora, it is non-trivial to align the domain-specific attributes across domains. More importantly, the generated target review is required to carry the fine-grained annotation.
To achieve this goal, we propose a simple yet effective two-step approach, consisting of a domain generalization step and a domain specification step. Specifically, the domain generalization step first identifies important domain-specific attributes, and then masks the source-specific attributes in each source review to obtain a domain-independent review. Next, given the domain-independent review as input, the domain specification step employs a masked language model pre-trained in the target domain to generate a target review. In our two-step approach, the domain-independent review serves as a bridge to achieve word-to-word alignments between source and target reviews, and thus the fine-grained annotation of the source review can be directly transferred to the target review.
We further propose two training strategies to leverage the generated target-domain reviews for two cross-domain ABSA tasks, namely cross-domain End-to-End ABSA (E2E-ABSA) and cross-domain aspect extraction (AE). Experimental results on four benchmark datasets show that, using only our generated target-domain reviews, the baseline BERT model already outperforms the state-of-the-art domain adaptation methods, and a joint usage of the generated target reviews and the labeled source reviews further boosts the performance significantly.
Our main contributions can be summarized as follows: • We propose a new domain adaptation paradigm named Cross-Domain Review Generation (CDRG) with applications to the ABSA task, and then devise a simple yet effective two-step approach as a concrete realization of CDRG.
• With the help of generated target-domain reviews, our best training strategy outperforms the state-of-the-art method by an absolute improvement of 2.83% and 4.47% on Micro-F1 for cross-domain E2E-ABSA and cross-domain AE, respectively.
• As long as a source domain has sufficient annotated reviews, our two-step approach for CDRG can generate many annotated target reviews, which offer interpretable justification for domain adaptation.

Related Work
As two important tasks in ABSA, aspect extraction (Liu et al., 2015; Poria et al., 2016; Wang et al., 2016a; Li et al., 2018a; Xu et al., 2018) and aspect-level sentiment classification (Dong et al., 2014; Tang et al., 2016; Wang et al., 2016b; Yang et al., 2017; Ma et al., 2017) have been extensively studied in the literature. For practical applications, a number of recent studies handle them together in an end-to-end manner, in which many supervised learning methods with discrete linear features (Mitchell et al., 2013) and continuous neural features (Zhang et al., 2015; Li et al., 2019a) have been proposed. Despite obtaining promising results, their main limitation lies in the lack of annotated data in many new domains.
To address this data sparsity problem, unsupervised domain adaptation methods are desired.
Most existing domain adaptation studies focus on coarse-grained sentiment classification to learn domain-invariant representations, including pivot-based methods (Blitzer et al., 2007; Pan et al., 2010; Yu and Jiang, 2016), auto-encoders (Chen et al., 2012; Zhuang et al., 2015), semi-supervised methods (He et al., 2018; Ye et al., 2020), and domain adversarial networks (Ganin et al., 2016; Li et al., 2018b). Besides, another line of work focuses on re-weighting source instances to automatically find useful source samples for the target domain (Jiang and Zhai, 2007; Mansour et al., 2008; Dredze et al., 2010; Xia et al., 2014). Due to the challenges in fine-grained adaptation, there exist only a few studies for cross-domain aspect and opinion extraction (Li et al., 2012; Ding et al., 2017; Wang and Pan, 2018) or End-to-End ABSA (Li et al., 2019b; Gong et al., 2020). However, these methods still follow the traditional domain adaptation paradigms to either learn shared representations or perform instance weighting. Different from these methods, we propose to accomplish domain adaptation for ABSA based on Cross-Domain Review Generation.

Problem Formulation
In this paper, we consider two ABSA tasks, i.e., End-to-End ABSA (E2E-ABSA) and Aspect Extraction (AE). Following Li et al. (2019b), we formulate both tasks as sequence labeling problems. Formally, given a sequence of input tokens x = {w_1, w_2, ..., w_n}, its label sequence is denoted by y = {y_1, y_2, ..., y_n}. Let y_j ∈ {B-POS, I-POS, B-NEG, I-NEG, B-NEU, I-NEU, O} be the label space for the E2E-ABSA task, and y_j ∈ {B, I, O} be the label space for the AE task. Cross-Domain ABSA. We focus on the unsupervised domain adaptation setting, in which labeled data are only available from the source domain. Specifically, we assume access to a set of labeled reviews from the source domain D_S = {(x_i^s, y_i^s)}_{i=1}^{N_s}, and another set of unlabeled reviews from the target domain D_U = {x_j^u}_{j=1}^{N_u}. The goal is to predict the label sequence for the test data in the target domain.
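As a concrete illustration of the sequence labeling formulation above, the following sketch (a hypothetical helper, not part of the paper) converts a review with an annotated aspect span into an E2E-ABSA label sequence:

```python
# Toy illustration of the E2E-ABSA sequence labeling format.
# Each token receives a tag from {B-POS, I-POS, B-NEG, I-NEG, B-NEU, I-NEU, O}.

def to_e2e_labels(tokens, aspects):
    """aspects: list of (start, end, sentiment) spans, end exclusive."""
    labels = ["O"] * len(tokens)
    for start, end, senti in aspects:
        labels[start] = f"B-{senti}"
        for i in range(start + 1, end):
            labels[i] = f"I-{senti}"
    return labels

tokens = ["The", "fish", "soup", "is", "delicious"]
# "fish soup" spans tokens 1..2 and carries positive sentiment
print(to_e2e_labels(tokens, [(1, 3, "POS")]))
# ['O', 'B-POS', 'I-POS', 'O', 'O']
```

For the AE task, the same helper applies with a single dummy "sentiment" so that all spans reduce to plain B/I tags.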

Methodology
Overview. In this work, we propose a new domain adaptation paradigm named Cross-Domain Review Generation (CDRG). The goal is to transform a labeled source review to a labeled target review by converting its source-specific attributes to target-specific attributes while retaining the remaining contents and its labels. To achieve this goal, we propose a simple yet effective two-step approach, as shown in Fig. 2. In the domain generalization step, we first extract the domain-specific attributes from the labeled source data and the unlabeled target data. For each labeled source review, we mask its source-specific attributes to obtain a domain-independent review. In the domain specification step, a pre-trained BERT model is first re-trained with masked language modeling (MLM) on the unlabeled data from the target domain. Next, given the domain-independent review, we employ the target-domain MLM to generate a target review. As the domain-independent review serves as a bridge to achieve word-to-word alignments between the source and generated target reviews, the source labels can be directly transferred to the target review. Finally, we propose two strategies to leverage labeled target reviews for E2E-ABSA and AE tasks, including independent training and merge training.

Proposed Two-Step Approach for CDRG
Our proposed approach for CDRG consists of two steps: (1) a domain generalization step to convert each source-domain review to a domain-independent review; (2) a domain specification step to convert a domain-independent review to a target-domain review.

Step 1: Domain Generalization
To convert each source-domain review to a domain-independent review, we first extract the domain-specific attributes from each domain, and then mask the source-specific attributes in the source reviews. Domain-Specific Attribute Extraction. We define domain-specific attributes as words, phrases, syntactic structures, and expression styles that only occur in the source domain or the target domain. However, it is often challenging to identify some attributes such as syntactic structures and expression styles, as most of them are implicitly expressed in the review. More importantly, since ABSA aims to jointly extract the aspects and sentiments, aspect and opinion terms tend to play more crucial roles than the other attributes. Therefore, we only consider domain-specific aspect terms and opinion terms as domain-specific attributes in this paper.
To extract aspect and opinion terms from the unlabeled target domain, we use a dependency-relation-based unsupervised method named Double Propagation (Qiu et al., 2011). Specifically, given the unlabeled target reviews D_U, we first resort to a sentiment lexicon to extract the opinion terms in D_U, and use the conj relation to expand the opinion term list. Next, we employ these opinion terms as seed words, and extract all the words holding the amod and nsubj relations towards any seed word. We then treat the extracted words as aspect terms, and use the nn relation to expand the aspect term list. The above three steps are iterated until the aspect and opinion term lists no longer change.
Given the labeled source reviews D_S, we can easily obtain the aspect term list, as the aspect terms have been annotated in each review for both the E2E-ABSA and AE tasks. For the opinion terms, we again utilize Double Propagation to expand the opinion term list.
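The bootstrapping loop above can be sketched as follows (a simplified, hypothetical re-implementation over toy dependency triples; the actual Double Propagation algorithm of Qiu et al. (2011) uses more relation patterns and handles relation directions more carefully):

```python
# Simplified Double Propagation over (head, relation, dependent) triples.
# Seeds: opinion words from a sentiment lexicon; amod/nsubj link opinion
# words to candidate aspect terms; conj expands the opinion list; nn
# expands multi-word aspects.

def double_propagation(triples, seed_opinions):
    opinions, aspects = set(seed_opinions), set()
    changed = True
    while changed:                      # iterate until both lists are stable
        changed = False
        for head, rel, dep in triples:
            # opinion --conj--> opinion: expand the opinion list
            if rel == "conj" and head in opinions and dep not in opinions:
                opinions.add(dep); changed = True
            # amod/nsubj between an opinion word and a noun: extract aspect
            if rel in ("amod", "nsubj"):
                if head in opinions and dep not in aspects:
                    aspects.add(dep); changed = True
                elif dep in opinions and head not in aspects:
                    aspects.add(head); changed = True
            # aspect --nn--> noun: expand multi-word aspect terms
            if rel == "nn" and head in aspects and dep not in aspects:
                aspects.add(dep); changed = True
    return aspects, opinions

# "The fish soup is delicious and cheap"
triples = [("delicious", "nsubj", "soup"),   # the soup is delicious
           ("delicious", "conj", "cheap"),   # delicious and cheap
           ("soup", "nn", "fish")]           # fish soup
aspects, opinions = double_propagation(triples, {"delicious"})
print(sorted(aspects), sorted(opinions))
# ['fish', 'soup'] ['cheap', 'delicious']
```

The fixed point is reached once a full pass over the triples adds no new term, mirroring the stopping criterion described above.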
After obtaining the aspect and opinion term lists for each domain, we remove all the domain-independent terms that occur in both the source and target domains, and obtain the domain-specific term lists. Let us use A_s and O_s to denote the source-specific aspect and opinion term lists, and A_t and O_t the target-specific aspect and opinion term lists. Domain-Independent Reviews. Given each review in the source domain x_s = {w_1, w_2, ..., w_n}, if a sub-sequence of it is a source-specific attribute in either A_s or O_s, we substitute each word in the sub-sequence with a special token [MASK] to obtain a domain-independent review, denoted by x_m. An example is shown in Fig. 2.
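The masking step admits a very direct implementation; the following is a hypothetical minimal sketch in which multi-word terms are matched greedily, longest first:

```python
# Mask every source-specific aspect/opinion term (A_s ∪ O_s) in a review,
# producing a domain-independent review x_m. Each word of a matched
# (possibly multi-word) term becomes [MASK].

def mask_source_attributes(tokens, source_terms):
    terms = sorted((t.split() for t in source_terms), key=len, reverse=True)
    out, i = [], 0
    while i < len(tokens):
        for term in terms:
            if tokens[i:i + len(term)] == term:   # greedy longest-first match
                out.extend(["[MASK]"] * len(term))
                i += len(term)
                break
        else:
            out.append(tokens[i])
            i += 1
    return out

print(mask_source_attributes("The keyboard is dead".split(),
                             {"keyboard", "dead"}))
# ['The', '[MASK]', 'is', '[MASK]']
```

The label sequence is untouched by this step, which is what later allows the source annotation to be copied onto the generated target review position by position.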

Step 2: Domain Specification
The domain specification step is responsible for incorporating target-specific attributes into each domain-independent review x_m. In our work, we propose to formulate it as a text infilling problem. To address this, we pre-train a Transformer network with the masked language modeling (MLM; Devlin et al., 2019) task on the unlabeled data in the target domain, and then predict a target-specific word for each masked token in x_m. Target-Domain MLM. Since the size of the unlabeled target-domain data D_U is typically small, we adopt a pre-trained BERT model (Devlin et al., 2019) and re-train it with MLM on D_U. Specifically, we create training instances by randomly replacing a subset of tokens with [MASK] in each unlabeled review x_u, and the objective is to recover the masked tokens based on the hidden states from BERT.
With the target-domain MLM (TD-MLM), we can infill each masked position in x_m based on its context. Let us use M = {m_1, m_2, ..., m_K} to denote the indexes of the masked tokens in x_m, where K refers to the number of masked tokens. The predicted word for the j-th masked token can be computed as follows:

o_{m_j} = argmax_{w ∈ V} P(w_{m_j} = w | x_m),   (1)

where V denotes the whole vocabulary, and o_{m_j} is the output word with the highest probability. However, TD-MLM suffers from two major limitations: (1) the predicted token with the highest probability in Eqn. (1) may not be an aspect term or an opinion term; (2) each masked token is predicted independently, and thus it is possible that the predicted tokens for two consecutive masked positions are not coherent. Target-Specific Aspect Constraint. To tackle these two limitations, we first propose to utilize the target-specific aspect terms (i.e., A_t) extracted in Section 4.1.1 as vocabulary constraints to limit the prediction space of each masked aspect term.
Specifically, if the masked aspect term corresponds to a single-word term in the source review (e.g., keyboard), the word selection in Eqn. (1) can be modified as follows:

o_{m_j} = argmax_{w ∈ A_t^1} P(w_{m_j} = w | x_m),   (2)

where A_t^1 refers to the set of single-word aspect terms in A_t. Otherwise, if the masked aspect term corresponds to a multi-word term, we compute the joint word probabilities of each multi-word term in A_t, followed by re-ranking them. Let us use k to denote the number of consecutive masked tokens, and m_{j:j+k} to denote the span of the masked aspect term. The word selection for the k consecutive masked tokens can thus be computed as follows:

o_{m_{j:j+k}} = argmax_{w_{m_{j:j+k}} ∈ A_t^k} ∏_{l=j}^{j+k} P(w_{m_l} | x_m),   (3)

where w_{m_{j:j+k}} ranges over A_t^k, the set of k-word aspect terms in A_t. Target-Specific Opinion Constraint. For each masked opinion term, it is important to keep the sentiment consistent when predicting its corresponding target-specific opinion term.
To achieve this, we resort to the Double Propagation algorithm, which relies on the sentiment lexicon to assign a sentiment (i.e., Positive, Negative, or Neutral) to each aspect term and opinion term in a bootstrapping manner. Based on its output, we obtain the sentiments of all the source-specific and target-specific opinion terms (i.e., O_s and O_t) extracted in Section 4.1.1.
Next, we look up the sentiment of the masked source-specific opinion term, and then utilize all the target-specific opinion terms with the same sentiment in O_t as vocabulary constraints in Eqn. (2) and Eqn. (3) to generate the target-specific opinion terms. For example, in Fig. 2, since dead is a source-specific negative opinion term, we use all the single-word target-specific opinion terms with negative sentiment as vocabulary constraints in Eqn. (2) to generate the opinion term tasteless. Generated Target-Domain Reviews. Based on Eqn. (2) and Eqn. (3), we can infill the masked positions in each domain-independent review x_m, and obtain the generated target-domain review, denoted by x_g. It is worth noting that if a source-domain review x_s does not contain any source-specific attributes, its generated target-domain review x_g will be identical to x_s.
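Under toy token probabilities, the constrained word selection of Eqns. (2) and (3), together with the sentiment-consistency filter for opinion terms, can be sketched as follows (a hypothetical illustration; a real implementation would read these probabilities from the TD-MLM's softmax outputs):

```python
# Constrained infilling: restrict the MLM's argmax to target-specific
# terms (Eqns. (2) and (3)); a probs dict maps candidate words to
# P(w at a masked position | x_m).

def fill_single(probs, candidates):
    """Eqn. (2): argmax over single-word target-specific terms."""
    return max(candidates, key=lambda w: probs.get(w, 0.0))

def fill_multi(span_probs, candidates):
    """Eqn. (3): re-rank k-word terms by their joint (product) probability."""
    def joint(term):
        p = 1.0
        for pos_probs, w in zip(span_probs, term.split()):
            p *= pos_probs.get(w, 0.0)
        return p
    return max(candidates, key=joint)

def sentiment_candidates(masked_term, src_sentiment, tgt_sentiment):
    """Keep only target-specific opinion terms with a matching sentiment."""
    polarity = src_sentiment[masked_term]            # e.g. "NEG" for "dead"
    return {w for w, s in tgt_sentiment.items() if s == polarity}

# Single masked aspect position, constrained to A_t^1:
print(fill_single({"soup": 0.4, "pizza": 0.1}, {"soup", "pizza"}))   # soup

# Two consecutive masked positions, constrained to A_t^2:
span_probs = [{"fish": 0.5, "hot": 0.3}, {"soup": 0.6, "dog": 0.2}]
print(fill_multi(span_probs, {"fish soup", "hot dog"}))              # fish soup

# Masked opinion term "dead" (negative): restrict to negative O_t terms.
cands = sentiment_candidates("dead", {"dead": "NEG"},
                             {"tasteless": "NEG", "delicious": "POS"})
print(fill_single({"tasteless": 0.3, "delicious": 0.6}, cands))      # tasteless
```

Note that even though "delicious" has the higher raw probability in the last call, the sentiment filter removes it from the candidate set, which is exactly the behavior the opinion constraint is designed to enforce.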
With the domain-independent review x_m, our two-step approach essentially achieves word-to-word alignments between x_s and x_g. Therefore, we can directly employ the label sequence of x_s as the fine-grained annotation of x_g. Fig. 2 shows a label transferring example for the E2E-ABSA task.
Formally, we use D_G = {(x_i^g, y_i^s)}_{i=1}^{N_s} to denote the set of the generated target-domain reviews.

Post-Generation Training for Main Tasks
After obtaining the set of generated target-domain reviews D_G, we further propose two strategies to leverage them to train effective E2E-ABSA and AE models for the target domain. Independent Training. An intuitive strategy is to treat D_G alone as the training instances, and directly train a sequence labeling model over them. Following Gong et al. (2020), we adopt the pre-trained BERT model as the text encoder, followed by fine-tuning it on D_G. Merge Training. Since the quality of the generated target-domain reviews relies on the target-domain MLM as well as the aspect and opinion terms extracted by Double Propagation, it is inevitable that D_G contains a number of aspect and opinion terms with incorrect annotations. Therefore, we propose to merge the labeled source reviews D_S with D_G as the training instances, which may alleviate the annotation noise in D_G. Similar to Independent Training, a BERT-based sequence labeling model is trained over the merged corpus D_S ∪ D_G.
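The two strategies differ only in the training corpus they feed to the same BERT-based tagger; a hypothetical sketch of the data preparation:

```python
# Each example is (tokens, labels). D_G carries labels copied from the
# source reviews; Merge Training simply concatenates D_S and D_G.

def build_training_set(d_s, d_g, strategy="merge"):
    if strategy == "independent":
        return list(d_g)              # train on generated target reviews only
    elif strategy == "merge":
        return list(d_s) + list(d_g)  # D_S ∪ D_G, diluting noise in D_G
    raise ValueError(strategy)

d_s = [(["The", "keyboard", "is", "dead"], ["O", "B-NEG", "O", "O"])]
d_g = [(["The", "soup", "is", "tasteless"], ["O", "B-NEG", "O", "O"])]
print(len(build_training_set(d_s, d_g, "independent")))  # 1
print(len(build_training_set(d_s, d_g, "merge")))        # 2
```

The intuition is that the clean source labels act as a regularizer against the occasional incorrect annotation in D_G.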

Experiment Settings
Datasets. We use four benchmark datasets, i.e., Laptop (L), Restaurant (R), Device (D), and Service (S). L is from the SemEval-2014 ABSA challenge (Pontiki et al., 2014), containing user reviews from the laptop domain. R refers to a combination of the restaurant datasets from the SemEval ABSA challenges of 2014, 2015, and 2016 (Pontiki et al., 2014, 2015, 2016). D is the union of all the digital device reviews collected by Toprak et al. (2010). S contains reviews from web services, introduced by Hu and Liu (2004). The basic statistics are shown in Table 1.
Settings. We carry out experiments on 10 transfer pairs among the four domains above. Following previous work (Li et al., 2019b; Gong et al., 2020), we remove D→L and L→D, as the two domains are very similar. For each transfer pair, the training data is a combination of the labeled training data in the source domain and the unlabeled training data in the target domain. We report the evaluation results on the test data from the target domain. For a fair comparison with previous work, we use the Micro-F1 score with exact match as the evaluation metric, which means that a predicted label span is counted as correct only if it exactly matches the gold labels.
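The exact-match criterion means that a predicted span counts only if both its boundaries and its label agree with a gold span; a hypothetical sketch over a single set of spans (in practice the counts are aggregated over the whole test corpus before computing F1):

```python
# Exact-match F1 over extracted spans. Each span is (start, end, label);
# for E2E-ABSA the label carries the sentiment, for AE it is just B/I chunks.

def exact_match_f1(gold_spans, pred_spans):
    gold, pred = set(gold_spans), set(pred_spans)
    if not gold or not pred:
        return 0.0
    tp = len(gold & pred)                 # exact boundary + label matches
    precision = tp / len(pred)
    recall = tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [(1, 3, "POS"), (5, 6, "NEG")]
pred = [(1, 3, "POS"), (5, 7, "NEG")]    # second span boundary is off
print(round(exact_match_f1(gold, pred), 2))   # 0.5
```

The off-by-one boundary in the second predicted span yields no credit at all, which is why exact-match Micro-F1 is a strict metric for these tasks.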

Compared Systems & Hyperparameters
To show the effectiveness of our Cross-Domain Review Generation (CDRG)-based methods, we consider the following compared systems:
• DP (Qiu et al., 2011): the unsupervised Double Propagation method detailed in Section 4.1.1.
• Hier-Joint (Ding et al., 2017): an LSTM-based domain adaptation method with syntactic rule-based auxiliary tasks for cross-domain AE.
• RNSCN (Wang and Pan, 2018): a recursive neural structural correspondence network based on syntactic structures and auto-encoders.
• AD-SAL (Li et al., 2019b): a Selective Adversarial Learning method to achieve local semantic alignments for fine-grained domain adaptation.
• BERT_B and BERT_E: directly fine-tuning two versions of pre-trained encoders on the labeled source domain. BERT_B is the uncased BERT base model from Devlin et al. (2019), and BERT_E is another uncased BERT base model from Xu et al. (2019), which is pre-trained on E-commerce reviews from the Amazon Electronics dataset (He and McAuley, 2016) and the Yelp Challenge.
• BERT_B-UDA and BERT_E-UDA (Gong et al., 2020): our recent unified feature- and instance-based domain adaptation method based on BERT_B and BERT_E, respectively.
Besides, we consider the following variants of our CDRG-based domain adaptation methods:
• BERT_B-CDRG-X: re-training BERT_B with the MLM task to obtain the Target-Domain MLM in Section 4.1.2, and employing BERT_B as the base model for the two training strategies in Section 4.2. Here X is either independent training or merge training.
• BERT_E-CDRG-X: replacing the BERT_B model in BERT_B-CDRG-X with BERT_E.
For re-training the MLM task in our two-step approach for CDRG, we employ the Adam optimizer (Kingma and Ba, 2014) with a batch size of 32 and a learning rate of 3e-5. For the two training strategies in Section 4.2, we also adopt the Adam optimizer, in which the learning rate, the dropout rate, and the batch size are set to 5e-5, 0.1, and 32, respectively, after a grid search over the combinations of [2e-5, 8e-5], [0.1, 0.3], and {16, 32, 64}. These hyperparameters are tuned on 10% randomly held-out training data from the source domain. All the experiments are run on an Nvidia GTX 1080 Ti GPU.

Main Results for Cross-Domain ABSA
We report the comparison results on the cross-domain E2E-ABSA task in Table 2, and make the following observations: (1) Comparing Indep Training with all the baseline methods, we can see that solely fine-tuning BERT_E on our generated target-domain reviews can already outperform all the existing domain adaptation approaches on average, including our recent unified feature- and instance-based adaptation (UDA) method. This demonstrates the usefulness of our CDRG-based approach.
(2) By merging the generated target-domain reviews with the labeled source reviews, our Merge Training strategy further boosts the average performance over Indep Training, and outperforms the state-of-the-art UDA approach by an absolute improvement of 3.00% and 2.83% based on BERT_B and BERT_E, respectively. All these observations verify the superiority of our CDRG-based approach over the previous feature- and instance-based adaptation methods.
Similar to the results on cross-domain E2E-ABSA, Table 3 shows that our Indep Training strategy obtains comparable performance to the state-of-the-art UDA method based on BERT_B, and achieves significantly better performance than UDA based on BERT_E. Moreover, our Merge Training strategy consistently achieves the best average performance on the cross-domain AE task, outperforming UDA by an absolute improvement of 2.63% and 4.47% based on BERT_B and BERT_E, respectively. This demonstrates the general effectiveness of our CDRG-based domain adaptation methods.

Ablation Study of Our Two-Step Approach for CDRG
To investigate the effectiveness of our two-step approach for CDRG, we conduct an ablation study of our target-domain masked language model with constraints (TD-MLM-C) approach, and consider the following variants: (1) BERT_E: the pre-trained BERT_E model without re-training MLM on the unlabeled target-domain data; (2) TD-MLM: the target-domain MLM without the vocabulary constraints.

Table 4: Example target-domain reviews generated from BERT_E, TD-MLM, and TD-MLM-C. Check and cross marks indicate whether the generated target-specific attributes are correct or incorrect. P and N denote the positive and negative sentiment. The blue and red colors refer to the domain-specific aspect terms and opinion terms, respectively. ## denotes that the original word is split into several sub-tokens by the tokenizer of BERT.

First, we verify the closeness between the generated target-domain review set and the real test set from the target domain. Specifically, we employ BERT_E to obtain the sentence representation of each review in the two sets, and then compute the distance between the two sets with Maximum Mean Discrepancy (Gretton et al., 2012). Based on the results in Table 5, we observe that all three methods consistently reduce the discrepancy between the source and target domains. This shows that our two-step approach is generally useful for domain adaptation. Moreover, it is clear that the distance between the review set from TD-MLM-C and the test set is the smallest, which implies that the distribution of its generated reviews is closest to that of the target domain.
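With a linear kernel, the MMD distance used above reduces to the distance between the mean sentence embeddings of the two sets; a hypothetical sketch (the paper's setting uses BERT_E sentence representations, here replaced by toy vectors):

```python
import math

# Linear-kernel MMD between two sets of sentence embeddings:
# MMD = || mean(X) - mean(Y) ||. Smaller values mean the generated
# reviews are distributionally closer to the real target-domain test set.

def mmd_linear(xs, ys):
    dim = len(xs[0])
    mean_x = [sum(v[i] for v in xs) / len(xs) for i in range(dim)]
    mean_y = [sum(v[i] for v in ys) / len(ys) for i in range(dim)]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(mean_x, mean_y)))

generated = [[1.0, 3.0], [3.0, 1.0]]   # toy "sentence representations"
test_set = [[2.0, 2.0], [2.0, 2.0]]
print(mmd_linear(generated, test_set))  # 0.0 (identical means)
```

Gretton et al. (2012) define MMD with general kernels over richer feature maps; the linear-kernel special case shown here is the simplest instance and suffices to illustrate the set-level comparison.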
Second, we further treat the generated reviews as training instances for our Indep Training strategy, and compare their results on the cross-domain E2E-ABSA and AE tasks. From the results in Table 6, we can see that using the reviews from TD-MLM-C consistently achieves the best results, outperforming the other methods by a significant margin.

Manual Evaluation on Generated Target-Domain Reviews
Since there is no ground-truth target review for each source review, we randomly select 200 source reviews from the L→R and R→L transfer pairs, and manually evaluate the generated reviews in terms of coherence, label correctness, and domain-specific criteria. Based on our manual evaluation, we observe that our TD-MLM-C approach generates better target reviews than its two ablation systems for 152 source reviews; for the remaining 48 source reviews, we cannot determine a winning method, as the three methods generate equally meaningful (or equally meaningless) target-domain reviews. Table 4 shows four representative examples. We find that BERT_E is generally insensitive to the target domain, and may still generate source-specific terms (e.g., touch screen in S2 and flavorful in S3). TD-MLM produces better reviews, as it tends to convert source-specific terms to the target domain. However, it still suffers from generating non-aspect or non-opinion terms (e.g., the prices in S1 and It in S4). In contrast, with the vocabulary constraints, TD-MLM-C successfully converts all the source-specific attributes to target-specific attributes in the four cases.
All these observations verify the importance of TD-MLM and the vocabulary constraints in our two-step approach for cross-domain review generation.

Conclusion
In this paper, we propose a new domain adaptation paradigm named Cross-Domain Review Generation (CDRG) with applications to the ABSA task. Specifically, we first propose a two-step approach to generate labeled target-domain reviews based on labeled source-domain reviews for CDRG, and then propose two training strategies to leverage the generated reviews for two cross-domain ABSA tasks. Experiments on four benchmark datasets demonstrate that our CDRG-based approaches significantly outperform existing methods on the cross-domain E2E-ABSA and cross-domain AE tasks.