Bridge-Based Active Domain Adaptation for Aspect Term Extraction

Aspect term extraction is a fine-grained task, so its annotation cost is extremely high. Recent attempts alleviate this issue with domain adaptation, which transfers common knowledge across domains. Since most aspect terms are domain-specific, they cannot be transferred directly. Existing methods solve this problem by associating aspect terms with pivot words (we call this passive domain adaptation because the transfer of aspect terms relies on the links to pivots). However, all these methods need either manually labeled pivot words or expensive computing resources to build associations. In this paper, we propose a novel active domain adaptation method. Our goal is to transfer aspect terms by actively supplementing transferable knowledge. To this end, we construct syntactic bridges by recognizing syntactic roles as pivots instead of as links to pivots. We also build semantic bridges by retrieving transferable semantic prototypes. Extensive experiments show that our method significantly outperforms previous approaches.


Introduction
Aspect term extraction (ATE) is a fundamental task in aspect-based sentiment analysis. Given a review sentence "The pizza here is also absolutely delicious.", ATE aims to extract the term pizza. Recent studies define ATE as a sequence tagging task and propose supervised taggers (Wang et al., 2017; Xu et al., 2018). However, due to the high cost of token-level annotation, the lack of labeled data becomes the main obstacle (Chen and Qian, 2019).
To alleviate the data deficiency issue, unsupervised domain adaptation is proposed to transfer knowledge from the labeled source domain to the unlabeled target domain. Since ATE is a token-level task, it is natural to conduct token-level domain adaptation. Then a problem arises: many aspect terms are domain-specific and cannot be transferred directly. We present the proportion of source aspect terms that also appear in target test data in Figure 1. As can be seen, in distant transfer pairs like R→L, fewer than 10% of source aspect terms appear in the target data. Even in a close pair like L→D, the proportion is no more than 40%. In other words, there is a wide discrepancy between the data from different domains, and many aspect terms have to be transferred under the guidance of proper references.
To solve this problem, previous studies try to associate aspect terms with specific pivot words. We name these methods passive domain adaptation because the transfer of aspect terms depends on their links to the pivots. There are two types of methods along this line. (1) Opinion terms as pivots. Since aspect and opinion terms usually appear in pairs, it is straightforward to extract aspect terms with the indication from opinion terms. Early studies (Li et al., 2012; Ding et al., 2017) use common opinion seeds (e.g., good, fancy) and pre-defined rules (e.g., good→amod→NN) to extract aspect terms across domains. However, it is hard to collect a complete set of seeds or define high-quality rules, and thus these methods often produce inferior performance. Several studies (Wang and Pan, 2018, 2019b) manually annotate all opinion terms in reviews and design neural models to capture aspect-opinion relations via multi-task learning. While achieving improvements, these methods incur additional annotation costs.
(2) Context terms as pivots. Since pre-trained language models (PLMs) like BERT represent words w.r.t. their contexts, recent studies (Xu et al., 2019; Gong et al., 2020) leverage PLMs to transfer aspect terms with common context terms. However, not all context terms qualify as pivots (e.g., eat). In addition, PLMs like BERT build word associations mainly based on semantic similarity in co-occurring contexts. For an aspect term like pizza, BERT tends to link it to hamburger via a flow like pizza→eat→hamburger. Consequently, it is hard for these methods to identify keyboard in the target domain based on the labeled term pizza in the source domain.
In this paper, we propose a novel active domain adaptation method. Concretely, we construct two types of bridges for all words, which can help transfer aspect terms across domains. An example in Figure 2 shows how to identify the unseen target term keyboard based on the source term pizza. (1) The syntactic bridge aims to recognize transferable syntactic roles for the words across domains. Though pizza and keyboard have almost no semantic relatedness, they often play a similar role in parse trees. In view of this, we treat the involved syntactic roles (including POS tags and dependency relations) of a certain word as its syntactic bridge. Previous studies also utilize dependency information. However, our method differs from existing ones in that we do not use dependency relations to associate pivot words with aspect terms. Instead, we treat syntactic roles themselves as pivot features and do not need any manually annotated pivot words.
(2) The semantic bridge moves one step further by retrieving transferable prototypes. Intuitively, if we correlate pizza with some prototype target terms like {disk, OS, mouse}, the domain discrepancy between the training and testing reviews can be largely reduced. Hence we regard the prototypes of a certain word as its semantic bridge and design a syntax-enhanced similarity metric to retrieve them. Compared with previous opinion and context term-based methods, building a semantic bridge directly links aspect terms across domains and only requires unlabeled source and target data.
Based on the syntactic/semantic bridges, we then develop an end-to-end tagger to fuse reviews with these transferable bridges. We conduct extensive experiments on three datasets. The results show that our method achieves a new state-of-the-art performance with a low computational cost.

Related Work
Aspect Term Extraction Early research on ATE mainly involves pre-defined rules (Hu and Liu, 2004; Popescu and Etzioni, 2005; Wu et al., 2009; Qiu et al., 2011) and hand-crafted features (Li et al., 2010; Liu et al., 2012, 2013; Chen et al., 2014). With the development of deep learning, supervised sequence taggers have become the mainstream due to their promising performance (Liu et al., 2015; Wang et al., 2016, 2017; Xu et al., 2018; Ma et al., 2019; Chen and Qian, 2020a). More recently, many studies have emerged that couple ATE with other tasks like aspect-level sentiment classification (He et al., 2019; Chen and Qian, 2020b). Since these methods depend heavily on abundant domain-specific training data, they can hardly scale to domains where labeled data is absent. Hence it would be more practical to develop unsupervised domain adaptation methods for ATE.
Domain Adaptation Many domain adaptation methods have been proposed to solve coarse-grained tasks like text classification (Blitzer et al., 2006; Ganin and Lempitsky, 2015; Guo et al., 2020). The basic idea in coarse-grained tasks is to transfer pivot words, which does not fit ATE well since most aspect terms are domain-specific non-pivot words. There have been a few attempts at this problem, which fall into two lines. (1) One is to model aspect-opinion relations. Early research uses common opinion seeds and pre-defined dependency link rules to build manual features (Jakob and Gurevych, 2010), conduct bootstrapping (Li et al., 2012), and create pseudo target labels (Ding et al., 2017). Due to the incompleteness of seeds and the inflexibility of rules, these methods often produce inferior performance. Subsequent studies (Wang and Pan, 2018, 2019a,b) manually annotate all opinion terms in reviews and design trainable neural models to capture the relations via multi-task learning. However, they incur extra annotation costs. (2) The other aims to find aspect-context relations. Xu et al. (2019) post-train BERT on the cross-domain corpus to enhance its domain adaptation ability. Gong et al. (2020) and Pereg et al. (2020) further incorporate external syntactic information into BERT with auxiliary tasks or modified attention mechanisms, but they still rely on the prior knowledge in BERT. These methods often have more than 100M parameters and require substantial computing power. Unlike all the aforementioned methods, we do not associate aspect terms with pivot words but actively transfer them via bridges.

Methodology
In this section, we first introduce the cross-domain ATE task. We then illustrate how to construct syntactic and semantic bridges. Lastly, we present the bridge-based sequence tagging.

Problem Statement
Given a review $x = \{x_1, ..., x_n\}$, we formulate ATE as a sequence tagging task that aims to predict a tag sequence $y = \{y_1, ..., y_n\}$, where each $y_i \in \{B, I, O\}$ denotes the beginning of, inside of, or outside of an aspect term. In this paper, we focus on unsupervised domain adaptation for ATE, i.e., labeled training data is not available in the target domain. Specifically, given a set of labeled data $D^S = \{(x_j^S, y_j^S)\}_{j=1}^{N_S}$ from the source domain and a set of unlabeled data $D^U = \{x_j^U\}_{j=1}^{N_U}$ from the target domain, our goal is to predict labels $y^T$ for the unseen target test data $D^T$.
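To make the tagging formulation concrete, the following minimal sketch decodes a BIO tag sequence into aspect terms. The function name `decode_bio` and the handling of a stray `I` tag (treated as the start of a term) are illustrative assumptions, not part of the described method.

```python
from typing import List, Tuple

def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end_exclusive, text) for each aspect term in a BIO-tagged review."""
    spans = []
    start = None  # index where the currently open term begins, if any
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:  # close the previous term before opening a new one
                spans.append((start, i, " ".join(tokens[start:i])))
            start = i
        elif tag == "I":
            if start is None:  # assumption: a stray I opens a new term
                start = i
        else:  # "O" closes any open term
            if start is not None:
                spans.append((start, i, " ".join(tokens[start:i])))
                start = None
    if start is not None:  # term that runs to the end of the review
        spans.append((start, len(tokens), " ".join(tokens[start:])))
    return spans

tokens = ["the", "pizza", "here", "is", "also", "absolutely", "delicious", "."]
tags = ["O", "B", "O", "O", "O", "O", "O", "O"]
print(decode_bio(tokens, tags))
```

For the example review from the introduction, the only decoded term is pizza.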

Bridge Construction
Given a review sentence $x$ from either domain, we map it with a lookup table $E \in \mathbb{R}^{d_e \times |V|}$ and generate word embeddings $E = \{e_1, ..., e_n\} \in \mathbb{R}^{d_e \times n}$, where $|V|$ is the vocabulary size and $d_e$ is the embedding dimension. For cross-domain ATE, we construct bridges for reviews to help directly transfer aspect terms across two domains.
Syntactic Bridge In natural language, linguistic expressions are rich and flexible. In contrast, syntactic structures are limited and general across domains. Based on this observation, we propose to build connections between source and target words based on their syntactic roles (POS tags and dependency relations) rather than the lexical items. For example, from the parsing results in the upper part of Figure 3, the word pizza with the POS tag NN and dependency relations {det, nsubj} might be an aspect term, while words with the RB tag and advmod relation might not. Note that the sentence "The keyboard is in reasonable size." in the target domain has similar parsing results. Hence the syntactic roles can serve as supplementary evidence for recognizing aspect terms across domains. Several prior studies (Wang and Pan, 2018, 2019b; Pereg et al., 2020) also make use of parsing results. However, they only use dependency relations to link words or to propagate word representations. For example, given a dependency $great \xrightarrow{nsubj} pizza$ in $D^S$, where great is a known pivot and pizza is an aspect term, the goal is to extract keyboard as an aspect term from the target review "The keyboard is great" in $D^T$. The typical syntax-based method Hier-Joint (Ding et al., 2017) first locates the pivot great, then utilizes the nsubj dependency to identify the term keyboard. Other methods like RNSCN (Wang and Pan, 2018) combine the embedding of the child node (pizza) with that of the parent node (great) according to the relation type, or the reverse (depending on the specific design). It can be seen that the dependency relation nsubj here is only used as a link to the pivot.
Figure 3: Construction of the syntactic bridge. If a POS tag or dependency relation is involved, its corresponding entry in the vector is set to 1, and otherwise 0.
We start in the opposite direction, i.e., we aim to fully exploit syntactic roles by recognizing them as pivots instead of treating them as links to pivots. To achieve this, we present a novel data structure that encodes the POS and dependency information by grounding them in the involved words. As shown in the lower part of Figure 3, for a word $x_i$, we use a one-hot vector $b_{pos} \in \mathbb{R}^{N_{pos}}$ and a multi-hot vector $b_{dep} \in \mathbb{R}^{N_{dep}}$ to represent its POS tag and dependency relation(s), where $N_{pos}$ and $N_{dep}$ are the numbers of tag and relation types. For $b_{dep}$, we merge all relations involving $x_i$ regardless of the direction (i.e., being the governor or the dependent).
To enlarge the learning capability, we project $b_{pos}$ and $b_{dep}$ to the same dimensionality with learnable weight matrices $W_{pos}$ and $W_{dep}$ and concatenate them to form the syntactic bridge $b_{syn}$:
$$b_{syn} = (W_{pos} \times b_{pos}) \oplus (W_{dep} \times b_{dep}),$$
where $\oplus$ is concatenation and $b_{syn} \in \mathbb{R}^{d_e}$ has the same dimensionality as the word embedding $e$. In training, $W_{pos}$ and $W_{dep}$ are trained on labeled samples. In testing, we fix them and obtain $b_{syn}$ for $D^T$. In this way, our method preserves both types of syntactic information throughout the entire learning process, and we can take full advantage of their transferable information.
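The construction of the syntactic bridge can be sketched in plain Python as follows. The tag and relation inventories, the random initialization, the toy dimension `d_e = 8`, and the choice to project each vector to `d_e / 2` dimensions (so that the concatenation has `d_e` dimensions) are assumptions made for illustration; in the paper the projection matrices are learned.

```python
import random

# Toy inventories (assumption); the paper reports 45 POS tags and 40 dependency relations.
POS_TAGS = ["DT", "NN", "VBZ", "RB", "JJ"]
DEP_RELS = ["det", "nsubj", "cop", "advmod", "root"]

def one_hot(tag, inventory):
    """One-hot vector b_pos for a single POS tag."""
    return [1.0 if t == tag else 0.0 for t in inventory]

def multi_hot(rels, inventory):
    """Multi-hot vector b_dep for all relations a word takes part in."""
    return [1.0 if r in rels else 0.0 for r in inventory]

def matvec(W, x):
    """Matrix-vector product W x for a matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

d_e = 8  # toy embedding size (assumption)
rng = random.Random(0)
# Assumption: each projection maps to d_e / 2 so the concatenation has d_e dimensions.
W_pos = random_matrix(d_e // 2, len(POS_TAGS), rng)
W_dep = random_matrix(d_e // 2, len(DEP_RELS), rng)

def syntactic_bridge(pos_tag, dep_rels):
    b_pos = one_hot(pos_tag, POS_TAGS)
    b_dep = multi_hot(dep_rels, DEP_RELS)
    # List concatenation plays the role of the concatenation operator.
    return matvec(W_pos, b_pos) + matvec(W_dep, b_dep)

b_pizza = syntactic_bridge("NN", {"det", "nsubj"})
b_keyboard = syntactic_bridge("NN", {"det", "nsubj"})
print(len(b_pizza), b_pizza == b_keyboard)
```

Because pizza and keyboard share the same POS tag and relations in this toy example, they receive identical bridges even though the words themselves are unrelated.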

Semantic Bridge
The semantic bridge takes the syntactic roles above as a basis but moves one step further to retrieve transferable prototypes. Unlike previous passive methods that construct information flows like pizza→good→keyboard via opinion terms or pizza→offer→keyboard via context terms, we aim to construct a direct flow like pizza→keyboard. For example, to transfer knowledge from pizza in $D^S$ to keyboard in $D^T$, we aim to introduce some supplementary target terms like {disk, OS, mouse} in $D^U$ for pizza and directly improve its semantic relatedness with keyboard. We call these supplementary terms prototypes and will retrieve them to build the semantic bridges. PLMs like BERT can find a set of semantically similar terms like {hamburger, salad} for pizza, which can also serve as prototypes. However, such prototypes are not suitable for the domain adaptation task, because aspect terms in one domain are often far away from those in another domain in the semantic space. To address this problem, we design a syntax-enhanced similarity metric to retrieve transferable semantic prototypes.
Before starting, we filter the words in $D^U$ by frequency and only preserve those appearing more than $\tau$ times. We regard these words in the unlabeled target data as candidate prototypes and build a prototype bank $\tilde{V}$ from $D^U$ accordingly. We then conduct retrieval following the procedure in Figure 4.
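The frequency filter can be illustrated with a short sketch. The function name `build_prototype_bank` and the toy reviews are hypothetical; only the rule "keep words appearing more than τ times" comes from the text.

```python
from collections import Counter

def build_prototype_bank(target_reviews, tau):
    """Keep the words that appear more than tau times in the unlabeled target reviews."""
    counts = Counter(token for review in target_reviews for token in review)
    return {word for word, count in counts.items() if count > tau}

# Toy unlabeled target-domain reviews, already tokenized and lowercased.
reviews = [
    ["the", "keyboard", "is", "great"],
    ["the", "keyboard", "feels", "cheap"],
    ["the", "screen", "is", "great"],
]
bank = build_prototype_bank(reviews, tau=1)
print(sorted(bank))
```

With `tau=1`, the words that occur only once (feels, cheap, screen) are dropped from the bank.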
Figure 4: Construction of the semantic bridge. For a query word, the top-K prototypes are retrieved from the prototype bank and aggregated into its semantic bridge.
For a query word $v \in V^S$ (the vocabulary of $D^S$), we want to find a prototype term $\tilde{v} \in \tilde{V}$ that plays a similar syntactic role in the target domain. Specifically, we first summarize the global usages of $v$ by merging its POS and dependency embeddings over all reviews in which $v$ appears:
$$b^g_{pos} = b^1_{pos} \mid b^2_{pos} \mid \cdots \mid b^{N_S}_{pos}, \qquad b^g_{dep} = b^1_{dep} \mid b^2_{dep} \mid \cdots \mid b^{N_S}_{dep},$$
where the superscript indexes reviews, $\mid$ is the dimension-wise OR operation, and $N_S$ is the number of reviews in $D^S$. Similarly, we can obtain $\tilde{b}^g_{pos}$ and $\tilde{b}^g_{dep}$ for $\tilde{v}$. We then define the syntax-enhanced similarity $s.sim(v, \tilde{v})$ between $v$ and $\tilde{v}$ from three cosine similarities, $c(b^g_{pos}, \tilde{b}^g_{pos})$, $c(b^g_{dep}, \tilde{b}^g_{dep})$, and $c(e, \tilde{e})$, where $e$ and $\tilde{e}$ are word embeddings and $c(\cdot, \cdot)$ is the cosine similarity. Here the POS and dependency similarities are used to find similar syntactic roles, while the word similarity is used to reduce the noise of prototypes. Consequently, we can obtain an s.sim score matrix $M^S \in \mathbb{R}^{|V^S| \times |\tilde{V}|}$ and, for each query word, retrieve the top-K prototypes with their s.sim scores $\{\tilde{s}_k\}_{k=1}^{K}$ from the prototype bank. Lastly, we aggregate these prototypes into the semantic bridge $b_{sem}$ of $v$.
Following the way for $D^S$, we also retrieve transferable prototypes for $D^U$ and $D^T$ using $\tilde{V}$. In this way, source and target words with the same prototypes can be directly correlated to each other. For $D^U$, we can generate a score matrix $M^U \in \mathbb{R}^{|V^U| \times |\tilde{V}|}$ by calculating the s.sim for all words in $D^U$ and all candidate prototypes in $\tilde{V}$. Then we can obtain the semantic bridge $b_{sem}$ for each word in $D^U$ in training. In testing, $D^T$ is unseen and the global $b^g_{pos}$ / $b^g_{dep}$ are not available. Therefore, for a word $w$ in $D^T$, we obtain $b_{sem}$ using $M^U$ if $w$ has appeared in $D^U$. Otherwise, we temporarily use the local $b_{pos}$ / $b_{dep}$ of $w$ in the current testing sample to replace the global $b^g_{pos}$ / $b^g_{dep}$ and calculate the s.sim.
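The retrieval procedure can be sketched as below. Two points are assumptions rather than details stated in the text: the three cosine similarities are combined by multiplication, and the semantic bridge is formed as the score-weighted sum of the top-K prototype embeddings. All vectors and words are toy values.

```python
import math

def merge_or(vectors):
    """Dimension-wise OR over the binary vectors of every occurrence of a word."""
    return [1.0 if any(v[d] for v in vectors) else 0.0
            for d in range(len(vectors[0]))]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def s_sim(query, proto):
    # ASSUMPTION: the three cosine similarities are multiplied.
    return (cosine(query["pos"], proto["pos"])
            * cosine(query["dep"], proto["dep"])
            * cosine(query["emb"], proto["emb"]))

def semantic_bridge(query, bank, k):
    """Retrieve the top-k prototypes and aggregate their embeddings."""
    scored = sorted(((s_sim(query, p), name) for name, p in bank.items()),
                    reverse=True)[:k]
    bridge = [0.0] * len(query["emb"])
    for score, name in scored:
        for d, value in enumerate(bank[name]["emb"]):
            bridge[d] += score * value  # ASSUMPTION: score-weighted sum
    return [name for _, name in scored], bridge

# Source query word: a noun seen with det in one review and nsubj in another.
pizza = {"pos": merge_or([[0, 1, 0], [0, 1, 0]]),
         "dep": merge_or([[1, 0, 0], [0, 1, 0]]),
         "emb": [1.0, 0.0]}
# Toy prototype bank built from unlabeled target data.
bank = {
    "disk": {"pos": [0, 1, 0], "dep": [1, 1, 0], "emb": [0.6, 0.8]},
    "mouse": {"pos": [0, 1, 0], "dep": [1, 0, 0], "emb": [0.8, 0.6]},
    "quickly": {"pos": [0, 0, 1], "dep": [0, 0, 1], "emb": [0.8, 0.6]},
}
prototypes, b_sem = semantic_bridge(pizza, bank, k=2)
print(prototypes)
```

In this toy bank the adverb quickly has a close embedding but shares no syntactic role with pizza, so its score is zero and it is not retrieved; this is the noise-filtering effect the syntactic terms are meant to provide.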

Bridge-based Sequence Tagging
Based on the syntactic and semantic bridges, we now propose a lightweight end-to-end sequence tagger for aspect term extraction. As shown in Figure 5, the tagger receives a mixture of $D^S$ and $D^U$ for training and then makes predictions for $D^T$ in testing. We describe the details below. (2) Bridges can help extract domain-invariant information from $e_i$. Therefore, we propose to enhance the embedding $e_i$ of a word $x_i$ with its transferable bridges $b_{syn,i}$ and $b_{sem,i}$. Specifically, we use a gating operation to fuse bridges. Taking the syntactic bridge as an example, we first calculate a dimension-wise gate $g_{syn,i}$:
$$g_{syn,i} = \sigma(W_{syn}(e_i \oplus b_{syn,i})),$$
where $W_{syn} \in \mathbb{R}^{2d_e \times 2d_e}$, $\sigma$ is the sigmoid function, and $\oplus$ is concatenation. We then scale the concatenated vector $e_i \oplus b_{syn,i}$ with $g_{syn,i}$ and obtain the syntactic bridge enhanced embedding $e_{syn,i}$:
$$e_{syn,i} = g_{syn,i} \odot (e_i \oplus b_{syn,i}),$$
where $\odot$ is element-wise multiplication. The semantic bridge enhanced embedding $e_{sem,i}$ can be calculated similarly. We term the models that take $e_i$, $e_{syn,i}$, and $e_{sem,i}$ as input BaseTagger, SynBridge, and SemBridge, respectively. The three types of embeddings are collectively called $e_{input,i}$.
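The gating operation can be sketched in a few lines of plain Python. The toy dimension, the random initialization of the gate matrix, and the function name `gated_fusion` are assumptions for illustration; in the tagger the matrix is learned.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(e, b, W):
    """Compute g = sigmoid(W (e concat b)) and return g * (e concat b) element-wise."""
    z = e + b  # list concatenation: a vector of length 2 * d_e
    gate = [sigmoid(sum(w * zj for w, zj in zip(row, z))) for row in W]
    return [g * zj for g, zj in zip(gate, z)]

d_e = 4  # toy embedding size (assumption)
rng = random.Random(0)
# Gate matrix of shape (2 * d_e) x (2 * d_e); randomly initialized here.
W_syn = [[rng.uniform(-0.5, 0.5) for _ in range(2 * d_e)] for _ in range(2 * d_e)]

e_i = [0.2, -0.1, 0.4, 0.3]      # word embedding of x_i
b_syn_i = [0.5, 0.0, -0.2, 0.1]  # syntactic bridge of x_i
e_syn_i = gated_fusion(e_i, b_syn_i, W_syn)
print(len(e_syn_i))
```

Since every gate value lies strictly between 0 and 1, each output dimension is a damped copy of the corresponding input dimension, which is how the gate can suppress bridge information that is not useful.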
Feature Extractor Previous studies (Xu et al., 2018) show that low-level token features are insufficient for tagging terms. Therefore, we use a CNN encoder containing $L$ stacked convolutional layers with ReLU activation to extract the high-level features $f^l_i$ of each token at layer $l$, where $K \in \mathbb{R}^{d_f \times (d_{input} \times ks)}$ is the kernel group and $ks = 2c + 1$ is the kernel size.
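A minimal sketch of one such convolutional layer is given below, using the kernel-group shape stated above: each token's window of `ks = 2c + 1` vectors is flattened and multiplied by the kernel group, followed by ReLU. Zero padding at the sentence boundaries and the absence of a bias term are assumptions, as are the function names.

```python
def conv_layer(features, K, ks):
    """One 1-D convolution with ReLU over a token sequence.

    features: n token vectors of size d_in.
    K: kernel group with d_out rows, each of length d_in * ks.
    """
    c = (ks - 1) // 2
    n, d_in = len(features), len(features[0])
    pad = [0.0] * d_in  # assumption: zero padding outside the sentence
    out = []
    for i in range(n):
        window = []
        for j in range(i - c, i + c + 1):
            window.extend(features[j] if 0 <= j < n else pad)
        out.append([max(0.0, sum(w * x for w, x in zip(row, window))) for row in K])
    return out

def cnn_encoder(features, kernels, kernel_sizes):
    """Stack several convolutional layers; the last output gives the top-level features."""
    for K, ks in zip(kernels, kernel_sizes):
        features = conv_layer(features, K, ks)
    return features

# Toy input: three tokens with one-dimensional embeddings and a summing kernel.
tokens = [[1.0], [2.0], [3.0]]
print(conv_layer(tokens, [[1.0, 1.0, 1.0]], ks=3))
```

With the summing kernel, each output is the sum of a token and its two neighbours, which shows how the layer captures N-gram context rather than single words.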
Token Classifier For recognizing aspect and opinion terms, we send $f^L_i$ from the last layer to a token classifier, which outputs $\hat{y}_i$, the prediction for the word $x_i$.
Then we add a Gradient Reversal Layer (GRL) (Ganin and Lempitsky, 2015) with the scale coefficient $\lambda$ to the sentence-level feature $f_g$ and train a domain classifier to distinguish the domain that $f_g$ belongs to. The classifier is an MLP containing $L_D$ layers with ReLU activation, and its output $\hat{y}_d$ is the domain prediction.
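The behaviour of a gradient reversal layer can be illustrated without an autograd library by writing the forward and backward passes by hand. The class below is a hypothetical sketch: the forward pass is the identity, and the backward pass multiplies the incoming gradient by the negative scale coefficient, following the general idea of Ganin and Lempitsky (2015).

```python
class GradientReversal:
    """Identity in the forward pass; scales gradients by -lam in the backward pass."""

    def __init__(self, lam):
        self.lam = lam

    def forward(self, f_g):
        # The domain classifier sees the feature unchanged.
        return list(f_g)

    def backward(self, grad_output):
        # The feature extractor receives the reversed, scaled gradient.
        return [-self.lam * g for g in grad_output]

grl = GradientReversal(lam=0.1)
f_g = [0.5, -1.0, 2.0]
out = grl.forward(f_g)
grad_to_extractor = grl.backward([1.0, -2.0, 0.5])
print(out, grad_to_extractor)
```

The domain classifier is therefore trained to separate domains, while the reversed gradient pushes the feature extractor toward features on which the domains are hard to tell apart.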
Training Procedure In training, only samples from $D^S$ have corresponding BIO labels $y^S$ for token classification. The goal is to minimize the tagging loss $\mathcal{L}_{BIO}$ for recognizing aspect terms, which accumulates $\ell(\hat{y}_i, y_i)$ over the tokens of the source samples, where $\ell$ is the cross-entropy loss function. On the other hand, the samples from both $D^S$ and $D^U$ are used to train the domain classifier by minimizing the domain classification loss $\mathcal{L}_{DOM}$, which accumulates $\ell(\hat{y}_d, y_d)$ over these samples, where $y_d = 0$ for $D^S$ and $y_d = 1$ for $D^U$. The final loss for training the end-to-end tagger is defined as $\mathcal{L} = \mathcal{L}_{BIO} + \mathcal{L}_{DOM}$. Notice that $D^T$ is only used in testing. There is no data leakage in training, and the task setting is strictly inductive.
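The composition of the final loss can be sketched as follows. Summing (rather than averaging) over tokens and samples is an assumption, and the probability vectors are toy values standing in for the classifier outputs.

```python
import math

TAGS = {"B": 0, "I": 1, "O": 2}

def cross_entropy(probs, gold_index):
    """Cross-entropy of a predicted distribution against a gold class index."""
    return -math.log(probs[gold_index])

def tagging_loss(token_probs, gold_tags):
    # ASSUMPTION: the per-token losses are summed.
    return sum(cross_entropy(p, TAGS[t]) for p, t in zip(token_probs, gold_tags))

def total_loss(source_batch, domain_batch):
    """L = L_BIO + L_DOM.

    source_batch: (token_probs, gold_tags) pairs from the labeled source data.
    domain_batch: (domain_probs, y_d) pairs, y_d = 0 for source and 1 for unlabeled target.
    """
    l_bio = sum(tagging_loss(p, y) for p, y in source_batch)
    l_dom = sum(cross_entropy(p_d, y_d) for p_d, y_d in domain_batch)
    return l_bio + l_dom

source_batch = [([[0.5, 0.25, 0.25], [0.25, 0.25, 0.5]], ["B", "O"])]
domain_batch = [([0.5, 0.5], 0), ([0.5, 0.5], 1)]
loss = total_loss(source_batch, domain_batch)
print(round(loss, 4))
```

Only the source sample contributes to the tagging term, while both the source and the unlabeled target sample contribute to the domain term.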

Experimental Setup
Datasets We use three conventional English datasets from different domains and construct six directed transfer pairs, where R and L are from SemEval 2014 and 2015 (Pontiki et al., 2014, 2015), and D is collected by Hu and Liu (2004). Following previous studies (Wang and Pan, 2018, 2019b; Pereg et al., 2020), we use three different splits, each with a fixed train-test ratio of 3:1. The detailed statistics of the datasets are presented in Table 1.
Settings We pre-process each dataset by lowercasing all words. We use the same word2vec vectors as previous studies (Wang and Pan, 2018, 2019a,b) to generate word embeddings, and set the dimensionality $d_e = 100$. In the syntactic bridge, we use Stanford CoreNLP (Manning et al., 2014) for dependency parsing. There are 45 classes of POS tags and 40 classes of dependency relations in the three datasets. In the semantic bridge, we set the frequency threshold $\tau = 5$ and the number of prototypes $K = 10$. In the end-to-end tagger, we set the number of convolutional layers $L = 4$ with kernel sizes $ks$ of 3, 5, 5, and 5, respectively, and the number of MLP layers $L_D = 3$. Dropout (Srivastava et al., 2014) is applied to the layers' outputs with probability 0.5. The dimensionality of features is $d_f = 256$, and the scale coefficient of GRL is $\lambda = 0.1$. We train the tagger for 100 epochs using the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 1e-4 and a batch size of 8 on a 1080Ti GPU.
Evaluation For each transfer pair, we use the labeled training data from the source domain and unlabeled training data from the target domain to train the tagger. Then we evaluate the tagger on unseen test data from the target domain. We use the mean F1-scores of aspect terms over three splits with three random seeds (i.e., nine runs for each transfer pair) for evaluation.
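The evaluation protocol can be sketched as below. Exact span matching is assumed as the criterion for a correct aspect term (the text does not state the matching rule), and the span tuples `(sentence_id, start, end)` and per-run scores are toy values.

```python
def f1_score(gold_spans, pred_spans):
    """F1 over aspect-term spans; a prediction counts only on an exact match (assumption)."""
    gold, pred = set(gold_spans), set(pred_spans)
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

def mean_f1(run_scores):
    """Average the F1-scores of all runs of one transfer pair."""
    return sum(run_scores) / len(run_scores)

gold = [(0, 1, 2), (1, 0, 2)]  # (sentence_id, start, end)
pred = [(0, 1, 2), (1, 0, 1)]  # the second span has a wrong boundary
print(f1_score(gold, pred))

# Three splits with three seeds each give nine runs per transfer pair.
runs = [0.40, 0.50, 0.60, 0.45, 0.55, 0.50, 0.50, 0.48, 0.52]
print(round(mean_f1(runs), 4))
```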

Compared Methods
We classify all models into three categories. Type-I denotes the opinion term-based methods. TCRF (Jakob and Gurevych, 2010), RAP (Li et al., 2012), and Hier-Joint (Ding et al., 2017) use common opinion seeds and pre-defined rules. Type-II denotes the context term-based methods, among which UDA (Gong et al., 2020) and SA-EXAL (Pereg et al., 2020) incorporate syntactic information into BERT with auxiliary tasks and modified attention mechanisms. Type-III denotes the proposed active domain adaptation strategy. BaseTagger is the tagger without bridges, while SynBridge and SemBridge use syntactic and semantic bridges, respectively.

Main Results
The comparison results for all methods are shown in Table 2. Our proposed model achieves a new state-of-the-art performance in terms of the average F1-scores. For example, SemBridge outperforms the best Type-I method, TIMN, by 7.02% and the Type-II method BERT-Cross by 5.21%. We also notice that our BaseTagger already outperforms all baselines. We attribute this to the design of the CNN feature extractor and domain adversarial training (DAT). The CNN focuses on N-gram features rather than single words and reduces the side effects of non-pivot aspect terms. DAT is applied to the sentence-level features, such that they are not misled by the common N-grams that are labeled both 0 and 1.
SynBridge and SemBridge further improve BaseTagger with a 1.80% and 2.68% absolute gain, respectively. This demonstrates the effectiveness of our proposed active domain adaptation strategy. Meanwhile, SemBridge is slightly superior to SynBridge. The reasons are two-fold. (1) The semantic bridges come from prototype words that possess prior embedding knowledge and also contain syntactic information, while the syntactic bridges are trained from scratch. (2) The retrieved top-K terms make the supplementary information in SemBridge more diverse and abundant than that in SynBridge.
Among the baselines, early methods using common opinion seeds and pre-defined rules are inferior. Relying on annotated opinion terms, methods like TIMN get some improvements but incur extra annotation costs. By combining pre-trained BERT with external dependency information and a cross-domain corpus, UDA, SA-EXAL, and BERT-Cross outperform previous methods, but they need substantial computational resources. In contrast, by using static word2vec embeddings, our model can outperform those with dynamic BERT representations. This suggests that there is still room for improvement by exploring syntactic and semantic features beyond the popular BERT-based models.

What If There Is an OTE Task?
With the proposed active domain adaptation strategy, we do not need any manually labeled opinion terms for ATE. However, this does not mean that our method cannot handle opinion term extraction (OTE). On the contrary, if labeled opinion terms are provided in $D^S$, we can also conduct the OTE task for $D^T$ by simply modifying the tagger. Specifically, we add an opinion term prediction layer in Eq. 8 and then extract aspect and opinion terms simultaneously. The results are shown in Table 3.
Our method again outperforms all baselines. We find a small performance decrease in AVG-AS compared with that in Table 2. Similar results are also observed in BERT-Base. The reason is that the objectives of ATE and OTE may interfere with each other without proper balancing and a sophisticated multi-task learning framework.

Ablation Study
We conduct a series of ablation studies to validate the effectiveness of our method. The results are shown in Table 4. Results 1∼2 conform to our previous discussion about BaseTagger that both CNN and domain adversarial training contribute to the overall good performance. Results 3∼6 show the effectiveness of POS and dependency embeddings in SynBridge. Specifically, in 5∼6, we replace our proposed structure for dependency with the frequently used Tree-LSTM and GCN to model the dependency tree and find a significant drop in performance. Results 7∼9 show the importance of all three types of similarity for retrieving prototypes in SemBridge.

Parameter Study
There are three key hyperparameters in our method: the scale coefficient of GRL $\lambda$, the frequency threshold $\tau$, and the number of prototypes $K$. We vary $\lambda$ in the range $10^{-4}$ ∼ 1.0 and $\tau$ / $K$ in 1 ∼ 10 to investigate their impacts and present the results in Figure 6.
In Figure 6(a), when increasing $\lambda$ from $10^{-4}$ to $10^{-1}$, we enlarge the scale of domain adversarial training in GRL and get small improvements. However, the performance does not keep rising when $\lambda = 1.0$. This result shows that simply forcing non-pivots to transfer knowledge is not suitable for domain adaptation. In Figure 6(b), $\tau$ is used to balance diversity and accuracy. A low $\tau$ means that prototypes are diverse, but some of them are long-tail words and contribute little to the reduction of domain discrepancy. On the contrary, a high $\tau$ only preserves frequent prototypes, and some meaningful prototypes are filtered out. Therefore, a middle value $\tau = 5$ is an appropriate choice. For $K$, the curve is generally upward when more prototypes are introduced. This trend is reasonable since more prototypes provide more target information.
Figure 6: Impacts of hyperparameters $\lambda$, $\tau$, and $K$. (a) Impact of $\lambda$. (b) Impact of $\tau$ / $K$.
In Figure 7, we further analyze the impacts of the percentage of unlabeled data $P_U$ and the percentage of parsing noise $P_N$. For $P_U$, the performance is generally better when more unlabeled target data is introduced. Moreover, around 20%∼40% of the unlabeled data is enough to achieve satisfactory performance. Notice that SemBridge without unlabeled data degenerates into BaseTagger since no prototypes can be retrieved. For $P_N$, we manually perturb the parsing results to observe the robustness of our method. After introducing noise into parsing, the performance begins to degrade, but not by a large margin. Our method is able to resist parsing errors for two reasons. First, beyond syntactic roles, we also incorporate embedding similarity when retrieving prototypes (for SemBridge only). Second, the gating mechanism can further filter useless syntactic information and maintain the quality of word representations.

Case Study
To take a closer look, we select a few samples from the target test data for a case study. S1 and S2 show the positive impacts of bridges. Due to the space limit, we illustrate S1 in detail. Since most words in S1 are domain-specific terms in L, RNSCN fails to recognize any aspect terms by simply propagating word representations with dependency. BERT-Cross only extracts a part of the aspect terms based on its prior knowledge. For our bridge-based method, SynBridge supplements syntactic roles {nummod, compound, obj, conj, NNS} for port. These syntactic roles also join the representation of usb and help to extract usb ports correctly. For SemBridge, the analysis is much more straightforward. usb is the prototype of typical aspect terms in R like {garlic, thai, banana}, thus the tagger with semantic bridges can easily recognize usb as an aspect term. S3 further illustrates how SemBridge helps recover from wrong parsing results. Such results make the two syntax-based methods RNSCN and SynBridge stop working. In contrast, tuna is the prototype of nouns like {nvidia, amd, blade} in L and melt has verb prototypes like {imagine, hang, relax} in R, thus SemBridge correctly extracts tuna and filters out melt at the same time.
In Table 6, we further present several sample prototypes of the training data from the transfer pairs R→L (upper three) and L→R (lower three) in SemBridge, where the three terms on the left are an aspect term, an opinion term, and a context term, respectively. For a source non-pivot term like processor in L, SemBridge enhances it with typical target words like soup and burger. As a result, the domain discrepancy between the source and target data is largely reduced with the help of prototypes.

Analysis on Computational Cost
In practice, for any transfer pair, the one-time construction of syntactic and semantic bridges can finish within 30 seconds. Therefore, we focus on the end-to-end training costs of SynBridge/SemBridge. We run five top-performing methods on the transfer pair R→L and present the number of trainable parameters and the running time per epoch of each method in Table 7. We can conclude that our proposed method maintains a low computational cost.

Conclusion
In this paper, we propose a novel active domain adaptation method for aspect term extraction. Unlike previous studies that conduct passive domain adaptation by associating aspect terms with pivots, we actively enhance the terms' transferability by constructing syntactic and semantic bridges for them. We then design a lightweight end-to-end tagger for bridge-based sequence tagging. Experiments on six transfer pairs demonstrate that our method achieves a new state-of-the-art performance with a low computational cost.

B Modification of SAL and UDA
Since SAL and UDA are designed for end-to-end cross-domain aspect-based sentiment analysis, they have access to the aspect sentiment labels in training. As previous studies show, aspect term extraction and aspect-level sentiment classification can benefit each other. Therefore, it is unfair to directly compare our method with SAL and UDA.
We choose to modify SAL and UDA to make them fair competitors. We reduce the collapsed tags {B-POS, I-POS, B-NEG, I-NEG, B-NEU, I-NEU, O} to {B, I, O} and thus remove the aspect-level sentiment classification task. Following other BERT-based methods, we use BERT-Base as the backbone of UDA.
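The tag reduction amounts to dropping the sentiment suffix from each collapsed tag, as the following small sketch shows. The function name `collapse_tag` and the example sequence are illustrative.

```python
def collapse_tag(tag):
    """Map a collapsed tag such as B-POS or I-NEG to its boundary part (B, I, or O)."""
    return tag.split("-", 1)[0]

unified = ["O", "B-POS", "I-POS", "O", "B-NEG", "B-NEU", "I-NEU"]
print([collapse_tag(t) for t in unified])
```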

C Can We Combine SynBridge and SemBridge?
Since SynBridge and SemBridge contain transferable syntactic and semantic information, it is intuitive to combine them for a better performance than either individual model. Here we apply a very simple operation for combination. For a word $x_i$ with embedding $e_i$, we first obtain its syntactic and semantic bridges $b_{syn,i}$ and $b_{sem,i}$, and merge them into a combined bridge:
$$b_{com,i} = (W_{syn} \times b_{syn,i}) + (W_{sem} \times b_{sem,i}). \quad (13)$$
Then we conduct a similar gating operation and get the combined bridge enhanced embedding $e_{com,i}$:
$$g_{com,i} = \sigma(W_{com}(e_i \oplus b_{com,i})), \qquad e_{com,i} = g_{com,i} \odot (e_i \oplus b_{com,i}).$$
Lastly, we regard $e_{com,i}$ as the input of the tagger and make predictions for aspect terms. We term this model ComBridge and present the results in Table 10. ComBridge slightly outperforms SemBridge and achieves the best results among all bridge-based methods. The small improvement is explicable since SemBridge already contains most of the syntactic information in SynBridge and we do not use any sophisticated methods in combination.

D Detailed Results for an Additional OTE Task
When opinion terms are labeled, our method can also conduct aspect term extraction and opinion term extraction simultaneously. For recognizing aspect and opinion terms, we only need to add an opinion term prediction layer, where $\hat{y}_{a,i}$ / $\hat{y}_{o,i}$ are the predictions of {B, I, O} for the aspect / opinion terms. The resulting BIO loss sums the per-token term $\ell(\hat{y}_{a,i}, y_{a,i}) + \ell(\hat{y}_{o,i}, y_{o,i})$, where $\ell$ is the cross-entropy loss function. We present the detailed results in Table 9. Our proposed SynBridge and SemBridge outperform the other baselines in both aspect and opinion F1-scores.