Domain Generalization via Switch Knowledge Distillation for Robust Review Representation



Introduction
With the proliferation of social media, online shopping, and related activities, users are providing an increasing number of reviews about the products they consume (Palmisano et al., 2008; Gauch et al., 2007). With the deluge of customer reviews available, the sentiment score of each customer review can provide implicit feedback for content-based collaborative filtering (Lu et al., 2015).
To learn review representations, previous studies have sought to incorporate external user and product (UP) information into sentiment analysis, building neural models that learn contextual and external UP features to predict rating scores (Tang et al., 2015; Chen et al., 2016; Wu et al., 2018; Zhang et al., 2021c,b). The main idea of these models is to inject UP information as external feature vectors into sentiment classifiers. These methods can be broadly categorized into two groups according to their injection strategies, i.e., bias- and matrix-based injections. Bias-based methods render knowledge as bias terms in the classifier parameters, while matrix-based methods render it as matrix terms. Bias-based methods typically perform worse than matrix-based methods; however, the latter are hard to optimize and cumbersome (Amplayo, 2019). Both treat the data as an in-domain (ID) distribution for specific users, collecting users' historical reviews on different products and using them for recommendations.
Due to privacy concerns, some users prefer anonymity to bypass the recommender system, which may infer users' intentions from their historical interactions. Unfortunately, applying neural models injected with ID UP information to learn review representations of unseen or anonymous users, i.e., out-of-domain (OOD) distributions, may degrade review classification performance. As shown in Figure 1, with ID UP information, neural models can accurately classify positive and negative review samples. However, performance drops when they face samples with anonymous UPs, even though textual information is provided.
We hypothesize that this degradation occurs mainly because the learned review representation depends heavily on external domain-specific features (F_s), such as UP information, while ignoring domain-invariant features (F_i). As a result, the performance of the trained model on unseen or anonymous users is even lower than that of existing models applied to plain texts, i.e., sentiment models using only review text data. A feasible solution is to learn a domain generalization (DG) model using data with UP information as multiple source domains and then distilling F_i, which can be generalized to unseen or anonymous users (Wang et al., 2022; Zhou et al., 2022).
Several strategies have recently been proposed to address DG challenges in broader applications, such as image understanding (Krizhevsky et al., 2017), speech recognition (Hinton et al., 2012), and natural language processing (Sarikaya et al., 2014). The main idea is to eliminate domain shifts and preserve F_i for robust generalization on OOD data distributions (Lee et al., 2022). Nevertheless, these previous studies are not directly applicable here, since the primary change in data distribution occurs in the externally injected features rather than in the review contents.
In this study, we introduce a knowledge distillation (KD) strategy (Gou et al., 2021) with a generalization-switch (GSwitch) module to distill F_i in review representations for robust generalization. The main idea is to predefine a generalization-domain (GD) distribution that preserves F_i while eliminating F_s in both ID and OOD distributions. The GSwitch module simulates the GD distribution by initializing the original UP embeddings as zeros, which masks out ID information at the input level. Moreover, the GSwitch module provides ON and OFF statuses that easily convert ID and GD representations into each other by switching status. To maximize the mutual information between review representations in ID and GD (Krause et al., 2010; Seo et al., 2022), a switch KD (SKD) is proposed, in which bidirectional Kullback-Leibler (KL) divergence (Kullback and Leibler, 1951) is utilized to measure domain gaps. Because ID information is removed from the GD review representation, only F_i is preserved under KD supervision. Review representations in the GD distribution thus carry F_i more sufficiently, and better generalization of sentiment models can be leveraged for OOD data.
Extensive experiments were conducted on IMDB, Yelp-2013, and Yelp-2014 by masking out UPs in the test data to simulate unseen UPs. The results show that the proposed method outperforms several baseline models when anonymous users appear: it does not force learning F_s for OOD representations and preserves F_i for unseen or anonymous UPs. The remainder of this paper is organized as follows. A detailed description of the proposed method is given in Section 2. Extensive experiments and analyses are conducted in Section 3. Conclusions are drawn in Section 4. Finally, the limitations of our paper are discussed in Section 5.

Switch Knowledge Distillation
In this section, we elaborate on the GSwitch module and SKD for the DG problem in personalized sentiment analysis, as shown in Figure 2. The GSwitch module is proposed to inject ID knowledge of UP information into review representations and to mask out ID knowledge to simulate the GD distribution. SKD is a bidirectional KL divergence-based strategy that enhances the GD representation with sufficient F_i from the ID distribution. In practice, unseen or anonymous OOD knowledge in the testing data can also be served by the GD representation in DG tasks. Moreover, such OOD knowledge can be further updated into domain-specific knowledge if sufficient domain data become available.

Domain Generalization
In our DG tasks, there are N source domains with access to the training set D^s = {D^s_1, D^s_2, ..., D^s_N}, where N denotes the amount of ID knowledge. The ith dataset is D^s_i ⊂ K × X × Y, ∀i ∈ {1 : N}, where K and X denote the input spaces of knowledge and review texts, respectively, and Y represents the rating space. The aim is to learn a classification function f(·; θ) : (K, X) → Y using all source domain data that can also generalize to unseen target domains D^t = {(k^t, x^t)}. The objective is formulated as follows:

min_θ L_CE + β L_MIM,   (1)

where L_CE is the cross-entropy loss applied to each source domain sample, and L_MIM presents the mutual information maximization loss applied to multiple source domains and the corresponding GD distributions, with a decay factor β (Krause et al., 2010; Seo et al., 2022). In our work, we introduce a KD loss L_SKD to instantiate L_MIM; see Sec. 2.3. The learned parameters θ are evaluated on the OOD data at inference.
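To make the objective concrete, the following PyTorch sketch combines the cross-entropy term with a decayed mutual-information surrogate. This is our illustration, not the authors' released code: the `dg_objective` name is hypothetical, and the symmetric-KL surrogate for L_MIM is an assumption here (the paper instantiates L_MIM with the SKD loss of Sec. 2.3).

```python
import torch
import torch.nn.functional as F

def dg_objective(logits_id, logits_gd, labels, beta=1.0):
    """Cross-entropy on ID predictions plus a decayed mutual-information
    surrogate between the ID and GD predictive distributions (sketch)."""
    ce = F.cross_entropy(logits_id, labels)
    # Placeholder MIM term: symmetric KL between the two distributions.
    log_p_id = F.log_softmax(logits_id, dim=-1)
    log_p_gd = F.log_softmax(logits_gd, dim=-1)
    mim = 0.5 * (F.kl_div(log_p_gd, log_p_id, log_target=True, reduction="batchmean")
                 + F.kl_div(log_p_id, log_p_gd, log_target=True, reduction="batchmean"))
    return ce + beta * mim
```

With beta=0 the objective reduces to plain supervised cross-entropy, matching the first term of Eq. (1).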

Generalization Switch Module
In contrast to previous DG problems, ID knowledge is explicitly exposed as the input k ∼ K in our work; therefore, such ID knowledge can be injected into the text representation of x ∼ X to build the joint distribution P_KX via the knowledge injection methods used in most current models (Zhang et al., 2021b). Conversely, F_s can also be unloaded from ID models to enhance their generalization performance while preserving F_i (Lu et al., 2022). To this end, we rethink the previous works and propose the GSwitch module with two statuses, ON and OFF.
Given a textual representation H_x and ID knowledge embeddings H_k as inputs, the GSwitch module aims to fuse both representations to generate the ID textual representation H'_x when its status is ON, as shown in Figure 3:

H'_x = Linear_1(H_k) ⊙ H_x + Linear_2(H_k),   (2)

where H_x ∈ R^{L_x × d_x} can be any inner textual representation in text encoders, e.g., the query, key, and value vectors of multihead attention (MHA) in the transformer structure (Vaswani et al., 2017); H_k ∈ R^{d_k} with dimensionality d_k is linearly transformed into d_x, consistent with H_x, via the linear functions Linear_1(·) and Linear_2(·) with weights and biases; L_x represents the sequence length; and ⊙ denotes the Hadamard product (Zhang et al., 2021b,c). The output of the GSwitch module updates the original textual representations to knowledge-enhanced ones. When the status is OFF, the highway of the GSwitch module deactivates ID knowledge, i.e., H'_x = H_x. Therefore, the GSwitch module is agnostic to the sentiment model structure.
To further eliminate gaps between the two statuses, we technically set H_k to zero when UPs first participate in the training procedure and reformulate the first term in Eq. (2) as 1 + Linear_1(H_k). In this way, when sentiment models handle unseen UPs, the GSwitch module does not affect information propagation in either the training or the inference phase. Such zeroed UPs indicate predefined GD knowledge. This characteristic makes the GSwitch module a hot-plugin component, especially for empowering pretrained language models (PLMs) with well-trained checkpoints. As a result, the GSwitch model (OFF) performs as well as the GSwitch model (ON) with UPs in GD distributions.
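A minimal PyTorch sketch of the module described above may help; it is our illustration (class and argument names are assumptions), implementing H'_x = (1 + Linear_1(H_k)) ⊙ H_x + Linear_2(H_k) with zero-initialized linear layers so that the module is initially a no-op, as the hot-plugin property requires.

```python
import torch
import torch.nn as nn

class GSwitch(nn.Module):
    """Sketch of the GSwitch module: fuses knowledge embeddings H_k into a
    textual representation H_x as (1 + Linear1(H_k)) * H_x + Linear2(H_k).
    With status OFF, or with zeroed H_k and zero-initialized linears, it
    reduces to the identity, so it can be hot-plugged into pretrained
    encoders without disturbing their checkpoints."""

    def __init__(self, d_k: int, d_x: int):
        super().__init__()
        self.linear1 = nn.Linear(d_k, d_x)
        self.linear2 = nn.Linear(d_k, d_x)
        # Zero-init so the module initially simulates the GD distribution.
        for lin in (self.linear1, self.linear2):
            nn.init.zeros_(lin.weight)
            nn.init.zeros_(lin.bias)

    def forward(self, h_x, h_k=None, on=True):
        if not on or h_k is None:
            return h_x  # OFF: highway, H'_x = H_x
        scale = 1.0 + self.linear1(h_k).unsqueeze(1)  # broadcast over L_x
        shift = self.linear2(h_k).unsqueeze(1)
        return scale * h_x + shift
```

Here h_x is a (batch, L_x, d_x) tensor and h_k a (batch, d_k) tensor; at initialization, ON and OFF produce identical outputs, matching the claim that the GSwitch model (OFF) behaves like the GSwitch model (ON) under GD knowledge.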
In the training phase, ID models gradually generate ID review representations, starting from the GD distribution. At inference, ID models encode OOD data in the same way as GD data, since OOD knowledge is zero-initialized as GD knowledge. Therefore, GSwitch models (ON), as ID models, can generate GD representations by turning their switch status to OFF.
Algorithm 1: Switch knowledge distillation.
Input: Sentiment model M_Sentiment with GSwitch modules M_GSwitch, source data D_S, confidence threshold ε, and SKD decay factor β.
1: Initialization: M_Sentiment is randomly initialized or loaded from well-pretrained checkpoints; k_S is set to zeros.
2: for iter in [0 : max_iter] do
3:   Sample a batch from the source data.
4:   Generate q_ID via M_Sentiment and M_GSwitch (ON).
5:   Generate q_GD via M_Sentiment and M_GSwitch (OFF).
6:   Obtain L_SKD via Eq. (4).
7:   Update M_Sentiment and M_GSwitch via Eq. (1).
8: end for

Sentiment models, including classical neural models and transformer-based PLMs (Qiu et al., 2020), i.e., BERT (Devlin et al., 2019), equipped with the GSwitch module inherit the ON and OFF statuses. When the switch status is ON, the ID review representation is classified with the softmax function, and ID logits q^T_ID = f(x, k; θ) ∈ R^{d_rating} are generated from texts and ID knowledge, where d_rating is the dimensionality of the predicted labels and T is the temperature applied to soften the predicted distributions. In contrast, when the switch status is OFF, GD logits q^T_GD = f(x; θ) ∈ R^{d_rating} are generated from texts only, which is equivalent to q^T_ID = f(x, k'; θ) ∈ R^{d_rating} with the generalization knowledge of zeroed k'.
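The two inference paths can be sketched as follows; this is our illustration under the assumption that the backbone exposes a `knowledge` argument (the `dual_distributions` name and interface are hypothetical).

```python
import torch
import torch.nn.functional as F

def dual_distributions(model, x, k, T=2.0):
    """q^T_ID from texts plus knowledge (switch ON) and q^T_GD from texts
    with zeroed knowledge (switch OFF); T softens both distributions."""
    q_id = F.softmax(model(x, knowledge=k) / T, dim=-1)
    q_gd = F.softmax(model(x, knowledge=torch.zeros_like(k)) / T, dim=-1)
    return q_id, q_gd
```

Zeroing the knowledge input realizes the OFF status at the input level, so a single forward function serves both the ID and GD branches.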

Optimization
Training objective. In the training phase, the cross-entropy loss between q^1_ID and y is first applied to the multiple source domain data with supervision. To generate F_i, we introduce the q^T_GD logits, which eliminate F_s from the review representation. However, q^T_GD might not perform well due to a lack of supervised objectives. To address this issue, we introduce a bidirectional KD, namely SKD, to guide q^T_GD to learn F_i from q^T_ID. The detailed procedure is listed in Algorithm 1.
Switch knowledge distillation. To measure the distance between representations in the ID and GD distributions, KL divergence is commonly used, formulated as:

L_KL(q^t_i || q^s_i) = Σ_j q^t_{i,j} log(q^t_{i,j} / q^s_{i,j}),   (3)

where q^t presents the teacher probability that guides the convergence of the student probability q^s, and i indexes the ith sample in the source domain data.
It is typical to assign q^T_ID as the teacher probability since it is empowered with personal knowledge and direct golden-label supervision. We find that the teacher assignment also works for q^T_GD, under the insightful assumption that it applies a penalty term for better generalization as well as improving the relatedness between the ID and GD distributions (Ryu et al., 2022; Sun et al., 2022). As shown in Sec. 3, it is more robust to combine both assignments. Therefore, the SKD loss is defined as:

L_SKD = (1 / |B_s|) Σ_{i ∈ B_s} F(q_i) [ ω L_KL(q^T_ID || q^T_GD) + (1 − ω) L_KL(q^T_GD || q^T_ID) ],   (4)

where B_s denotes a batch of samples from the source data; ω ∼ B(0.5) is a random variable sampled from a Bernoulli distribution with probability 0.5; and F is a confidence filter that selects only reliable predictions to avoid noisy knowledge distillation, defined as:

F(q_i) = 1 if max(q^1_{ID,i}) ≥ ε, and 0 otherwise,   (5)

where ε is the confidence threshold.
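A minimal sketch of the SKD loss described above (our illustration; the function name is hypothetical, and the teacher logits are left undetached here, which is one of the two variants compared in Sec. 3):

```python
import torch
import torch.nn.functional as F

def skd_loss(q_id, q_gd, eps=0.3):
    """Switch KD (sketch): a Bernoulli draw omega picks the KL direction
    for the batch, and a confidence filter keeps only samples whose
    maximum ID probability reaches the threshold eps."""
    keep = (q_id.max(dim=-1).values >= eps).float()  # confidence filter
    omega = torch.bernoulli(torch.tensor(0.5))       # direction switch
    if omega > 0:  # forward: ID teaches GD, KL(q_ID || q_GD)
        kl = (q_id * (q_id.log() - q_gd.log())).sum(-1)
    else:          # backward: GD teaches ID, KL(q_GD || q_ID)
        kl = (q_gd * (q_gd.log() - q_id.log())).sum(-1)
    return (keep * kl).sum() / keep.sum().clamp(min=1.0)
```

When the two distributions coincide, the loss is zero in either direction; when the filter rejects every sample, the loss is zero as well, so no noisy gradient is propagated.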

Experiments
To investigate the effectiveness of the proposed methods, extensive experiments were conducted on the review sentiment classification task.

Datasets and Evaluation Metrics
Datasets. We evaluate our method on three personalized sentiment analysis benchmark datasets: IMDB, Yelp-2013, and Yelp-2014 (Tang et al., 2015; Zhang et al., 2021b).
Metrics. Since all datasets exhibit unbalanced distributions over ratings, we adopted Macro-F1 (F1) as an additional metric alongside accuracy (Acc) and root mean squared error (RMSE), following previous works on personalized sentiment analysis (Tang et al., 2015; Zhang et al., 2021b).

Implementation Details
Network architecture. Following previous works, we chose the well-known neural sentiment classification (NSC) model (Cheng et al., 2016) and BERT (Devlin et al., 2019) as backbone architectures. The attention mechanism (Chaudhari et al., 2021; Yuan et al., 2022) has shown high performance in sequence modeling and was selected as the priority injection target in most previous state-of-the-art works. Accordingly, we evaluated our injection method, the GSwitch module, primarily on the attention mechanisms in NSC and BERT, i.e., hierarchical attention and MHA, namely, GSwitch-NSC (att) and GSwitch-BERT (qkv). Since our method is agnostic to model architecture, more complex injection strategies applied to diverse NNs could also be used, as shown in Appendix A.3.
Hyperparameter setting. For GSwitch-BERT, we used the BERT-base-uncased version, available at HuggingFace, as initial checkpoints in all experiments. With respect to personality knowledge, d_k was set to 256. For SKD, ε and β were set to 0.3 and 1, respectively. For optimization in GSwitch-BERT, the learning rate was set to 2e-5, the batch size was set to 6 with an acceleration ratio of 4 (a virtual batch size of 24), and AdamW with a linear schedule was applied. For GSwitch-NSC, the learning rate and batch size were set to 5e-4 and 32, respectively. An early stopping strategy with a patience of 3 epochs was adopted for better generalization, monitoring the F1 scores on the Dev set D_dev. All models in our work were implemented with the PyTorch framework, and experiments were conducted on a single RTX 3090 (24 GB) GPU.

Comparative Results and Discussion
Tables 2 and 3 show the comparative experiments on all three datasets. In terms of the plain-text scenario in Table 2, it can first be found that all models achieved comparable results on the three metrics. In part, NSC and BERT performed better because the hierarchical structure and the pretraining-finetuning learning strategy were introduced, respectively. However, these models might be suboptimal because potential knowledge information was not sufficiently extracted.
As a DG method, the proposed GSwitch-NSC-DG (att) and GSwitch-BERT-DG (qkv) models performed better in the OOD scenario than the plain-text models.
In terms of ID scenarios, we compared the proposed methods with the previous state-of-the-art ID models. First, the ID models outperformed the first two groups, demonstrating that the introduction of UP information is beneficial for encoding reviews. The proposed methods were on par with the corresponding previous state-of-the-art models, e.g., GSwitch-BERT-ON (qkv) vs. MA-BERT and GSwitch-NSC-ON (att) vs. NSC+UPA, and achieved the best performance in F1, revealing the effectiveness of the proposed GSwitch modules; ablation studies can be found in Sec. 3.5.
To further analyze DG in personalized sentiment analysis, Table 3 reports detailed comparisons with respect to GSwitch-BERT and GSwitch-NSC. From the table, it can be found that although the GSwitch models (ON) achieved better results than the models (OFF), they failed when directly applied to OOD scenarios, with performance degradation (∇O column). With the introduction of the DG method, such degradation vanished, and the performance on both ID and OOD was leveraged, indicating that the F_i generated from multiple source domain data is beneficial for predictions on OOD data. Unfortunately, GSwitch-BERT (DG) revealed a slight descending trend in the I column of IMDB after DG but better performance in the holistic scenario. This finding may be somewhat limited by the relatively complex IMDB task, which covers comprehensive review representations and large-range ratings, leading to overgeneralization applied to ID feature learning. This situation further suggests that caution must be applied in practice.
The findings of both tables are twofold for sentiment analysis: 1) review representations encoded from texts alone can be leveraged via KD by adopting ID models as teachers; 2) ID models can be improved by preserving GD review representations with sufficient F_i for robust generalization to unseen UPs.

Effect of SKD
To evaluate the proposed SKD for DG problems, we conducted several experiments for analysis.
First, to investigate how the combination of bidirectional KDs generates robust representations, we fixed ω to 1 or 0 for comparison; Table 4 reports the quantitative results. Either forward or backward KL alone was feasible for knowledge distillation in DG: the forward KL distills F_i from ID representations, while the backward KL serves as a regularization term applied to ID models. Furthermore, the table presents the performance when the gradients of the teacher logits were detached in the KL divergence in Eq. (4). Generally, undetached teacher logits performed somewhat better than detached ones.
By combining both directions, SKD achieved the best results, indicating its effectiveness.
To further investigate the sensitivity of the SKD parameters, Figure 4 illustrates the performance of GSwitch-BERT-DG (qkv) on the Yelp-2013 dataset with various crucial parameters, i.e., the confidence filter threshold, the decay factor, and the temperatures. The upper two figures show that either a larger or a lower confidence threshold ε dropped the performance of GSwitch-BERT-DG (qkv), while appropriate loss decay factors achieved a salient balance in both scenarios. The lower figures depict in detail the performance of the DG methods in the ID and OOD scenarios at inference, as well as their average. With different temperatures for q_ID and q_GD, the performance of the proposed method differed in the ID and OOD scenarios: a larger temperature for q_ID than for q_GD produced higher performance in the ID scenario and lower performance in the OOD scenario, and vice versa. These findings suggest flexible applications according to practical requirements.

Effect of GSwitch Modules
GSwitch modules were proposed as a unified method that rethinks the previous works and efficiently models ID and GD review representations. Table 5 presents an ablation study on the weight and bias terms. On the three datasets, GSwitch-BERT-ON (qkv) and GSwitch-NSC-ON (att) achieved the best results. Once either the weight or the bias terms vanished, the performance dropped concurrently, indicating the effect of the GSwitch modules that combine matrix- and bias-based injections.
A clear correlation between injection places and purposes can be surveyed in previous works. In our work, the GSwitch module performs flexible knowledge injection, and Table 6 presents comparative results with different injection places. It can be found that, regardless of the injection place, all GSwitch models with status ON achieved better results than with status OFF while failing in OOD scenarios. Meanwhile, the DG method overcame such failure by building the relatedness between the ID and GD distributions, in accordance with previous observations (Sec. 2.3). In particular, when GSwitch modules were injected into submodels with more robust capabilities for modeling hidden representations, larger gains could be leveraged. As seen in Table 6, the injection places of MHA, feedforward NNs, and hierarchical attention revealed higher F1 scores than other places, consistent with other studies (Chen et al., 2016; Wu et al., 2018; Zhang et al., 2021c).
We also list the performance of various models with all possible injection places to reveal the flexibility and effectiveness of our work, as shown in Appendix A.3.

Conclusions
In this paper, a DG framework with knowledge distillation was proposed to generate robust review representations for sentiment analysis. Rethinking the previous state-of-the-art models, we introduced the GSwitch module, which connects review representations between the ID and GD distributions. To align both representations for sentiment classification, an SKD was proposed that enables ID models to preserve F_i for better generalization on OOD data. Comparative and analytical experiments indicate the effect of GSwitch modules and demonstrate that the proposed DG method can effectively eliminate domain shifts in sentiment analysis.

Limitations
There may be some potential limitations to this work:
• Due to the maximum input length limitations and cumbersome deployment of most PLMs (i.e., BERT), we limited our input lengths with a specific selector (following previous works (Sun et al., 2019; Zhang et al., 2021b)) and searched hyperparameters in a limited range, especially for batch sizes (with a maximum batch size of 6). Theoretically, better experimental results could be reported; however, we reimplemented the comparative methods and conducted all analytical experiments in the same environment with the same settings, ensuring fairness in performance comparison and problem addressing.
• Because of the characteristics of the applications in our work, and because existing DG methods are difficult to apply directly to our tasks, we only use the performance of plain-text models as DG benchmarks. However, textual information inherently carries UP-invariant signals for DG performance to some extent, and the comparative experiments indeed show that the proposed method yields better performance in the same OOD scenarios; therefore, this is reasonable for evaluating performance. To further address these limitations, we will explore more DG strategies to adapt feasible DG methods to our personalized sentiment analysis, or to more complex scenarios with inherent domain shifts introduced in the texts, such as topics (e.g., books, DVDs, electronics, and kitchen appliances).
• Last but not least, in this paper, the proposed DG method is only evaluated on personalized sentiment analysis tasks. However, our method could be applied to more applications in which domain shifts occur due to explicit knowledge injection or in which F_i can be augmented and exposed.

A Appendix
A.1 Related Work
Review sentiment analysis. Review sentiment analysis has attracted increasing interest with the utilization of complex review information. Recent studies have proven the effectiveness of introducing metaknowledge (i.e., users and products) along with texts via knowledge injection methods.
Most previous injections can be categorized into matrix- and bias-based methods. Matrix-based methods generally reshape knowledge representations that are initialized or generalized from UP information and then substitute them for existing submodel weight parameters in plain-text models (Tang et al., 2015; Amplayo, 2019; Zhang et al., 2021b,c). Bias-based methods utilize knowledge representations as biases added to the text hidden states (Chen et al., 2016; Amplayo et al., 2018; Wu et al., 2018).
The final purpose of both categories is to produce UP-specific review hidden states for robust review representation.
In our paper, we combine matrix- and bias-based injections in an efficient way and propose the GSwitch module, which effectively connects ID and GD review representations so that they further interact for robust review representation.
Domain generalization. The main aim of DG is to learn a model using data from multiple domains that can then generalize to unseen domain data (Zhou et al., 2022; Wang et al., 2022). For this purpose, most existing approaches involve data augmentation (Kang et al., 2022), F_i learning (Lu et al., 2022), and meta-learning techniques (Balaji et al., 2018). These models focus on applications such as image understanding (Krizhevsky et al., 2017), speech recognition (Hinton et al., 2012), and natural language processing (Sarikaya et al., 2014).
In contrast to most application scenarios, in which F_s and F_i are inherently fused in the input data, such as cartoon and sketch images with domain-specific styles and domain-invariant contents, in our work F_s is mainly injected as external knowledge, and F_i is primarily located in the text contents themselves. Our work belongs to F_i learning and data augmentation, where we augment ID review representations into GD ones and then apply the KD strategy to guide GD representations to learn from the ID distribution. Since the GD representation is agnostic to F_s, only F_i is preserved.
Knowledge distillation. KD technologies generally transfer knowledge learned by cumbersome teacher models to small student models (Hinton et al., 2015; Gou et al., 2021). KD has also been used for other purposes, such as metric learning (Kim et al., 2021), network regularization (Xu and Liu, 2019), and domain adaptation (DA) (Farahani et al., 2021). In particular, for DA, such methods transfer robust knowledge from teacher models learned on source domains to student models applied to target domains. Recently, DA has flexibly introduced knowledge distillation methods to handle specific scenarios, e.g., where catastrophic forgetting occurs in BERT-based DA (Ryu et al., 2022), or where a source-trained model instead of the source data is adapted to the target domain for source-data safety (Zhang et al., 2021a). Unfortunately, these KD methods require access to target domain data, while DG tasks serve unseen target domains.
Different from previous work, we explore KD to transfer F_i from ID review representations to GD review representations using only source domain data, where the ID review representation comprises fused information of F_s and F_i, while the GD review representation eliminates F_s.

A.2 Connections to Previous Models
To formulate the injection methods applied to fuse textual and nontextual representations, we mainly survey the previous works in two folds, including bias- and matrix-based methods. First, we formulate the encoding procedures of texts and personalities as follows:

k' = g(k),   (6)
x' = f(x; W_x, b_x),   (7)

where x and k present the text representation and knowledge embeddings, respectively; x' and k' present the encoded representations; f generally presents a language encoding model, i.e., a multilayer perceptron (MLP), CNN, LSTM, or transformer, with matrix- and bias-shaped parameters (W_x and b_x); and g presents a simple linear projection or a direct copy (k' = k) to produce the knowledge representation. Next, injection methods that generate knowledge-specific representations can be formulated as:

x' = f(x; W'_x, b'_x),   (8)

where • denotes a joint operation and the injection operation lies in the creation of W'_x = W_x • k' and b'_x = b_x • k'.
Bias-based injection. This basically updates the original bias term b_x by:

b'_x = b_x + reshape(k', b_x.shape),   (9)

where reshape(n, m.shape) is a function reshaping the parameter n into the exact shape of m, under the preliminary setting that n and m contain the same number of parameters.
Based on this assumption, most previous methods (Chen et al., 2016; Wu et al., 2018) take f in Eq. (7-8) as the linear projections in self-attention mechanisms to generate knowledge-specific attentive maps for classification.
Matrix-based injection. Another method is to generate knowledge-specific weights via:

W'_x = reshape(k', W_x.shape).   (10)

However, this method might be burdened with many large parameters (Amplayo, 2019), hard to optimize, and lacking interactions between the textual encoder parameters and the knowledge representation (Zhang et al., 2021b). To address these limitations, several works optimize such injections, for instance: 1) the CHIM-based method (Amplayo, 2019):

W'_x = W_x ⊙ repeat(k', W_x),   (11)

where repeat(n, M) means repeating the parameters n along the corresponding dimensionalities of M.
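The bias-, matrix-, and CHIM-style injections described above can be sketched with toy tensors; this is our illustration (the dimensions are hypothetical, and `k_bias`, `k_mat`, and `k_small` stand in for differently sized knowledge representations k', under the premise that each matches the parameter count of its target).

```python
import torch

torch.manual_seed(0)
d_in, d_out = 4, 2
W_x, b_x = torch.randn(d_out, d_in), torch.randn(d_out)

# Bias-based: b' = b_x + reshape(k', b_x.shape).
k_bias = torch.randn(d_out)
b_prime = b_x + k_bias.reshape(b_x.shape)

# Matrix-based: W' = reshape(k', W_x.shape) replaces the weights outright.
k_mat = torch.randn(d_out * d_in)
W_prime = k_mat.reshape(W_x.shape)

# CHIM-style: a smaller k' is repeated along W_x's dimensions and applied
# multiplicatively, keeping interactions with the original weights.
k_small = torch.randn(d_out, 1)
W_chim = W_x * k_small.repeat(1, d_in)
```

The contrast is visible in the last two variants: matrix-based injection discards W_x entirely, whereas the CHIM-style form scales W_x elementwise, which preserves interactions between the encoder parameters and the knowledge representation.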
2) MA-BERT (Zhang et al., 2021b), which injects knowledge into the multihead attention of PLMs. Both CHIM and MA-BERT can efficiently inject knowledge representations into text encoders with further interactions. In this paper, the proposed method combines Eq. (8) and (11) into PLMs in a more dynamic way via:

H'_x = (1 + Linear_1(H_k)) ⊙ H_x + Linear_2(H_k),   (12)

which provides a feasible connection between the ID knowledge-injected representation and the GD representation without knowledge injection, as well as a flexible injection that can be plugged into almost all inner modules of NNs.

A.3 Further Experiments
To further reveal the flexibility of injection and the DG ability of the proposed method, extensive experiments were conducted on four kinds of NNs (see Figure 5) on the three datasets, as reported in Table 7.

Figure 1 :
Figure 1: Comparisons of generalization performance in review classification. Neural models injected with UP information show high classification capability for ID UPs while degrading for unseen or anonymous UPs. Applying an extra plain-text model for unseen UPs can generate review representations while dissociating UPs from the recommender system. The proposed method masks out UP information with a generalized UP for sample augmentation. Moreover, a KD strategy facilitates the generalization-domain representation in learning only domain-invariant features from review representations injected with specific UPs. Finally, sentiment models can effectively handle reviews from historical or unseen UPs at inference.

Figure 2: The overview of the proposed method.

Figure 3: The GSwitch module.

Figure 4 :
Figure 4: Comparisons with SKD parameters on Yelp-2013. Upper: varying the confidence filter threshold ε and the decay factor β. Lower: varying the temperatures of the domain logits.

Figure 5 :
Figure 5: The diagram of four kinds of NNs.

Table 1 :
The statistics of the benchmark datasets.
All datasets were split into Train (D_train), Dev (D_dev), and Test (D_test) sets. To measure the generalization performance of our method, we chose D_s = (x, k, y) ∈ D_train as the source data with multiple ID knowledge of users and products and defined D_t = (x, k) ∈ D_test as the testing data, where we simulate unseen target domains by treating all domain knowledge in D_t as newly participating or anonymous. More details of the datasets are listed in Table 1.

Table 2 :
Results on plain-text, ID, and OOD scenarios. Ori. and Rei. mean the original figures reported in (Zhang et al., 2021b) and our reimplementation according to the publicly available source codes, respectively. All figures are averaged over five runs. Underscored figures represent the best performance in each group.

Table 3 :
Meta comparisons of F1 scores in the ID and OOD scenarios (denoted as I and O). I → O presents models learned in I and tested on O; Avg. means the average performance between the ID and OOD scenarios; ∇O denotes the discrepancy between I → O and O.

Table 4 :
Comparative F1 scores of GSwitch-BERT-GD (qkv) with different balances between forward and backward KLs. Detached teacher representations are marked. Boldface figures represent the best performance.

Table 5 :
Ablations of weight-and bias-based injections in GSwitch modules for the ID scenario.

Table 6 :
Comparative F1 performance on Yelp-2013 when different submodels were incorporated with personal knowledge via GSwitch modules.

Table 7 :
Further experiments on diverse injection strategies in four typical NNs. Figures in orange, red, and black represent the IMDB, Yelp-2013, and Yelp-2014 datasets, respectively. * represents previous models, and ** indicates the average performance of GSwitch modules over different injection places.