Always the Best Fit: Adaptive Domain Gap Filling from Causal Perspective for Few-Shot Relation Extraction

Cross-domain Relation Extraction aims to transfer knowledge from a source domain to a different target domain to address low-resource challenges. However, the semantic gap caused by data bias between domains is a major challenge, especially in few-shot scenarios. Previous work has mainly focused on transferring knowledge between domains through shared feature representations, without analyzing the impact of each factor that may produce data bias based on the characteristics of each domain. This work takes a causal perspective and proposes a new framework, CausalGF. By constructing a unified structural causal model, we estimate the causal effects of factors such as syntactic structure, label distribution, and entities on the outcome. CausalGF calculates the causal effects of these factors and adjusts them dynamically based on domain characteristics, enabling adaptive gap filling. Our experiments show that our approach better fills the domain gap, yielding significantly better results on the cross-domain few-shot relation extraction task.


Introduction
Relation Extraction (RE) is one of the key tasks of Natural Language Processing (NLP), which aims to identify the relations between given entities. RE models (Zhang et al., 2017; Yamada et al., 2020) achieve impressive performance through large-scale supervised learning based on BERT (Devlin et al., 2019) and LSTM (Hochreiter and Schmidhuber, 1997). However, collecting sufficient amounts of data for certain classes may be laborious in practice. Although fine-tuning prompt-based pre-trained language models (He et al., 2023; Liu et al., 2022) has shown superior performance on few-shot RE tasks, these models encounter challenges in dealing with cross-domain problems. The main issue is that variables such as labels, syntactic structure, and entities have different distributions in each domain, resulting in data bias across domains. From a causal perspective, the essence of this bias is that these variables have different causal effects on the results in different domains. Domain adaptation methods (Ganin et al., 2016; Shen et al., 2018) offer new insights into tackling these issues by transferring knowledge between domains through shared feature representations extracted from multiple domains. However, these works rely excessively on extracting shared features and label distributions while ignoring the unique features of each domain, which leads to inferior results when domains have significant semantic gaps. Therefore, Zhang and Lu (2022) proposed a label prompt dropout approach to eliminate the model's over-reliance on labels, but randomized dropout struggles to adequately capture the critical features of each domain.
To address this issue, our work focuses on identifying and adjusting the causal effects of variables by considering the distinct characteristics of different domains. We propose a novel framework, CausalGF, and build a unified structural causal model (SCM) (Pearl et al., 2000) to describe the cross-domain RE task, as shown in Figure 1. The values and relationships in the graph can be altered by intervention and counterfactual generation to study the causal effect of various factors. Furthermore, to adapt to the features of different domains, it is necessary to estimate the causal effects of various factors and adjust them dynamically. CausalGF implements causal operations and dynamically adjusts the causal effects of variables through prompt generation, dynamic weighting, and loss function improvement, enabling adaptive gap filling based on domain characteristics. We summarize the contributions as follows:
• To the best of our knowledge, CausalGF is the first work analyzing data bias and the influence of various factors from a causal perspective in the cross-domain few-shot RE task.
• We dynamically estimate and adjust the causal effects of factors during training and inference, enabling adaptive gap filling according to domain characteristics.
• Extensive experiments on different datasets and settings demonstrate the effectiveness of our approach. CausalGF outperforms previous state-of-the-art methods in all scenarios.

Methodology
The overall framework of CausalGF is shown in Figure 2. Sections 2.1 and 2.2 describe the structural causal modeling and the causal operations. Sections 2.3 and 2.4 describe the implementation of our method.

Structural Causal Modeling
As shown in Figure 1(a), the cross-domain RE task is represented by a unified SCM G. The variable C indicates the contextualized representation of an input text, which is output by the pretrained BERT encoder (Devlin et al., 2019). The variables S and E denote the syntactic structure and the representation of entities in the sentence, respectively, which have a direct causal effect on C. Further, as there are no other parent nodes for node C, we can represent it by the nodes S and E, as shown in Figure 1(b). L denotes the label description. The variable X is the representation of a relation for RE, which is computed from C, and Y indicates the output logits for prediction. On the edge C → X, we fuse the semantic information and linguistic structures into the SCM by adopting a Transformer (Vaswani et al., 2017) to obtain the representation of node X. The causal effect of each parent node on Y is obtained via a fully connected layer with a nonlinear transformation.
The concepts and theories of causal inference are detailed in Appendix A.
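As a concrete reading of the SCM above, the dependency structure can be written down as a small parent map. This is a sketch of our understanding of Figure 1; the variable names follow the paper, but the code itself is purely illustrative:

```python
# A minimal sketch of the paper's SCM G as a parent map (our reading of
# Figure 1; the concrete mechanisms f_i are placeholders, not the authors' code).

SCM_PARENTS = {
    "S": [],               # syntactic structure
    "E": [],               # entity representations
    "L": [],               # label description
    "C": ["S", "E"],       # context representation, determined by S and E
    "X": ["C"],            # relation representation, computed from C
    "Y": ["X", "C", "L"],  # output logits: the parents of Y in the SCM
}

def topological_order(parents):
    """Return a causal (topological) ordering of the SCM's variables."""
    order, seen = [], set()
    def visit(v):
        if v in seen:
            return
        for p in parents[v]:
            visit(p)
        seen.add(v)
        order.append(v)
    for v in parents:
        visit(v)
    return order
```

Any ordering returned by `topological_order` places every cause before its effects, which is the order in which the structural functions would have to be evaluated.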

Causal Operations
In our work, we aim to explore the causal effects of variables on the outcomes. From a causal perspective, that is, utilizing intervention and counterfactual generation as causal operations to explore the causal effects of the variables S, E, L, and X on Y in SCM G. Since the variables S and E have causal effects on both X and Y, their changes interfere with the calculation of the causal effect of X → Y. Therefore, counterfactuals S* and E* are generated and intervened upon X to block the backdoor path (Morgan and Winship, 2014) and eliminate the causal effect of the original S and E on X (Figure 1(c)). Meanwhile, the original S and E are preserved and restored to the original C to estimate C → Y and maintain the semantic information of the original input.
For SCM G, the counterfactual X* is obtained by recomputing X from the counterfactuals S* and E*; the counterfactual prediction Y_{X*} and the original prediction Y_X are then computed as Y_{X*} = f_Y(X*, C, L) and Y_X = f_Y(X, C, L), where f_Y is the function that computes Y.
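The counterfactual computation can be illustrated with a toy numeric sketch, in which f_X and f_Y are stand-in linear functions rather than the paper's Transformer and nonlinear layer:

```python
# Toy sketch of the causal operations in Section 2.2 (all mechanisms are
# stand-ins for the learned components; coefficients are invented).

def f_X(s, e):
    # relation representation computed from the (counterfactual) causes of C
    return 0.7 * s + 0.3 * e

def f_Y(x, c, l):
    # prediction from the parents of Y in the SCM: {X, C, L}
    return 0.5 * x + 0.3 * c + 0.2 * l

s, e, l = 1.0, 2.0, 0.5       # factual S, E, L
c = s + e                     # C is determined by S and E
x = f_X(s, e)                 # factual relation representation

s_star, e_star = 0.0, 0.0     # counterfactual S*, E* (entity information removed)
x_star = f_X(s_star, e_star)  # X* recomputed from S*, E*: backdoor via S, E blocked

y_x = f_Y(x, c, l)            # original prediction Y_X (original C preserved)
y_x_star = f_Y(x_star, c, l)  # counterfactual prediction Y_{X*}

# the difference isolates the effect of X on Y with C and L held fixed
effect_of_x = y_x - y_x_star
```

Because C and L are held at their factual values in both predictions, the contrast `effect_of_x` reflects only the X → Y edge, which is the point of blocking the backdoor path.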

Prompt Generation and Encoding
In order to implement the above causal operations, we obtain representations of the above variables through prompt generation and encoding.
As shown in Figure 2, variable L is represented by the label description. The counterfactual E* is obtained by extracting all entities in the sentence and connecting them in sequence. We mask these entities to obtain the counterfactual syntactic structure S*. Besides, the original input C is retained in the prompt, and we add [CLS], [L], [S], and [E] as placeholder separators between variables. The given entity pair {e_head, e_tail} is wrapped with the special tokens [E1], [/E1], [E2], and [/E2], following the approach of Zhang et al. (2019).
where h is the output embedding of each token in T_all; h_L, h_S, h_E, and h_C are taken at the corresponding placeholder positions as the representations of L, S, E, and C.
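The prompt layout described above might be assembled roughly as follows. This is a hypothetical reconstruction: the helper `build_prompt`, its parameters, and the example spans are ours, and only the special tokens come from the paper:

```python
# Hypothetical sketch of the prompt in Section 2.3: label description after [L],
# counterfactual syntax S* (entities masked) after [S], counterfactual entities
# E* (entities joined in sequence) after [E], then the original sentence C with
# the given pair wrapped in [E1]/[E2] markers.

def build_prompt(sentence_tokens, entities, head_span, tail_span, label_desc):
    # E*: all entities in the sentence, connected in sequence
    e_star = " ".join(entities)
    # S*: the sentence with every entity token masked out
    entity_tokens = {tok for ent in entities for tok in ent.split()}
    s_star = " ".join("[MASK]" if t in entity_tokens else t
                      for t in sentence_tokens)
    # C: original sentence with the given pair wrapped in marker tokens
    marked = list(sentence_tokens)
    h0, h1 = head_span
    t0, t1 = tail_span
    marked[h0] = "[E1] " + marked[h0]
    marked[h1] = marked[h1] + " [/E1]"
    marked[t0] = "[E2] " + marked[t0]
    marked[t1] = marked[t1] + " [/E2]"
    c = " ".join(marked)
    return f"[CLS] [L] {label_desc} [S] {s_star} [E] {e_star} {c}"

prompt = build_prompt(
    ["Turing", "worked", "at", "Bletchley", "Park"],
    ["Turing", "Bletchley Park"],
    head_span=(0, 0), tail_span=(3, 4),
    label_desc="place of work: organization where a person is employed",
)
```

The placeholder tokens then give the encoder fixed positions from which to read off h_L, h_S, and h_E.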

Training and Inference
In the SCM G, the parents of the outcome variable Y are denoted as E = {X, C, L}. During the training phase, we calculate the causal effect t of each variable in E on Y by utilizing the class prototype r and the variable representation h. To flexibly adjust the influence of each variable on the output, we introduce a learnable weight matrix W. The final causal effect Q_total is computed using Formula (8).
The class prototype r ∈ R^{N_C × H} is calculated by averaging the relation representations of the N support instances of each class, where N_C indicates the number of classes and H indicates the input hidden dimension. We optimize a hinge loss function by introducing the total causal effect Q_total.
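A numeric sketch of this combination step, under the assumption that each effect is a prototype/representation similarity weighted by a learnable scalar; the exact form of Formula (8) and of the loss margin are not reproduced from the paper:

```python
# Sketch of the training-time causal-effect combination in Section 2.4.
# Shapes and the precise combination rule are our assumptions.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def class_prototype(support_reprs):
    """Average the relation representations of the support instances."""
    n = len(support_reprs)
    return [sum(col) / n for col in zip(*support_reprs)]

def q_total(prototype, var_reprs, weights):
    # one effect per parent variable of Y (X, C, L), combined with W
    effects = {v: dot(prototype, h) for v, h in var_reprs.items()}
    return sum(weights[v] * effects[v] for v in var_reprs), effects

def hinge_loss(q_pos, q_neg, margin=1.0):
    # push the true class's total effect above a negative's by a margin
    return max(0.0, margin - q_pos + q_neg)

r = class_prototype([[1.0, 0.0], [0.0, 1.0]])               # -> [0.5, 0.5]
reprs = {"X": [1.0, 1.0], "C": [0.5, 0.5], "L": [0.2, 0.0]}  # toy h vectors
W = {"X": 0.5, "C": 0.3, "L": 0.2}                           # toy learned weights
q_pos, effects = q_total(r, reprs, W)
loss = hinge_loss(q_pos, q_neg=0.1)
```

Because the per-variable weights in W are learned, the contribution of each parent of Y can grow or shrink per domain, which is the mechanism behind the adaptive adjustment discussed in Section 3.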
During inference, we choose the relation y as the prediction by finding the closest class prototype to the query sentence's relation representation, where r_k is the class prototype of class k and Q^q_total is the representation of the query instance.
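Inference then reduces to picking the class whose prototype best matches the query. A minimal sketch, assuming "closest" corresponds to the highest matching score, and reusing the logits reported in the case study of Table 10:

```python
# Minimal inference sketch: choose the class whose prototype best matches the
# query. The scores here are the CausalGF logits from Table 10's case study.

def predict(scores):
    """scores[k]: match between the query representation and prototype r_k."""
    return max(scores, key=scores.get)

pred = predict({"win-defeat": 0.681, "opposite": 0.209, "named": 0.091})
```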
Experiment

We compare CausalGF with the following baseline methods: Proto-BERT (Snell et al., 2017) is a prototypical network with BERT-base (Devlin et al., 2019) serving as the backbone. BERT-PAIR (Gao et al., 2019) is a method that measures the similarity of a sentence pair. CP (Peng et al., 2020) pretrains Proto-BERT using a contrastive pre-training approach that divides sentences into positive and negative pairs. HCRP (Han et al., 2021) equips Proto-BERT with a hybrid attention module and a task-adaptive focal loss. Improved Domain Adaptation (IDA) (Yuan et al., 2022) proposes an encoder learned by optimizing a representation loss and an adversarial loss to extract the relations of sentences in the source and target domains. LPD (Zhang and Lu, 2022) introduces a label prompt dropout approach which is adaptable to cross-domain tasks.
We follow the pretraining method of LPD (Zhang and Lu, 2022), which is pretrained on the Wikipedia dataset on top of BERT-base from the Huggingface Transformers library. We perform multiple experiments with different random seeds and report the average accuracy together with the standard deviation. Detailed descriptions of the experimental settings can be found in Appendix C.

Results and Analysis
Main Results: Table 1 shows that CausalGF outperforms all baseline models on CrossRE, achieving an average improvement of at least 1.90% and 1.59% for single and multiple source domain scenarios, respectively. Compared to the previous state-of-the-art LPD, our 10-way-1-shot results show enhancements of 3.83% and 2.18% with single and multiple sources, respectively. This highlights that adjusting causal effects adaptively is superior to random strategies in reducing the model's over-dependence on labels and data distribution.
As shown in Table 2, our approach significantly outperforms HCRP, IDA, and CP by at least 4.44% and 4.10% in the 1-shot and 5-shot settings, respectively. This indicates that learning the influence of variables based on causal theory is more effective than previous adaptive and contrastive methods on cross-domain few-shot tasks.
Experimental results for other domains can be found in Appendix D.
Ablation Studies: We conduct ablation experiments on the music domain of CrossRE and the FewRel 2.0 dataset to investigate the contribution of each component of our approach. We implement a w/o counterfactual generation experiment by removing prompt generation, a w/o causal effect estimation experiment by removing the weight matrix W, and a w/o causal effect adjustment experiment by initializing a fixed W. Tables 3 and 4 indicate that removing any part of our approach leads to varying degrees of decline in model performance. Notably, the results of the w/o causal effect adjustment experiment reveal that improper utilization of causal effects can be counterproductive to the model's performance.

Causal Effects Across Different Domains
To verify our model's ability to adaptively fill gaps according to different domain characteristics, we explored the causal effect of each variable (L, S, E, C) on the results in different cross-domain tasks. Figure 3 shows the normalized average prediction logits for the ground truth, which are obtained from the final causal effects Q_i (i = L, S, E, C) under six different cross-domain scenarios. The final causal effects of the same variables differ across cross-domain scenarios, demonstrating that CausalGF adaptively fills the gap by adjusting causal effects. For instance, in domains with significant differences, entities are less influential while syntactic structure plays a more critical role, since entities are domain-specific and their semantic information is difficult for the target domain to utilize. In contrast, syntactic structure is more universal and applicable across various domains. This phenomenon is consistent with our intuition. We visualize the feature space to further demonstrate the effectiveness of CausalGF in Appendix ??.

In-Domain Experiments
To further demonstrate the effectiveness of our method, we conducted additional experiments on in-domain tasks. We conducted experiments on the FewRel 1.0 and CrossRE datasets, and the results are shown in Tables 5 and 6, respectively. The results demonstrate that our method achieves competitive performance on in-domain tasks. This indicates that CausalGF has a universal capability to enhance the model's ability to learn features and make accurate predictions. It also proves the general significance of the causal effect estimation method in relation extraction tasks.

Visualization
During the process of model forwarding, we collect the vector representations of the test samples along with their respective category labels. The t-SNE toolkit (Van der Maaten and Hinton, 2008) is used to map the high-dimensional feature space of the test samples onto a two-dimensional plane, allowing sample similarity to be measured based on these representations. To evaluate the effectiveness of CausalGF, we implemented three ablation experiments and a random dropout method (LPD) to compare with our approach. Figure 4 shows the visualization results for 100 samples from 5 labels in a 5-way-1-shot scenario. The results clearly show that CausalGF outperforms the other methods in terms of classification effectiveness. This signifies that the inclusion of counterfactual generation and the adaptation of causal effects greatly enhance the ability to learn domain-specific features.

Conclusion
In this paper, we propose CausalGF, a novel framework based on a causal perspective. By building a unified structural causal model, CausalGF estimates the causal effects of factors contributing to data bias and dynamically adjusts them to accommodate domain characteristics. Our model effectively fills the domain gap, outperforming strong baselines in various cross-domain scenarios.

Limitations
Some limitations exist in our work. Our effectiveness is only examined on the task of relation extraction; whether this method can generalize to other information extraction tasks, such as named entity recognition (NER) and event detection (ED), is not yet explored in this paper. In addition, a more fine-grained partition of variables with causal effects on the outcome may enhance the efficacy of counterfactual generation. The above issues will be explored in our future studies.

A Causal Inference

Structural Causal Model: A structural causal model distinguishes endogenous variables, which have parent nodes within the model, from exogenous variables, which do not have parent nodes and whose causes are not usually taken into consideration (e.g., node Z). The function set F is defined as F = {f_1, ..., f_n}, where f_i represents the corresponding relationship between variables. The variables V_i are determined by the functions as V_i = f_i(A_i, U_i), where A_i and U_i represent the endogenous and exogenous variables, respectively, that have a direct causal effect on V_i. In terms of causality, the parent node is the cause and the children are the effect. As depicted in Figure 5(a), variable X has a direct causal effect on variable Y (X → Y), while variable Z has a direct causal effect (Z → Y) and an indirect causal effect (Z → X → Y) on Y. The structural causal model provides a framework for comprehending the causal relationships between variables, allowing us to conduct experiments, make predictions, and intervene in these relationships.
Intervening: Intervening on a variable in a structural causal model involves fixing the value of that variable in order to study its correlation with other variables and its causal effects. Intervention can be represented by the do-calculus. For instance, the intervention on variable X can be denoted as do(X = x*), where x* represents the given value (Pearl, 2009). As shown in Figure 5(b), after intervening on variable X, the causal relationship between X and its parents is cut off. Meanwhile, the backdoor path from X to Y (X ← Z → Y) is also blocked. At this point, variable Z no longer has a simultaneous causal effect on X and Y. Therefore, intervention can remove the confounding between variables and facilitate the estimation of causal effects between variables.
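The deconfounding role of do() can be checked numerically on the toy graph of Figure 5, with invented coefficients: the observational regression of Y on X mixes the direct path with the backdoor through Z, while intervening on X recovers the direct effect:

```python
# Toy SCM from Figure 5: Z -> X, Z -> Y, X -> Y, with invented coefficients.
import random

random.seed(0)
B_ZX, B_XY, B_ZY = 1.0, 0.5, 2.0   # structural coefficients (illustrative)

def sample(do_x=None):
    """Draw one (x, y); do_x fixes X, cutting the Z -> X edge."""
    z = random.gauss(0.0, 1.0)
    x = do_x if do_x is not None else B_ZX * z + random.gauss(0.0, 0.01)
    y = B_XY * x + B_ZY * z + random.gauss(0.0, 0.01)
    return x, y

n = 20000

# Observational slope of Y on X: confounded by Z, so it is near 2.5, not 0.5.
pairs = [sample() for _ in range(n)]
xs, ys = [p[0] for p in pairs], [p[1] for p in pairs]
mx, my = sum(xs) / n, sum(ys) / n
obs_slope = (sum((a - mx) * (b - my) for a, b in zip(xs, ys))
             / sum((a - mx) ** 2 for a in xs))

# Interventional contrast E[Y | do(X=1)] - E[Y | do(X=0)]: backdoor blocked,
# so it recovers the direct effect B_XY = 0.5.
y1 = sum(sample(do_x=1.0)[1] for _ in range(n)) / n
y0 = sum(sample(do_x=0.0)[1] for _ in range(n)) / n
direct_effect = y1 - y0
```

The gap between `obs_slope` (about 2.5) and `direct_effect` (about 0.5) is exactly the confounding contributed by the backdoor path X ← Z → Y.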
Counterfactual: Counterfactuals describe the outcome of a hypothetical condition: what the outcome would be if a variable took a different value while everything else remained identical to reality. Unlike interventions, which examine the effects on outcomes of imposing certain observed values on variables in reality, counterfactuals focus on fictional scenarios that did not occur. As shown in Figure 5(c), assuming that the variable Z had been Z* in this case, the estimate of the causal effect on X can be expressed as X_{Z*}.

B Related Work
Cross-Domain Few-Shot Learning: In cross-domain few-shot learning, the base and novel classes are drawn from different domains, and their class label sets are disjoint. Although the supervised paradigm is effective on fundamental tasks, it suffers from the limitation of insufficient labeled data. To address this issue, previous work has proposed a variety of methods for few-shot learning as well as domain adaptation.
Data-based few-shot learning methods augment the data with prior knowledge to overcome the difficulty of insufficient data (Gao et al., 2018; Wu et al., 2018; Cong et al., 2021). Algorithm-based methods leverage prior knowledge to search for an initial solution that is effective for multiple tasks simultaneously, which facilitates adaptation to new tasks (Finn et al., 2017; Yoo et al., 2018). Metric-based methods employ a metric-based encoder to refine sentence embeddings in the latent space, allowing the learned latent space to generalize to novel relations with few labeled samples in the same domain (Triantafillou et al., 2017; Baldini Soares et al., 2019).
Domain adaptation studies how to benefit from different but related domains. Shen et al. (2018) introduced the Wasserstein distance to improve generalization ability by constructing a domain-invariant space between the source and target domains. Shi et al. (2018) employed an adversarial paradigm to extract class-agnostic features in different domains.

D Supplementary Experiments on CrossRE

We conducted 12 cross-domain experiments in 6 domains on the CrossRE dataset, as shown in Table 7. Except for the Music domain shown in Section 3.2, we present the results of the remaining experiments here. For the AI domain, we present the detailed results of each few-shot setting in Table 8. For the other 4 domains, we report the average results of their experiments in Table 9. Specifically, CausalGF achieves an average improvement of at least 3.42% and 3.02% for single and multiple source domain scenarios, respectively. Furthermore, our approach achieves the best performance in all of the aforementioned cross-domain experiments, demonstrating that CausalGF adapts to domain characteristics and fills the gap in different cross-domain scenarios.

E Case Study
As shown in Table 10, we present two cases from the CrossRE dataset, together with the prediction results of the previous state-of-the-art LPD and our method.
Since the semantics and application scenarios of the two labels (win-defeat and opposite) are relatively similar, LPD, a method that focuses mainly on the semantics of the labels, tends to confuse them, resulting in prediction errors.
Our approach can adaptively fill the domain gap by recognizing and dynamically adjusting the causal effects of different variables in a sentence. As a result, CausalGF is not overly dependent on a single variable in a sentence, nor is it negatively affected by the similarity of a single variable across instances. Table 10 also shows the adjustment of causal effects by our method. It can be seen that, after adjustment, different variables have different causal effects, and our model can make correct and clear predictions in this way.

Figure 1: (a) A unified SCM for the task. (b) Denoting variable C by S and E. (c) Blocking the backdoor path by intervention and counterfactual generation.

Figure 2: An overview of CausalGF. We utilize counterfactuals to acquire representations for each variable and estimate their causal effects. The final effects are dynamically adjusted by intervention. Prompt generation, dynamic weighting, and loss function design are used to implement the causal operations.

Table 1:
Figure 3: The final causal effects of the variables C, E, S, and L in six different cross-domain scenarios.

Figure 4: The t-SNE visualization results for 100 samples from 5 labels in a 5-way-1-shot scenario on CrossRE.

Figure 5: Example of a structural causal model. (a) Structural causal modeling. (b) Intervening on variable X. (c) Counterfactual generation for variable Z.

Table 7:
Domain segmentation of single source and multiple source on the CrossRE dataset.
Target Domain | Single Source | Multiple Source
Music | AI | Domains w/o Music
AI | Music | Domains w/o AI
Literature | Science | Domains w/o Literature
Science | Literature | Domains w/o Science
Politics | News | Domains w/o Politics
News | Politics | Domains w/o News

Table 10:
Two cases from the CrossRE dataset; the source domain is news and the target domain is politics. Example sentence: "United Kingdom lacks the charismatic leader needed to keep the country together and Nazi Germany successfully conquers Great Britain via Operation Sea Lion in 1940." LPD prediction logits: opposite 0.335, win-defeat 0.306, cause 0.197, ...; CausalGF prediction logits: win-defeat 0.681, opposite 0.209, named 0.091, ...; adjusted causal effects: Q_C 0.361, Q_S 0.104, Q_E 0.327, Q_L 0.178.