De-biasing Distantly Supervised Named Entity Recognition via Causal Intervention

Distant supervision tackles the data bottleneck in NER by automatically generating training instances via dictionary matching. Unfortunately, the learning of DS-NER is severely dictionary-biased: it suffers from spurious correlations that undermine both the effectiveness and the robustness of the learned models. In this paper, we fundamentally explain the dictionary bias via a Structural Causal Model (SCM), categorize the bias into intra-dictionary and inter-dictionary biases, and identify their causes. Based on the SCM, we learn de-biased DS-NER via causal interventions. For intra-dictionary bias, we conduct backdoor adjustment to remove the spurious correlations introduced by the dictionary confounder. For inter-dictionary bias, we propose a causal invariance regularizer which makes DS-NER models more robust to the perturbation of dictionaries. Experiments on four datasets and three DS-NER models show that our method can significantly improve the performance of DS-NER.


Introduction
Named entity recognition (NER) aims to identify text spans pertaining to specific semantic types, which is a fundamental task of information extraction and enables various downstream applications such as Relation Extraction (Lin et al., 2016) and Question Answering (Bordes et al., 2015). The past several years have witnessed the remarkable success of supervised NER methods using neural networks (Lample et al., 2016; Ma and Hovy, 2016; Lin et al., 2020), which can automatically extract effective features from data and conduct NER in an end-to-end manner. Unfortunately, supervised methods rely on high-quality labeled data, which is very labor-intensive to obtain and thus severely restricts the application of current NER models.

To resolve the data bottleneck, a promising approach is distant supervision based NER (DS-NER). DS-NER automatically generates training data by matching entities in easily-obtained dictionaries with plain texts. This distantly-labeled data is then used to train NER models, commonly accompanied by a denoising step. DS-NER significantly reduces the annotation cost of building an effective NER model, and therefore has attracted great attention in recent years (Yang et al., 2018; Shang et al., 2018; Peng et al., 2019; Cao et al., 2019; Liang et al., 2020; Zhang et al., 2021).

However, the learning of DS-NER is dictionary-biased, which severely harms the generalization and the robustness of the learned DS-NER models. Specifically, entity dictionaries are often incomplete (missing entities), noisy (containing wrong entries), and ambiguous (a name can be of different entity types, such as Washington). Distant supervision will generate positively-labeled instances from the in-dictionary names but ignore all other names. Such a biased dataset will inevitably mislead the learned models to overfit in-dictionary names and underfit out-of-dictionary names. We refer to this as intra-dictionary bias.
Figure 2: The proposed structural causal model for DS-NER. It can be roughly divided into two parts: distant supervision (DS) and NER. From the SCM, we identify that the intra-dictionary bias stems from the spurious correlations caused by backdoor paths, while the inter-dictionary bias stems from over-fitting the dictionary characteristics. Detailed explanations can be found in Section 2.

To illustrate this bias, Figure 1 (a) shows the average predicting likelihood of entity mentions under a representative DS-NER model (RoBERTa + Classifier (Liang et al., 2020)). We can see that there is a remarkable likelihood gap between in-dictionary mentions and out-of-dictionary mentions: the average likelihoods of out-of-dictionary mentions are < 0.2, which means that a great majority of them cannot be recalled. Furthermore, such a skewed distribution makes DS-NER models very sensitive to slight perturbations. We refer to this as inter-dictionary bias, i.e., different dictionaries can result in very different model behaviors. In the example shown in Figure 1 (b), we train the same DS-NER model using 4 dictionaries sampled from the same original dictionary, where each of them covers 90% of the entities in the original one. We can see that the predicting likelihood diverges significantly even though these 4 dictionaries share the majority of their entries. Consequently, dictionary-biased learning undermines both the effectiveness and the robustness of DS-NER models.
In this paper, we propose a causal framework to fundamentally explain and resolve the dictionary bias problem in DS-NER. We first formulate the procedure of DS-NER from the causal view with a Structural Causal Model (SCM) (Pearl et al., 2000), which is shown in the left part of Figure 2. From the SCM, we identify that the intra-dictionary bias stems from the dictionary, which serves as a confounder during model learning. The dictionary confounder introduces two backdoor paths, one from the positively-labeled instances (X_p) to the entity labels (Y) and the other from the negatively-labeled instances (X_n) to the entity labels. These backdoor paths introduce spurious correlations during learning and therefore result in the intra-dictionary bias. Furthermore, the current learning criterion of DS-NER models is to optimize the correlations between the instances (X) and entity types (Y) given one specific dictionary (D), namely P(Y|X, D). Such a criterion, however, diverges from the primary goal of learning a dictionary-free NER model (i.e., P(Y|X)), and results in the inter-dictionary bias. Based on the above analysis, unbiased DS-NER should remove the spurious correlations introduced by backdoor paths and capture the true dictionary-free causal relations.
To this end, we conduct causal interventions to de-bias DS-NER from the biased dictionary. For intra-dictionary bias, we intervene on the positive instances and the negative instances to block the backdoor paths in the SCM, so that the spurious correlations introduced by the dictionary confounder are removed. Specifically, we conduct backdoor adjustment to learn de-biased DS-NER models, i.e., we optimize the DS-NER model based on the causal distribution rather than the spurious correlation distribution. For inter-dictionary bias, we propose to leverage a causal invariance regularizer (Mitrovic et al., 2021), which makes the learned representations more robust to the perturbation of dictionaries. For each instance in the training data, the causal invariance regularizer keeps the underlying causal effects unchanged across different dictionaries. The proposed method is model-free: it can be used to resolve the dictionary bias in different DS-NER models by being applied as a plug-in during model training.
We conducted experiments on four standard DS-NER datasets: CoNLL2003, Twitter2005, Webpage, and Wikigold. Experiments on three state-of-the-art DS-NER models show that the proposed de-biasing method can effectively resolve both intra-dictionary and inter-dictionary biases, and therefore significantly improves the performance and the robustness of DS-NER in almost all settings. Generally, the main contributions of this paper are:
• We propose a causal framework, which not only fundamentally formulates the DS-NER process, but also explains the causes of both intra-dictionary bias and inter-dictionary bias.
• Based on the causal framework, we conducted causal interventions to de-bias DS-NER. For intra-dictionary bias, we conduct causal interventions via backdoor adjustment to remove spurious correlations introduced by the dictionary confounder. For inter-dictionary bias, we propose a causal invariance regularizer which will make DS-NER models more robust to the perturbation of dictionaries.
• Experimental results on four standard DS-NER datasets and three DS-NER models demonstrate that our method can significantly improve the performance and the robustness of DS-NER.

A Causal View on DS-NER
In this section, we formulate DS-NER with a structural causal model (SCM), then identify the causes of both intra-dictionary bias and inter-dictionary bias using the SCM. An SCM captures the causal effects between different variables and describes the generative process of a causal distribution, which can be visually presented as a directed acyclic graph (DAG). In an SCM, each node represents a random variable, and a directed edge represents a direct causal relationship between two variables. Based on the SCM, the confounders and backdoor paths (Pearl et al., 2000) can be identified. In the following, we describe the causal view of DS-NER and then identify the dictionary bias.

Figure 2 shows the structural causal model for DS-NER, which contains 7 key variables in the DS-NER procedure: 1) the applied dictionary D for distant annotation; 2) the unlabeled instances X, where each instance is a (mention candidate, context) pair, and in the training stage X is automatically labeled by D; 3) the positive training instances X_p, i.e., the instances in X labeled as positive (entity mentions) by dictionary D; 4) the negative training instances X_n, i.e., the instances labeled as negative by dictionary D; 5) the learned DS-NER model M, which summarizes NER evidence from the DS-labeled data during training and predicts new instances during testing; 6) the representations R, i.e., the dense representations of instances X encoded by the learned model M; 7) the predicted entity labels Y of the instances in X, based on the representations R.

Structural Causal Model for DS-NER
Having defined these variables, the causal process of DS-NER can be formulated using the SCM in two steps: the distant supervision (DS) step and the NER step. The DS step generates DS-labeled data and learns DS-NER models through the following causal relations:
• D → X_p ← X and D → X_n ← X represent the distant annotation process, which uses dictionary D to annotate the unlabeled instances X and splits them into two sets: X_p and X_n.
• X_p → M ← X_n represents the learning process, where model M is the DS-NER model learned from X_p and X_n. We denote the X_p and X_n generated from dictionary D as X_p(D) and X_n(D) respectively.
The causal relations in the NER step can be summarized as:
• M → R ← X is the representation learning procedure, which uses the learned model M to encode instances X.
• R → Y represents the entity recognition process, where the labels of instances depend on the learned representations R. We denote the entity labels corresponding to X_p and X_n as Y_p and Y_n respectively.
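As a concrete illustration of the annotation relations D → X_p ← X and D → X_n ← X above, the following minimal sketch (the dictionary, candidate spans, and function name are illustrative assumptions, not the authors' code) splits unlabeled instances X into X_p and X_n by dictionary matching:

```python
# Minimal sketch of distant annotation: dictionary D labels candidate
# (mention, context) pairs in X as positive (X_p) or negative (X_n).
# The dictionary and candidates below are illustrative only.
def distant_annotate(candidates, dictionary):
    x_p, x_n = [], []
    for mention, context in candidates:
        if mention in dictionary:        # in-dictionary name -> positive instance
            x_p.append((mention, context))
        else:                            # every other span -> negative instance
            x_n.append((mention, context))
    return x_p, x_n

dictionary = {"Washington", "London"}
candidates = [
    ("Washington", "The president visited <m> yesterday."),
    ("Berlin", "She flew to <m> today."),   # false negative: real entity, not in D
    ("yesterday", "The president visited Washington <m>."),
]
x_p, x_n = distant_annotate(candidates, dictionary)
```

Note how the real entity "Berlin" lands in X_n: these false negatives are exactly the instances through which, as discussed next, the dictionary confounder introduces spurious correlations.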

Cause of Intra-dictionary Bias
Given the distant annotations X_p and X_n, the learning process of DS-NER maximizes the probability P(Y_p=1, Y_n=0 | X_p, X_n, D). Unfortunately, because D is a confounder for X_p and X_n in the SCM, this criterion introduces spurious correlations and results in the intra-dictionary bias:
(1) When maximizing P(Y=1 | X_p, D), we want NER models to rely on the actual causal path X_p → Y. However, in the SCM there exists a backdoor path X_p ← D → X_n → M which introduces spurious correlation between Y and X_p. Intuitively, this backdoor path appears as the false negative instances in X_n. Because these false negative instances have correct entity contexts but out-of-dictionary names, they mislead the models to underfit the entity context for prediction.
(2) When maximizing P(Y=0 | X_n, D), we want NER models to rely on the actual causal path X_n → Y. However, in the SCM there exists a backdoor path X_n ← D → X_p → M which introduces spurious correlation between Y and X_n. Intuitively, this backdoor path appears as the false positive instances in X_p. Because these false positive instances have in-dictionary entity names but spurious contexts, they mislead the models to overfit the names in the dictionary.
In general, the intra-dictionary bias is caused by the backdoor paths introduced by D, and this bias misleads NER models to overfit the names in the dictionary and underfit the contexts of entities.

Cause of Inter-dictionary Bias
As mentioned above, DS-NER models are learned by fitting P(Y_p=1, Y_n=0 | X_p, X_n, D). This criterion misleads the model into learning the correlation between X and Y together with spurious information in D, because the learning criterion is conditioned on D. However, a robust NER model should fit the underlying distribution P(Y|X), rather than the dictionary-conditioned distribution P(Y|X, D). In the SCM, the dictionary D significantly influences the learned NER model M, and in turn results in different learned causal effects along the path X → R → Y and different entity predictions Y. As a result, DS-NER models fit different underlying distributions given different dictionaries, which results in the inter-dictionary bias.
However, in real-world applications, dictionaries are affected by various factors, such as their source, coverage, or construction time. Therefore, to enhance the robustness of the learning process, it is critical to alleviate the spurious influence of the dictionary D on the learned causal effects between X and Y. That is, we want DS-NER models to capture dictionary-invariant entity evidence, rather than fit dictionary-specific features.

De-biasing DS-NER via Causal Intervention
In this section, we describe how to de-bias DS-NER. Specifically, for intra-dictionary bias, we propose to use backdoor adjustment to block the backdoor paths. For inter-dictionary bias, we design a causal invariance regularizer to capture the dictionary-invariant evidence for NER.

Removing Intra-dictionary Bias via Backdoor Adjustment
Based on the analysis in Section 2.2, the intra-dictionary bias is caused by the backdoor paths X_p ← D → X_n → M and X_n ← D → X_p → M. To remove this bias, we block both backdoor paths by intervening on both X_p and X_n. After the causal intervention, the learning of DS-NER models fits the correct causal relations P(Y_p=1 | do(X_p(D)), X_n) and P(Y_n=0 | do(X_n(D)), X_p).
Here do(X_p(D)) = do(X_p = X_p(D)) represents the mathematical operation that intervenes on X_p and fixes it to X_p(D) in the whole population.
Backdoor Adjustment. To calculate the distribution P(Y_p=1 | do(X_p(D))) after the causal intervention, we conduct backdoor adjustment according to causal theory (Pearl, 2009):

P(Y_p=1 | do(X_p(D))) = Σ_i P(D_i) P(Y_p=1 | X_p(D), X_n(D_i))    (1)

where X_n(D_i) denotes the negative instances generated from the DS dictionary D_i, and P(Y_p=1 | X_p(D), X_n(D_i)) is the probability of predicting X_p(D) as Y=1, which can be formulated using a neural network-based DS-NER model parametrized by θ, i.e., P(Y | X_p, X_n) = P(Y | X_p, X_n; θ). Detailed derivations are shown in Appendix A.
Note that the distribution P(Y_p=1 | do(X_p(D))) in the causal framework is not the marginalized distribution P(Y_p=1 | X_p(D)) in the probability framework; otherwise the marginalization would take place over the conditional distribution P(D_i | X_p) rather than P(D_i). Furthermore, as shown in Figure 3, X_p=X_p(D_i) and X_n=X_n(D_j) cannot happen together in the probabilistic view unless D_i=D_j. In the causal view, however, they can happen together via the causal intervention, i.e., do(X_p=X_p(D_i)) and X_n=X_n(D_j), which is shown in Figure 3 (c). For more details, please refer to (Neal, 2020) for a brief introduction.

Similarly, to block the backdoor paths and calculate the causal distribution P(Y_n=0 | do(X_n(D))), we conduct backdoor adjustment on X_n by:

P(Y_n=0 | do(X_n(D))) = Σ_i P(D_i) P(Y_n=0 | X_n(D), X_p(D_i))    (2)

Estimating Dictionary Probabilities. Because we only have one global dictionary D, it is hard to estimate the probabilities of the other dictionaries D_i used in Equations (1) and (2). To tackle this problem, we sample K sub-dictionaries by sampling entities from the global dictionary D, where the probability of each entity being sampled corresponds to its utterance frequency in a large-scale corpus. We then apply a uniform probability assumption to these sampled dictionaries, i.e., the sub-dictionaries are used to conduct backdoor adjustment with equal dictionary probabilities P(D_i) = 1/K.

Learning DS-NER Models with Causal Relations. Given the above two causal distributions after backdoor adjustment, the DS-NER models can be effectively learned, and the intra-dictionary bias can be eliminated, based on the causal relations between X_p, X_n and Y.
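The sub-dictionary sampling step can be sketched as follows. This is a minimal stdlib sketch under stated assumptions: the function name, the frequency table, and the fixed coverage ratio are illustrative; each entity's sampling weight is proportional to its corpus frequency, and P(D_i) = 1/K is taken as uniform afterwards.

```python
import random

def sample_sub_dictionaries(dictionary, freq, k, coverage=0.7, seed=0):
    """Sample K sub-dictionaries from the global dictionary D.

    Each sub-dictionary keeps roughly `coverage` of the entities; entities
    with higher corpus frequency are more likely to be kept. The K samples
    are then used in the backdoor adjustment with uniform P(D_i) = 1/K.
    """
    rng = random.Random(seed)
    entities = sorted(dictionary)
    n_keep = max(1, int(coverage * len(entities)))
    subs = []
    for _ in range(k):
        pool = list(entities)
        weights = [freq.get(e, 1) for e in pool]   # frequency-proportional weight
        sampled = set()
        while len(sampled) < n_keep:               # weighted draw w/o replacement
            e = rng.choices(pool, weights=weights, k=1)[0]
            i = pool.index(e)
            pool.pop(i)
            weights.pop(i)
            sampled.add(e)
        subs.append(sampled)
    return subs

subs = sample_sub_dictionaries({"Washington", "London", "Berlin", "Paris"},
                               freq={"Washington": 10, "Paris": 5}, k=3)
```

Each returned sub-dictionary can then be used to re-annotate the corpus, producing the X_p(D_i) and X_n(D_i) sets required by the adjustment.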
Formally, we optimize DS-NER models by minimizing the following negative log-likelihood based on the causal relations:

L_intra(θ) = −log P(Y_p=1 | do(X_p(D)); θ) − log P(Y_n=0 | do(X_n(D)); θ)    (3)

Note that the proposed method is model-free, which means that it can be applied to the majority of previous DS-NER methods by adaptively changing the underlying parametrization of the probability distribution P(Y | X_p, X_n; θ).
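A minimal sketch of this backdoor-adjusted objective is below. The model interface (`p_pos`, `p_neg`, returning a per-instance probability given a reference instance set) is a hypothetical stand-in for P(Y | X_p, X_n; θ), and P(D_i) = 1/K is assumed uniform as described above:

```python
import math

def backdoor_adjusted_nll(x_p, x_n, pos_sets, neg_sets, p_pos, p_neg):
    """Negative log-likelihood of the two backdoor-adjusted causal distributions.

    pos_sets / neg_sets: X_p(D_i) / X_n(D_i) for the K sub-dictionaries.
    p_pos(x, x_n_set): model probability of Y=1 for a positive instance x.
    p_neg(x, x_p_set): model probability of Y=0 for a negative instance x.
    """
    k = len(neg_sets)
    loss = 0.0
    # P(Y_p=1 | do(X_p(D))) = (1/K) * sum_i P(Y=1 | x, X_n(D_i))
    for x in x_p:
        loss -= math.log(sum(p_pos(x, xn) for xn in neg_sets) / k)
    # P(Y_n=0 | do(X_n(D))) = (1/K) * sum_i P(Y=0 | x, X_p(D_i))
    for x in x_n:
        loss -= math.log(sum(p_neg(x, xp) for xp in pos_sets) / k)
    return loss
```

In an actual DS-NER model, `p_pos` and `p_neg` would share the parameters θ and the gradient of this loss would be backpropagated through them.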

Eliminating Inter-dictionary Bias via Causal Invariance Regularizer
This section describes the causal invariance regularizer that eliminates the inter-dictionary bias. Specifically, even after the backdoor adjustment for intra-dictionary bias, the causal distributions we optimize (i.e., P_pos(D) and P_neg(D)) still depend on the dictionary D. As a result, given different dictionaries, DS-NER models will fit different underlying causal distributions, which results in the inter-dictionary bias.

Ideally, a robust DS-NER learning algorithm should be dictionary-free, i.e., we should directly optimize towards the implicit distribution P(Y|X). However, this is impossible to achieve directly because the golden answer Y of X is invisible in DS-NER. To enhance the robustness of the learning process, this section proposes a causal invariance regularizer, which ensures that DS-NER models learn useful entity evidence for NER rather than fitting dictionary-specific features. Specifically, the goal of causal invariance (Pearl et al., 2000) is to ensure that the learned NER models keep similar causal effects under different dictionaries, which can be formulated as minimizing

|| P(Y | X, D_i) − P(Y | X, D_j) ||    (4)

for different dictionaries D_i and D_j, where || · || measures the distance between two distributions. However, as mentioned above, this distance cannot be directly optimized because the golden label Y of X is unknown. Fortunately, in the SCM, the impact of dictionary D on the entity label Y goes entirely through the model M and the representation R, i.e., through the path D → M → R → Y. As a result, the bias from the dictionary D can be eliminated by preserving the causal effects between X and any node on this path. A simple and reasonable solution is to preserve the causal invariance of the representation R. That is, given different dictionaries, we keep the causal effects from X to R unchanged, and therefore the causal effects of X → Y remain unchanged.
Specifically, when learning causal effects given a dictionary D, the causal invariance regularizer further enhances causal consistency with other dictionaries by minimizing the representation distances to those dictionaries:

L_CIR(θ) = Σ_x (1/K) Σ_i || R_D(x; θ) − R_{D_i}(x; θ) ||    (5)

Here R_D(x; θ) is the representation of instance x, derived from the NER model M by fitting the causal effects of dictionary D. The reference dictionaries D_i are generated in the same way as described in Section 3.1, and K is the number of sub-dictionaries. This regularizer ensures that the representations learned using different dictionaries are consistent, so the inter-dictionary bias is eliminated. Finally, we combine (3) and (5) to remove both the intra-dictionary bias and the inter-dictionary bias, and obtain the final DS-NER models by optimizing

L(θ) = L_intra(θ) + λ L_CIR(θ)    (6)

where L_intra is the backdoor-adjusted negative log-likelihood in (3), L_CIR is the regularizer in (5), and λ is a hyper-parameter which controls the relative importance of the two losses and is tuned on the development set.

Distant Annotation Settings. We use two distant annotation settings: String-Matching and KB-Matching (Liang et al., 2020). String-Matching labels the dataset by directly matching names in the dictionary with sentences. KB-Matching is more complex, using a set of hand-crafted rules to match entities. We find that KB-Matching can generate better data than String-Matching, but String-Matching is the more general setting. In our experiments, we report performance under both KB-Matching and String-Matching.
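The regularizer and the combined objective can be sketched as follows. This is a stdlib sketch under stated assumptions: representations are plain lists of floats standing in for R_{D_i}(x; θ), and the squared L2 distance to the mean representation is one simple instantiation of the distance || · || (the text does not commit to a specific metric):

```python
def causal_invariance_reg(reps):
    """reps: list of K representation vectors R_{D_i}(x), one per
    sub-dictionary D_i, for the same instance x. Penalizes the average
    squared L2 distance of each representation to their mean, keeping the
    causal effect X -> R invariant across dictionaries."""
    k, d = len(reps), len(reps[0])
    mean = [sum(r[j] for r in reps) / k for j in range(d)]
    return sum(sum((r[j] - mean[j]) ** 2 for j in range(d)) for r in reps) / k

def total_loss(intra_loss, reps, lam=0.1):
    """Combined objective: backdoor-adjusted NLL plus lambda * regularizer."""
    return intra_loss + lam * causal_invariance_reg(reps)
```

When all K sub-dictionaries induce the same representation, the regularizer is zero and only the backdoor-adjusted loss remains.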

Implementation Details

We implement BiLSTM-CRF with AllenNLP (Gardner et al., 2017), an open-source NLP research library, and the input vectors are 100-dimensional GloVe embeddings (Pennington et al., 2014). For the other baselines, we use the officially released implementations from the authors. Our source code is openly available at github.com/zwkatgithub/DSCAU.

Baselines
The proposed de-biased training strategy is both model-free and learning algorithm-free. Therefore, we use the following base DS-NER baselines and compare their performance with and without our de-biased training strategy:

DictMatch, which performs NER by directly matching text with names in a dictionary, so no learning is needed.
Naive Distant Supervision (Naive), which directly uses the weakly labeled data to train a fully-supervised model. It can be considered the lower bound of DS-NER.
Positive-Unlabeled Learning (PU-Learning) (Peng et al., 2019), which formulates DS-NER as a positive-unlabeled learning problem. It can obtain an unbiased loss estimation of the unlabeled data. However, it assumes that there are no false positive instances, which may be incorrect in many datasets.
BOND (Liang et al., 2020), which is a two-stage learning algorithm: in the first stage, it leverages a pre-trained language model to improve the recall and precision of the NER model; in the second stage, it adopts a self-training approach to further improve the model performance.

Table 1 and Table 2 show the overall performance (F1 scores) of the different baselines and our methods. For our method, we use BA to denote backdoor adjustment and CIR to denote the causal invariance regularizer. We apply our de-biasing method to three base models: RoBERTa-base, PU-Learning and BOND, yielding 6 variants: RoBERTa+BA, RoBERTa+BA+CIR, PU-Learning+BA, PU-Learning+BA+CIR, BOND+BA, and BOND+BA+CIR. We can see that:

Main Results
(1) DS-NER models are severely influenced by the dictionary bias. Without de-biasing, the naive DS-NER baselines BiLSTM-CRF and RoBERTa-base only achieve performance comparable to the simple DictMatch baseline. By taking the dictionary bias into consideration, PU-Learning and BOND with our method significantly improve the performance of DS-NER: compared with DictMatch, they achieve 4.99% and 21.98% average F1 improvements, respectively. This verifies that the dictionary bias is critical for DS-NER models.
(2) By de-biasing DS-NER models via causal intervention, our method achieves significant improvements. Compared with their counterparts, our full methods RoBERTa+BA+CIR and BOND+BA+CIR achieve 4.91% and 3.18% average improvements on the four datasets under KB-Matching (5.75% and 2.56% under String-Matching), and PU-Learning+BA+CIR achieves a 9.34% improvement on the CoNLL2003 dataset. This verifies the effectiveness of using causal intervention to de-bias DS-NER.

(3) Our method can effectively resolve both intra-dictionary and inter-dictionary biases. Both the backdoor adjustment and the causal invariance regularizer improve NER performance. Backdoor adjustment alone achieves a 3.27% F1 improvement averaged over all base models and datasets, and adding the causal invariance regularizer further improves average F1 by 4.63%.

Effects on Robustness
To verify whether the causal invariance regularizer can significantly improve the robustness of DS-NER across different dictionaries, we further compare the predicting likelihoods of golden mentions under different dictionaries. Specifically, we train the same RoBERTa-Classifier DS-NER model using 4 sampled dictionaries. Figure 4 shows the average predicting likelihood before/after applying our de-biasing method.
From Figure 4, we can see that the proposed causal invariance regularizer significantly reduces the likelihood gaps between different dictionaries. This verifies that removing the inter-dictionary bias significantly benefits the robustness of DS-NER. Furthermore, we can see that the likelihoods of golden mentions are remarkably increased, which indicates better NER performance. These results all demonstrate the effectiveness of the proposed causal invariance regularizer.

Figure 4: The likelihood variance between different dictionaries before/after using the causal invariance regularizer (RoBERTa-Classifier on CoNLL2003). We can see that the performance variance significantly decreases, which verifies that the causal invariance regularizer can significantly improve the robustness of DS-NER.

Influence of Sub-dictionaries
To conduct causal intervention, our method needs to sample sub-dictionaries from the original one.
To analyze the influence of the coverage and the quantity of sub-dictionaries, we conducted experiments on sub-dictionaries with different coverages and different quantities.
Dictionary Coverage. Figure 5 shows the results with different dictionary coverages. We can see that our method is not sensitive to the coverage of the sub-dictionaries: it achieves robust performance from 40% to 80% coverage, and all three models achieve their best performance at 70% coverage. This demonstrates the robustness of our method with respect to dictionary coverage.
Dictionary Quantity. Figure 6 shows the results with different sub-dictionary quantities. We can see that our method achieves performance improvements by sampling more sub-dictionaries. This is because more sub-dictionaries lead to a more accurate estimation of both the dictionary probability in the backdoor adjustment and the dictionary variance in the causal invariance regularizer. Furthermore, the performance using only one sub-dictionary (i.e., DS-NER without causal intervention) is significantly worse than the other settings, which further verifies the effectiveness of our method.

Related Work
DS-NER. Supervised NER models have achieved promising performance (Lample et al., 2016; Lin et al., 2019a,b).

Causal Inference. Causal inference (Pearl, 2009; Pearl and Mackenzie, 2018) has been widely adopted in psychology, politics and epidemiology for years (MacKinnon et al., 2007; Richiardi et al., 2013; Keele, 2015). It can provide more reliable explanations by removing confounding bias in data, and also provide de-biased solutions by learning causal effects rather than correlation effects. Recently, causal inference techniques have also been used in computer vision (Qi et al., 2020) and natural language processing (Wu et al., 2020; Zeng et al., 2020).

Conclusions
This paper proposes to identify and resolve the dictionary bias in DS-NER via causal intervention. Specifically, we first formulate DS-NER using a structural causal model, then identify the causes of both the intra-dictionary and inter-dictionary biases, and finally de-bias DS-NER via backdoor adjustment and a causal invariance regularizer. Experiments on four datasets and three representative DS-NER models verify the effectiveness and the robustness of our method.

A Proof of Backdoor Adjustment
We prove the backdoor adjustment for our SCM using the do-calculus (Pearl, 1995) and the truncated factorization (Neal, 2020). First, we write the joint distribution according to our causal graph:

P(D, X_p, X_n, Y, M, R, X) = P(D) P(X) P(X_p|D, X) P(X_n|D, X) P(M|X_p, X_n) P(R|M, X) P(Y|R)

Since the objective of our method is de-biasing DS-NER models during training, we ignore the unlabeled instance variable X, which is not related to the training process, and obtain:

P(D, X_p, X_n, Y, M, R) = P(D) P(X_p|D) P(X_n|D) P(M|X_p, X_n) P(R|M) P(Y|R)

Because the prediction steps M → R → Y have no causal effects on the other variables, we abbreviate P(M|X_p, X_n) P(R|M) P(Y|R) as P(Y|X_p, X_n). Finally, we obtain the simplified joint distribution:

P(D, X_p, X_n, Y) = P(D) P(X_p|D) P(X_n|D) P(Y|X_p, X_n)

We then conduct the causal intervention on X_p, i.e., do(X_p = X_p(D)), where X_p(D) denotes the positive instances generated by dictionary D; we abbreviate this as do(X_p(D)). In practice, do(X_p(D)) means that we use these positive instances to calculate the loss value, so we write Y_p=1 in the following equations to indicate this explicitly. According to the truncated factorization (Neal, 2020), P(X_p|D)=1 after the intervention, and we obtain:

P(D, X_n, Y_p=1 | do(X_p(D))) = P(D) P(X_n|D) P(Y_p=1 | X_p(D), X_n)

Next, we marginalize over D and X_n:

P(Y_p=1 | do(X_p(D))) = Σ_i Σ_{X_n} P(D_i) P(X_n|D_i) P(Y_p=1 | X_p(D), X_n)

Because P(X_n(D_i)|D_i)=1 if and only if X_n is generated by the specific dictionary D_i, we obtain:

P(Y_p=1 | do(X_p(D))) = Σ_i P(D_i) P(Y_p=1 | X_p(D), X_n(D_i))
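The final marginalization step (summing out X_n collapses to X_n(D_i) because annotation is deterministic given D_i) can be checked numerically on a toy model; all numbers below are made up solely for the check:

```python
# Toy numeric check of the last step of the derivation:
#   sum_i sum_{x_n} P(D_i) P(x_n|D_i) P(Y_p=1|X_p(D), x_n)
#     = sum_i P(D_i) P(Y_p=1|X_p(D), X_n(D_i))
# where P(x_n|D_i) = 1 iff x_n = X_n(D_i) (deterministic annotation).
p_d = {0: 0.5, 1: 0.3, 2: 0.2}            # P(D_i), sums to 1
x_n_of = {0: "n0", 1: "n1", 2: "n2"}      # X_n(D_i), deterministic per dictionary
p_y = {"n0": 0.9, "n1": 0.6, "n2": 0.3}   # P(Y_p=1 | X_p(D), x_n), illustrative
all_x_n = ["n0", "n1", "n2"]

# Left-hand side: full sum over x_n with the indicator distribution P(x_n|D_i).
lhs = sum(p_d[i] * (1.0 if x_n == x_n_of[i] else 0.0) * p_y[x_n]
          for i in p_d for x_n in all_x_n)
# Right-hand side: the collapsed backdoor-adjustment formula.
rhs = sum(p_d[i] * p_y[x_n_of[i]] for i in p_d)
assert abs(lhs - rhs) < 1e-12
```

Both sides evaluate to the same interventional probability, confirming that the indicator distribution makes the inner sum collapse to the single term X_n(D_i).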