Uncovering Main Causalities for Long-tailed Information Extraction

Information Extraction (IE) aims to extract structured information from unstructured texts. In practice, long-tailed distributions caused by the selection bias of a dataset may lead to incorrect correlations, also known as spurious correlations, between entities and labels in conventional likelihood models. This motivates us to propose counterfactual IE (CFIE), a novel framework that aims to uncover the main causalities behind data from the perspective of causal inference. Specifically, 1) we first introduce a unified structural causal model (SCM) for various IE tasks, describing the relationships among variables; 2) with our SCM, we then generate counterfactuals based on an explicit language structure to better calculate the direct causal effect during the inference stage; 3) we further propose a novel debiasing approach to yield more robust predictions. Experiments on three IE tasks across five public datasets show the effectiveness of our CFIE model in mitigating the spurious correlation issues.


Introduction
The goal of Information Extraction (IE) is to detect structured information in unstructured texts. Previous deep learning models for IE tasks, such as named entity recognition (NER; Lample et al. 2016), relation extraction (RE; Peng et al. 2017), and event detection (ED; Nguyen and Grishman 2015), are largely proposed for learning under reasonably balanced label distributions. However, in practice, these labels usually follow a long-tailed distribution (Doddington et al., 2004).
Figure 1 shows such an unbalanced distribution on the ACE2005 (Doddington et al., 2004) dataset. As a result, performance on the instance-scarce (tail) classes may drop significantly. For example, for an existing NER model (Jie and Lu, 2019), the macro F1 score on instance-rich (head) classes can be 71.6, while the score on tail classes sharply decreases to 41.7.

* Contributed equally. Accepted as a long paper in the main conference of EMNLP 2021 (Conference on Empirical Methods in Natural Language Processing).
The underlying causes for the above issue are the biased statistical dependencies between entities and classes, known as spurious correlations (Srivastava et al., 2020). For example, the entity Gardens appears 13 times in the training set of OntoNotes5.0 with the location tag LOC, and only 2 times with the organization tag ORG. A classifier trained on this dataset tends to build spurious correlations between Gardens and LOC, although Gardens itself does not indicate a location. Most existing works on addressing spurious correlations focus on images, such as re-balanced training (Lin et al., 2017), transfer learning (Liu et al., 2019), and decoupling (Kang et al., 2019). However, these approaches may not be suitable for natural language inputs. Recent efforts on information extraction (Han et al., 2018; Zhang et al., 2019) incorporate prior knowledge, which requires data-specific designs.
Causal inference (Pearl et al., 2016) is promising in tackling the above spurious correlation issues caused by unbalanced data distributions. Along this line, various causal models have been proposed for visual tasks (Abbasnejad et al., 2020; Tang et al., 2020b). Despite their success, these methods may be unsatisfactory on textual inputs. Unlike images, which can be easily disentangled with detection or segmentation methods for causal manipulation, texts rely more on context involving complex syntactic and semantic structures. Hence it is impractical to apply the methods used on images to disentangle token representations. Recent causal models for text classification (Zeng et al., 2020; Wang and Culotta, 2020, 2021) eliminate biases by replacing target entities or using antonyms. These methods do not consider structural information, which has proven effective for various IE tasks thanks to its ability to capture non-local interactions (Zhang et al., 2018; Jie and Lu, 2019). This motivates us to propose a novel framework termed counterfactual information extraction (CFIE). Different from previous efforts, our CFIE model alleviates spurious correlations by generating counterfactuals (Bottou et al., 2013; Abbasnejad et al., 2020) based on the syntactic structure (Zhang et al., 2018).
From a causal perspective, counterfactuals state what the outcome would have been had certain factors been different. This concept entails a hypothetical scenario in which the values in the causal graph can be altered to study the effect of a factor. Intuitively, the factor that yields the most significant changes in model predictions has the greatest impact and is therefore considered the main effect. Other factors with minor changes are categorized as side effects. In the context of IE with language structures, counterfactual analysis answers the question of "which tokens in the text would be the key clues for RE, NER, or ED that could change the prediction result". With that in mind, our CFIE model explores the language structure to eliminate the bias caused by the side effects and maintain the main effect for prediction. We show the effectiveness of CFIE on three representative IE tasks, including NER, RE, and ED. Our code and the supplementary materials are available at https://github.com/HeyyyyyyG/CFIE. Specifically, our major contributions are:

• To the best of our knowledge, CFIE is the first study that marries counterfactual analysis and syntactic structure to address the spurious correlation issue for long-tailed IE. We build different structural causal models (SCM; Pearl et al. 2016) for various IE tasks to better capture the underlying main causalities.
• To alleviate spurious correlations, we generate counterfactuals based on syntactic structures. To achieve more robust predictions, we further propose a novel debiasing approach, which maintains a better balance between the direct effect and counterfactual representations.
• Extensive quantitative and qualitative experiments on various IE tasks across five datasets show the effectiveness of our approach.

Model
Figure 2 demonstrates the proposed CFIE method using an example from the ACE2005 dataset (Doddington et al., 2004) on the ED task. As shown in Figure 2 (a), the two event types "Life:Die" and "SW:Quit" of the trigger killed have 511 and 19 training instances, respectively. Such an unbalanced distribution may mislead a model into building spurious correlations between the trigger word killed and the type "Life:Die". The goal of CFIE is to alleviate such incorrect correlations. CFIE employs an SCM (Pearl et al., 2016) as the causal diagram, as it clearly describes the relationships among variables.
We give the formulation of SCM as follows.
SCM: Without loss of generality, we express an SCM as a directed acyclic graph (DAG) G = {V, F, U}, where the set of observables (vertices) is denoted as V = {V_1, ..., V_n}, the set of functions (directed edges) as F = {f_1, ..., f_n}, and the set of exogenous variables (e.g., noise) (Pearl et al., 2016) as U = {U_1, ..., U_n}, one for each vertex. Here n is the number of nodes in G. In the deterministic case where U is given, the values of all variables in the SCM are uniquely determined. Each observable V_i can be derived from:

V_i = f_i(PA_i, U_i),

where PA_i ⊆ V \ {V_i} is the set of parents of V_i ("\" is an operator that excludes V_i from V), and f_i refers to the direct causation from PA_i to its child variable V_i. Next we show how our SCM-based CFIE works.

Figure 3 presents our unified SCM G_ie for IE tasks. The variable S indicates the contextualized representations of an input sentence, where the representations are the output from a BiLSTM (Schuster and Paliwal, 1997) or a pre-trained BERT encoder (Devlin et al., 2019). Z_j (j ∈ {1, 2, ..., m}) represents features such as the NER tags and part-of-speech (POS) tags, where m is the number of features. The variable X is the representation of a relation, an entity, and a trigger for RE, NER, and ED, respectively, and Y indicates the output logits for classification.
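The SCM definition above, where each observable is computed as V_i = f_i(PA_i, U_i) in the deterministic case, can be sketched in a few lines of Python. The class and the toy graph below are illustrative and not taken from the released code.

```python
# Minimal sketch of a structural causal model (SCM) as a DAG, assuming a
# deterministic setting where the exogenous variables U are fixed by the
# given inputs. Names are illustrative, not from the paper's code.

class SCM:
    def __init__(self):
        self.parents = {}    # variable -> list of parent variables
        self.functions = {}  # variable -> f_i mapping parent values to V_i

    def add(self, name, parents, f):
        self.parents[name] = parents
        self.functions[name] = f

    def evaluate(self, exogenous):
        """Compute every variable from its parents: V_i = f_i(PA_i, U_i)."""
        values = dict(exogenous)
        resolved = True
        while resolved:  # simple fixed-point pass over the DAG
            resolved = False
            for v, pa in self.parents.items():
                if v not in values and all(p in values for p in pa):
                    values[v] = self.functions[v](*[values[p] for p in pa])
                    resolved = True
        return values

# Toy graph: S -> X -> Y, with Z also feeding Y (mirrors the shape of G_ie).
g = SCM()
g.add("X", ["S"], lambda s: s * 2)
g.add("Y", ["X", "Z"], lambda x, z: x + z)
out = g.evaluate({"S": 3, "Z": 1})   # U fixed by the given inputs
print(out["Y"])                      # deterministic outcome: 3*2 + 1 = 7
```

Because U is fixed, repeated evaluation always yields the same values, which is the property that later makes interventions and counterfactuals well-defined.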

Causal Representation Learning
For G_ie, we denote the parents of Y as E = {S, X, Z_1, ..., Z_m}. The direct causal effects towards Y are linear transformations. The transformation for each edge i → Y is denoted as W_iY ∈ R^{c×d}, where i ∈ E, c is the number of classes, and d is the dimension size. We let H_i ∈ R^{d×k} denote the k representations for node i. Then, the prediction can be obtained by the summation

Y_x = Σ_{i∈E} (W_iY H_i) ⊙ σ(W_g H_X),

where ⊙ refers to the element-wise product, H_X is the representation of node X, and W_g ∈ R^{c×d} and σ(·) indicate a linear transformation and the sigmoid function, respectively.
To avoid any single edge dominating the generation of the logits Y_x, we introduce a cross-entropy loss L_iY for each edge i ∈ E. Letting L_Y denote the loss for Y_x, the overall loss L can be:

L = L_Y + Σ_{i∈E} L_iY.

Step 1 in Figure 2 (b) trains the above causal model, aiming to teach the model to identify the main cause (main effect) and the spurious correlations (side effect) for classification. Our proposed SCM is encoder-neutral and can be equipped with various encoders, such as BiLSTM and BERT.
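As a hedged sketch (the exact fusion and loss weighting in the released code may differ), the gated summation over the edges in E and the combined loss L = L_Y + Σ_i L_iY can be written with plain Python as:

```python
import math
import random

# Illustrative shapes and values; the real model uses learned encoder outputs.
random.seed(0)
c, d = 4, 8                       # number of classes, hidden size
nodes = ["S", "X", "Z1"]          # parents E of Y in the SCM

def vec(n): return [random.gauss(0, 1) for _ in range(n)]
def mat(r, k): return [vec(k) for _ in range(r)]
def matvec(W, h): return [sum(w * x for w, x in zip(row, h)) for row in W]
def sigmoid(z): return 1.0 / (1.0 + math.exp(-z))

H = {i: vec(d) for i in nodes}        # node representations H_i
W = {i: mat(c, d) for i in nodes}     # per-edge transforms W_iY
W_g = mat(c, d)

# Fused logits: per-edge logits summed, gated element-wise by sigma(W_g H_X).
gate = [sigmoid(z) for z in matvec(W_g, H["X"])]
edge_logits = [matvec(W[i], H[i]) for i in nodes]
Y_x = [g * sum(col) for g, col in zip(gate, zip(*edge_logits))]

def cross_entropy(logits, label):
    m = max(logits)
    z = [math.exp(v - m) for v in logits]
    return -math.log(z[label] / sum(z))

# Overall loss L = L_Y + sum_i L_iY, so that no single edge dominates Y_x.
y = 2
L = cross_entropy(Y_x, y) + sum(cross_entropy(matvec(W[i], H[i]), y) for i in nodes)
print(len(Y_x), L > 0)
```

The per-edge auxiliary losses act as a simple regularizer: every parent of Y must be individually predictive, so the fused logits cannot collapse onto one edge.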
Fusing Syntactic Structures into SCM: So far we have built our unified SCM for IE tasks. On the edge S → X, we adopt different neural network architectures for RE, NER, and ED. For RE, we use dependency trees to aggregate long-range relations with graph convolutional networks (GCN; Kipf and Welling 2017) to obtain H_X. For NER and ED, we adopt the dependency-guided concatenation approach (Jie and Lu, 2019) to obtain H_X.

Inference and Counterfactual Generation
Given the above SCM, we train our neural model designed for a specific task such as ED. Step 2 in Figure 2 (c) performs inference with our proposed interventions and counterfactuals.

Interventions: For G_ie, an intervention is an operation that sets a subset of variables V' ⊆ V to new values, where each variable V_i ∈ V' is assigned by manual manipulation. Thus, the causal dependency between V_i and its parents {PA_i, U_i} is cut off, as shown in Figure 3(b). Such an intervention on one variable X ∈ V' can be expressed by the do-notation do(X = x*), where x* is the given value (Pearl, 2009).
Counterfactuals: Unlike interventions, a counterfactual reflects an imaginary scenario for "what would the outcome be had the variable(s) been different". Let Y ∈ V denote the outcome variable, and let X ∈ V \ {Y} denote the variable of study. The counterfactual is obtained by setting X = x* and is formally estimated as:

Y_x*(u) = Y_{G_x*}(u),

where G_x* means assigning X = x* in all equations of the SCM G. We slightly abuse notation and use Y_x* as a short form of Y_x*(u), since the exogenous variable u is not explicitly required here (derivations are given in the supplementary materials). For SCM G, the counterfactual Y_x* of the original instance-level prediction Y_x is computed as:

Y_x* = f_Y(H_S, H_x*, H_Z_1, ..., H_Z_m),

where f_Y is the function that computes Y. Compared to the vanilla formula for Y_x, we only replace its feature H_X with H_x*.
Counterfactual Generation: There are many language structures, such as dependency and constituency trees (Marcus et al., 1993), semantic role labels (Palmer et al., 2005), and abstract meaning representations (Banarescu et al., 2013). We choose the dependency tree in our case, as it captures rich relational information and complex long-distance interactions that have proven effective on IE tasks. Counterfactuals lead us to ask: what are the key clues that determine the relation of two entities for RE, and that make a certain span of a sentence an entity or an event trigger for NER and ED, respectively? As demonstrated in Figure 2 (d), we mask entities, or the tokens within the scope of 1 hop on the dependency tree. This masked sequence is then fed to a BiLSTM or BERT encoder to output new contextualized representations S*, as shown in Figure 2 (d). Then we feed S* to the function of the edge S → X to get X*. This operation also aligns with a recent finding (Zeng et al., 2020) that the entity itself may be more important than the context in NER. By doing so, the key clues are expected to be wiped off in the counterfactual representations X*, strengthening the main effect while reducing the spurious correlations.
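A minimal sketch of the masking step, assuming a head-index representation of the dependency tree and a [MASK] placeholder token (both are our assumptions, not details stated in the paper):

```python
# Hedged sketch of counterfactual generation: mask a target token plus its
# 1-hop neighborhood on the dependency tree. `heads` gives each token's head
# index (-1 for the root); the [MASK] symbol is an assumption.

def one_hop_mask(tokens, heads, target):
    """Return a copy of `tokens` with the target token and every token
    within one dependency hop of it replaced by [MASK]."""
    to_mask = {target}
    for i, h in enumerate(heads):
        if h == target:          # dependents (children) of the target
            to_mask.add(i)
    if heads[target] >= 0:       # head of the target, if it has one
        to_mask.add(heads[target])
    return ["[MASK]" if i in to_mask else t for i, t in enumerate(tokens)]

tokens = ["The", "program", "was", "killed"]
heads = [1, 3, 3, -1]            # "killed" is the root; "program" attaches to it
masked = one_hop_mask(tokens, heads, target=3)
# masks "killed" plus its dependents "program" and "was"
print(masked)
```

The masked sequence would then be re-encoded (by the BiLSTM or BERT encoder) to produce the counterfactual representations S* described above.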

Causal Effect Estimation
As shown in Figure 4, we estimate the causal effect in Step 4 and use the representation of counterfactuals for a more robust prediction in Step 5.
Inspired by SGG-TDE (Tang et al., 2020b), we compare the original outcome Y_x with its counterfactual Y_x* to estimate the main effect, so that the side effect can be alleviated with the Total Direct Effect (TDE) (Pearl, 2009) (derivations are given in the supplementary materials):

TDE = Y_x − Y_x*.

As both the context and the entity (or trigger) play important roles for classification in NER, ED, and RE, we propose a novel approach to further alleviate the spurious correlations caused by side effects while strengthening the main effect. The causal effect for the i-th entity in a sequence can be described as:

Ŷ_i = (Y_xi − α Y_x*i) + β W_XY H_x*i,
where α and β are hyperparameters that balance the importance of the context and the entity (or trigger) for NER, ED, and RE. The first part, Y_xi − αY_x*i, indicates the main effect, which reflects more of the debiased context, while the second part, W_XY H_x*i, reflects more of the entity (or trigger) itself. Combining them yields more robust predictions by better distinguishing the main and side effects.
As shown in Step 4 of Figure 4(a), the sentence "The program was killed" produces a biased high score for the event "Life:Die" in Y_x and results in a wrong prediction due to the word "killed". When computing the counterfactual Y_x* with "program" masked, the score for "Life:Die" remains high while the score for "SW:Quit" drops. The difference computed by Y_x − αY_x* may thus help us correct the prediction by recognizing the important role of the word "program". However, we should not rely only on the context, since the entity (trigger) itself is also an important clue. To magnify the difference and obtain more robust predictions, we strengthen the impact of the entity (trigger) on the final results with the term W_XY H_x*i, as shown in Step 5 of Figure 4(b). Such a design differs from SGG-TDE (Tang et al., 2020b) by providing more flexible adjustment and effect estimation with the hyperparameters α and β. Our experiments show that this approach is more suitable for long-tailed IE tasks.
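The debiased inference step recovered from the text, Ŷ_i = (Y_xi − αY_x*i) + βW_XY H_x*i, can be sketched with toy numbers mirroring the "The program was killed" example (all scores below are invented for illustration):

```python
# Sketch of the debiased inference step. The class scores are made-up toy
# values, not model outputs from the paper.

def debiased_logits(Y_x, Y_xstar, entity_logits, alpha=1.0, beta=1.2):
    """Combine the main effect (debiased context) with the entity term:
    (Y_x - alpha * Y_x*) + beta * (W_XY H_x*)."""
    return [(yx - alpha * ys) + beta * e
            for yx, ys, e in zip(Y_x, Y_xstar, entity_logits)]

# Toy scores for classes ["Life:Die", "SW:Quit"] on "The program was killed".
Y_x = [2.0, 0.5]      # biased toward Life:Die because of "killed"
Y_xstar = [1.9, 0.1]  # counterfactual: Life:Die stays high, SW:Quit drops
entity = [0.0, 0.3]   # the W_XY H_x* term for the trigger itself

scores = debiased_logits(Y_x, Y_xstar, entity)
best = max(range(len(scores)), key=scores.__getitem__)
print(best)  # the main effect plus entity term now favors SW:Quit (index 1)
```

Setting β to 0 recovers a plain TDE-style subtraction, which is what the later β-sensitivity discussion compares against.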

Datasets and Settings
We use five datasets in our experiments: OntoNotes5.0 (Pradhan et al., 2013) and ATIS (Tur et al., 2010) for the NER task, ACE2005 (Doddington et al., 2004) and MAVEN (Wang et al., 2020b) for ED, and NYT24 (Gardent et al., 2017) for the RE task. The labels in the above datasets follow long-tailed distributions. We categorize the classes into three splits based on the number of training instances per class, namely Few, Medium, and Many, and also report the results on the whole dataset as the Overall setting. We focus more on Mean Recall (MR; Tang et al. 2020b) and Macro F1 (MF1), two more balanced metrics for measuring performance on long-tailed IE tasks. MR better reflects the capability of identifying the tail classes, and MF1 better represents the model's ability on each class, whereas the conventional Micro F1 score highly depends on the head classes and pays less attention to the tail classes. The hyperparameter α in Equation (6) is set to 1 for the NER and ED tasks, and 0 for the RE task. We tune the optimal α on the development sets.
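A sketch of this evaluation protocol, assuming per-class training counts and per-class recall are available; the split thresholds and all numbers here are invented, since the section does not state them:

```python
# Sketch of the evaluation protocol: split classes by training frequency and
# report unweighted Mean Recall, so head classes cannot dominate the score.
# Thresholds (20/100) and the example values are illustrative only.

def split_classes(train_counts, few=20, many=100):
    splits = {"Few": [], "Medium": [], "Many": []}
    for cls, n in train_counts.items():
        key = "Few" if n < few else ("Many" if n > many else "Medium")
        splits[key].append(cls)
    return splits

def mean_recall(recalls, classes):
    """Unweighted mean of per-class recall over the given classes."""
    return sum(recalls[c] for c in classes) / len(classes)

train_counts = {"LOC": 500, "ORG": 60, "LANGUAGE": 8}   # instances per class
recalls = {"LOC": 0.9, "ORG": 0.6, "LANGUAGE": 0.3}     # per-class recall
splits = split_classes(train_counts)
print(splits["Few"], round(mean_recall(recalls, list(recalls)), 3))
```

Because every class contributes equally, a model that ignores the LANGUAGE tail class is penalized here even when its micro-averaged score stays high.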

Baselines
We categorize the baselines used in our experiments into three groups and outline them as follows.
Conventional models include BiLSTM (Chiu and Nichols, 2016), BiLSTM+CRF (Ma and Hovy, 2016), C-GCN (Zhang et al., 2018), Dep-Guided LSTM (Jie and Lu, 2019), and BERT (Devlin et al., 2019). These neural models do not explicitly take long-tailed issues into consideration. Re-weighting/Decoupling models refer to loss re-weighting approaches, including Focal Loss (Lin et al., 2017), and two-stage decoupled learning methods (Kang et al., 2019) that include τ-normalization, classifier retraining (cRT), and learnable weight scaling (LWS). Causal models include SGG-TDE (Tang et al., 2020b). There are also recent studies based on the deconfounded methodology (Tang et al., 2020a; Yang et al., 2020) for images, which, however, do not appear applicable as causal baselines in our text setting. We ran some of the baseline methods ourselves, since their results have not been reported on these NLP datasets.

Task Definitions
We give the definitions of the IE sub-tasks used in our experiments as follows: named entity recognition (NER), event detection (ED), and relation extraction (RE).
Named Entity Recognition: NER is a sequence labeling task that seeks to locate and classify named entities in unstructured text into pre-defined categories such as person, location, etc.
Event Detection: ED aims to detect the occurrences of predefined events in unstructured text and locate the corresponding triggers. An event trigger is defined as the word or phrase that most clearly expresses an event occurrence. Taking the sentence "a cameraman died in the Palestine Hotel" as an example, the word "died" is considered the trigger of a "Life:Die" event.
Relation Extraction: The goal of RE is to identify semantic relationships in text, given two or more entities. For example, "Paris is in France" states an "is in" relationship between the two entities "Paris" and "France". Their relation can be denoted by the triple (Paris, is in, France).
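For concreteness, the structured outputs the three sub-tasks would produce for the example sentences above can be pictured as simple records (the field names here are our own, not a format defined by the paper):

```python
# Illustrative output records for the three IE sub-tasks, using the example
# sentences from the task definitions. Field names are invented for clarity.

ner = {"text": "Paris is in France",
       "entities": [("Paris", "LOC"), ("France", "LOC")]}

ed = {"text": "a cameraman died in the Palestine Hotel",
      "triggers": [("died", "Life:Die")]}

re_out = {"text": "Paris is in France",
          "triples": [("Paris", "is in", "France")]}

print(re_out["triples"][0])
```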

Main Results
NER: Table 1 shows the comparison results on both the OntoNotes5.0 and ATIS datasets. Our models perform best or achieve comparable results under most settings, including Few, Medium, Many, and Overall. For example, our model achieves more than 8 points higher MR compared with the C-GCN model under the Few setting with Glove embeddings on both benchmarks. The results show the superiority of CFIE in handling the instance-scarce classes for long-tailed NER.
Compared with the causal baseline SGG-TDE, our model consistently performs better in terms of the two metrics. The results confirm our hypothesis that language structure can help a causal model better distinguish the main effect from the side effect. CFIE also obtains large performance gains with the BERT-based encoder under most of the settings, showing the effectiveness of our approach in mitigating the bias issue with a pre-trained model. Interestingly, BERT-based models perform worse than Glove-based ones on ATIS. The reason is probably that BERT, which is trained on Wikipedia, may not perform well on a small dataset collected from a very different domain.

The results for relation extraction further confirm our hypothesis that the proposed CFIE is able to alleviate spurious correlations caused by an unbalanced dataset by learning to distinguish the main effect from the side effect. We also observe that CFIE outperforms the previously proposed SGG-TDE by a large margin under both the Few and Overall settings, i.e., 11.5 points and 3.4 points improvement in terms of MF1. This further proves our claim that properly exploring language structure in causal models will boost the performance of IE tasks.

Discussion
What are the key factors for NER? We have hypothesised that factors such as the 2-hop and 1-hop context on the dependency tree, the entity itself, and the POS feature may hold the potential to be the key clues for NER predictions. To evaluate the impact of these factors, we first generate new sequences by masking each factor. Then we feed the generated sequences to the proposed SCM to obtain predictions. Figure 5 illustrates how we mask the context based on a dependency tree. Figure 6 shows a qualitative example of predicting the NER tag for the entity "malacca". It visualizes the variance of the predictions, where the histograms on the left refer to the prediction probabilities for the ground-truth class, while the histograms on the right show the maximum predictions over the remaining classes. For example, the "Mask 2-hop" operation with a blue rectangle in Figure 5 masks the tokens "showed" and "Li" on the dependency tree, and the corresponding prediction probability distribution is given as the blue bar in Figure 6. We observe that masking the entity itself, i.e., "malacca", leads to the most significant performance drop, indicating that the entity plays a key role in the NER task. This also inspires the more robust debiasing method in Step 5 of our framework.
Does the syntax structure matter? To answer this question, we design three baselines: 1) Causal Model w/o Syntax, which does not employ dependency trees during the training stage and only uses them for generating counterfactuals; 2) Counterfactuals w/o Syntax, which employs dependency structures for training but uses a null input as the intervention during the inference stage, a setting adopted from a previous study (Tang et al., 2020a); and 3) No Syntax, which is the same as the previous work SGG-TDE (Tang et al., 2020b) and does not involve dependency structures in either the training or inference stage. As shown in Figure 7, our model outperforms all three baselines on the ACE2005 dataset under both the Few and All settings, demonstrating the effectiveness of the dependency structure in improving causal models for long-tailed IE tasks in both the training and inference stages.

How do various interventions and SCMs affect performance?
We study this question on the ACE2005 dataset for the ED task with three interventional methods, and further ablate the SCM by removing the linguistic feature nodes (w/o NER and POS). Figure 9 shows that removing the NER node significantly decreases the ED performance, especially under the Few setting.
The results prove the superiority of our proposed SCM, which explicitly involves linguistic features to calculate the main effect.
How does the hyperparameter β impact the performance? To evaluate the impact of β, we tuned the parameter on four datasets: OntoNotes5.0, ATIS, ACE2005, and MAVEN. As shown in Figure 10, when increasing β from 0 to 2.4 on the ATIS dataset, the F1 scores first increase dramatically and then decrease slowly, reaching the peak when β is 1.2. As the value of β represents the importance of the entity for classification, we draw the conclusion that, for the NER task, an entity plays a relatively more important role than the context (Zeng et al., 2020). We also observe that the performance drops significantly when β is 0. This suggests that directly applying the previous causal approach (Tang et al., 2020b) may not yield good performance, and further confirms the effectiveness of Step 5 in CFIE.

Case Study
Figure 11 shows two cases that visualize the predictions of the baseline models and our CFIE model for long-tailed NER and ED, respectively. We use the "BIOES" tagging scheme for both cases and choose Dep-Guided LSTM (Jie and Lu, 2019) and SGG-TDE (Tang et al., 2020b) as baselines. In the first case, for NER, the baseline assigns "chinese" the label S-NORP, which indicates "nationalities or religious or political groups", while the correct annotation is S-LANGUAGE. This is caused by the spurious correlations between "chinese" and S-NORP learned from unbalanced data: there are 568 and 20 training instances for S-NORP and S-LANGUAGE, respectively. The number of training instances for each type is given in the third column of Figure 11. The numbers in the 4th to 6th columns indicate the probability of the token "chinese" being predicted as a certain label. For example, in the 6th column, "Predictions of CFIE", the prediction probability is 35.17% for the label S-LANGUAGE. In the second case, for ED, we demonstrate a similar issue for the trigger word "attack" and compare with the two baselines. For both cases, SGG-TDE outputs relatively unbiased predictions compared with Dep-Guided LSTM, although its predictions are also incorrect.
Our CFIE model obtains correct results for both instances, showing the effectiveness of our novel debiasing approach. The inferior performance of SGG-TDE compared to CFIE is due to its ignoring the importance of the entity (trigger) for the NER and ED tasks.
Related Work

Causal Inference: Causal inference (Pearl et al., 2016; Rubin, 2019) has been applied in many areas, including visual tasks (Tang et al., 2020b; Abbasnejad et al., 2020; Niu et al., 2021; Yang et al., 2020; Zhang et al., 2020a; Yue et al., 2020; Yang et al., 2021; Nan et al., 2021b), model robustness and stable learning (Srivastava et al., 2020; Zhang et al., 2020a; Shen et al., 2020; Yu et al., 2020; Dong et al., 2020), generation (Wu et al., 2020), language understanding (Feng et al., 2021b), and recommendation systems (Jesson et al., 2020; Feng et al., 2021a; Zhang et al., 2021d; Wei et al., 2021; Wang et al., 2021a; Tan et al., 2021; Wang et al., 2021b; Ding et al., 2021). The works most related to ours are (Zeng et al., 2020) and (Wang and Culotta, 2021), which generate counterfactuals for weakly-supervised NER and text classification, respectively. Our method is also remotely related to (Tang et al., 2020b), proposed for image classification. The key differences between our method and previous ones are: 1) counterfactuals in our method are generated by a task-specific pruned dependency structure for various IE tasks, while in previous works counterfactuals are generated by replacing the target entity with another entity or its antonym (Zeng et al., 2020; Wang and Culotta, 2021), or by simply masking the target objects in an image (Tang et al., 2020b); these methods do not consider the complex language structure that has proven useful for IE tasks; 2) compared with the previous method SGG-TDE (Tang et al., 2020b), our inference mechanism is more robust for various IE tasks, simultaneously mitigating spurious correlations and strengthening salient context.

Concluding Remarks
This paper presents CFIE, a novel framework for tackling long-tailed IE issues from the view of causal inference. Extensive experiments on three popular IE tasks, namely named entity recognition, event detection, and relation extraction, show the effectiveness of our method. Our CFIE model provides a new perspective on tackling spurious correlations by exploring language structures based on structural causal models. We believe that our models may also find applications in other NLP tasks that suffer from spurious correlation issues caused by unbalanced data distributions. Our future work includes developing more powerful causal models for long-tailed distribution problems using task-specific language structures learned from the data. We are also interested in addressing spurious correlations in various vision-and-language tasks (Nan et al., 2021b; Li et al., 2021; Xu et al., 2021a; Fan et al., 2020; Liu et al., 2021; Zhang et al., 2021a,b; Chen et al., 2020, 2021).

Figure 1: Class distribution of the ACE2005 dataset.

Figure 3: Our unified SCM G_ie for IE tasks, based on our prior knowledge. The variable S indicates the contextualized representations of an input sentence.

Figure 4: Causal effect estimation: (a) Step 4 computes the TDE by subtracting the outputs of Steps 2 and 3 for each token, e.g., Y_xi − αY_x*i. (b) Step 5 obtains more robust predictions by highlighting each token's counterfactual representation, e.g., W_XY H_x*i for killed.

Figure 11: Two cases selected from OntoNotes5.0 and MAVEN for the NER and ED tasks, respectively, with unbalanced distributions for the target entity and event trigger. The two baseline models, Dep-Guided LSTM and SGG-TDE, tend to predict incorrect results caused by spurious correlations, while our proposed CFIE model yields better predictions.

Table 1: Evaluation results on the OntoNotes5.0 and ATIS datasets for the NER task.

Table 2: Evaluation results on the ACE2005 and MAVEN datasets for event detection.

RE: As shown in Table 3, we further evaluate CFIE on the NYT24 dataset. Our method significantly outperforms all other methods in MR and MF1 for tail classes. The overall performance is also competitive. Although Focal Loss achieves the best overall scores, its ability to handle the classes with very few data points drops significantly, which is the main focus of our work.

Table 3: Results on the NYT24 dataset for RE.
Re-weighting approaches (Kang et al., 2019) assign weights to the losses of training samples from each class to boost discriminability via robust classifier decision boundaries. Another line is decoupling approaches (Kang et al., 2019) that decouple the representation learning and the classifier.