How Fragile is Relation Extraction under Entity Replacements?

Relation extraction (RE) aims to extract the relations between entity names from the textual context. In principle, the textual context determines the ground-truth relation, and RE models should correctly identify the relations reflected by the textual context. However, existing work has found that RE models memorize entity name patterns to make predictions while ignoring the textual context. This motivates us to raise the question: are RE models robust to entity replacements? In this work, we apply random and type-constrained entity replacements to the RE instances in TACRED and evaluate state-of-the-art RE models under these replacements. We observe 30%-50% F1 score drops on the state-of-the-art RE models under entity replacements. These results suggest that more effort is needed to develop RE models that are robust to entity replacements. We release the source code at https://github.com/wangywUST/RobustRE.


Introduction
Recent literature has shown that sentence-level relation extraction (RE) models may overly rely on entity names instead of reasoning from the textual context (Peng et al., 2020; Wang et al., 2022). This problem is also known as entity bias (Longpre et al., 2021; Qian et al., 2021; Xu et al., 2022; Wang et al., 2022): the spurious correlation between entity names and relations. This motivates us to raise a question: "how robust are RE models under entity replacements?" Entity bias degrades RE models' generalization, such that entity names can mislead the models into wrong predictions. However, a seemingly conflicting phenomenon is that RE models exhibit high (in-distribution) accuracy on standard benchmarks, such as TACRED. In our work, we find that these benchmarks are prone to have shortcuts from entity names to ground-truth relations (see Fig. 2), low entity diversity, and a large portion of incorrect entity annotations. These issues suggest that, given the presence of entity bias, the current benchmarks are not challenging enough to evaluate the generalization of RE in practice. (* The work was done prior to joining Amazon.)
Most existing methods for evaluating the generalizability of NLP models focus on sentence classification (Jin et al., 2020; Li et al., 2020; Minervini and Riedel, 2018) and question answering (Jia and Liang, 2017; Ribeiro et al., 2018; Gan and Ng, 2019), but these methods lack designs that target the entity bias in RE. In this work, we propose a type-constrained and random entity replacement method: ENTRE. Type-constrained means we replace a named entity of type [PERSON] or [ORGANIZATION] with a new entity belonging to the same type as the original. Random means we randomly select the new entity names from a Wikipedia entity lexicon that consists of 24,933 organization and 902,007 person entities. These two principles guarantee that entity replacement produces valid and diverse RE instances.
We apply ENTRE to TACRED to produce ENTRED, a challenging RE benchmark with fewer shortcuts and higher entity diversity. We evaluate the RE models on the instances with replaced entity names produced by ENTRE.
We analyze the RE models under entity replacements in order to answer four research questions: (Q1) Does ENTRE reduce prediction shortcuts from entity names to the ground-truth relations? (Q2) Does ENTRE improve the entity diversity? (Q3) How do the strong RE models perform under entity replacements? (Q4) How can we improve the generalization of RE?
We observe several key findings. First, ENTRE reduces the shortcuts by more than 50% on many relations and improves subject name diversity by more than 25 times compared to TACRED. Second, the strong RE models LUKE (Yamada et al., 2020) and IRE (Zhou and Chen, 2021) tend to memorize entity-relation patterns to infer the relation instead of reasoning over the textual context that actually describes it. This makes the models brittle to entity replacements, resulting in a significant performance drop of 30%-50% in terms of F1 score. Third, the recent causal inference approach CoRE (Wang et al., 2022) improves robustness by a larger margin than other methods. We believe the proposed benchmark ENTRED and the method ENTRE will benefit future research toward improving the robustness of RE.

Analysis of Entity Names in TACRED
Before building ENTRED, we first analyze the existing popular RE datasets. Our analysis is focused on the following three perspectives: 1) the correctness of entity name annotations; 2) the diversity of entity names; 3) the prediction shortcuts from entity names to the ground-truth relations.
In the popular TACRED (Zhang et al., 2017), TACREV (Alt et al., 2020), and Re-TACRED (Stoica et al., 2021) datasets, we find that: first, a portion of the entity name annotations are incorrect; second, many entity names are reused more than one hundred times across instances; third, the entity names in more than 70% of the instances act as shortcuts to the ground-truth relations. We introduce the details as follows.

Incorrect Entity Annotations
In the TACRED (Zhang et al., 2017), TACREV (Alt et al., 2020), and Re-TACRED (Stoica et al., 2021) datasets, there exist quite a few incorrect entity annotations. To detect them, we use a BERT-based NER model (Devlin et al., 2019) to automatically annotate the subject and object entity names in the TACRED dataset. Then, we conduct a manual investigation of the entities where the NER annotations differ from the original TACRED annotations. We find that more than 10% of the test instances contain incorrect entity annotations.1 We present two examples in Fig. 3. Using these mistaken entity annotations to evaluate RE models compromises our goal of correctly measuring RE performance.

Diversity of Entity Names
The TACRED, TACREV, and Re-TACRED datasets have low diversity of entity names: most entity names appear repeatedly in a large portion of instances (see Fig. 4). In the TACRED dataset, only 420 distinct entity names appear as the subjects of 15,509 instances. For example, "ShopperTrak" has repeatedly appeared as the subject entity in 270 instances. This heavy reuse of entity names increases the risk that RE models rely on entity bias to make predictions. Also, with these benchmarks, it is impossible to comprehensively evaluate the generalization of RE models on a diverse set of entity names that imitates real-world scenarios.
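To make this diversity statistic concrete, the ratio of distinct subject names to instances can be computed in a few lines of Python. This is a minimal sketch over hypothetical toy records; the field name `subj` and the toy data are assumptions, not the actual TACRED loader.

```python
from collections import Counter

def subject_diversity(instances):
    """Return the ratio of distinct subject names to instances, plus the
    most frequent subject. A ratio near 0 indicates heavy name reuse."""
    subjects = [inst["subj"] for inst in instances]
    counts = Counter(subjects)
    return len(counts) / len(subjects), counts.most_common(1)[0]

# Toy data mimicking the reuse pattern described above (hypothetical rows).
toy = [{"subj": "ShopperTrak"}] * 4 + [{"subj": "John Smith"}]
ratio, (name, freq) = subject_diversity(toy)
```

On the real test sets this ratio is far below 1, which is exactly the limitation Fig. 4 visualizes.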

Causal Inference for Entity Bias
We follow the prior work (Wang et al., 2022) to analyze entity bias based on causal inference. Based on the causal graph displayed in Figure 5, we can diagnose whether the entities provide shortcuts to the relation. Wang et al. (2022) distill the entity bias by counterfactual analysis, which assigns a hypothetical combination of values to variables in a way that is counter to the empirical evidence obtained from data. We mask the tokens in X to conduct the intervention X = x on X, while keeping the variable E as the original entity mentions e. In this way, the textual context is removed and the entity information is maintained. Accordingly, the counterfactual prediction is denoted as Y_{x,e} (see Figure 2). Y_{x,e} refers to the output, i.e., a probability distribution or a logit vector, when only the entity mentions are given.
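This intervention can be sketched over token lists as follows. The span representation and the `predict` callable are hypothetical placeholders, not the paper's actual implementation: a shortcut is flagged when the entity-only counterfactual prediction already matches the gold relation.

```python
def counterfactual_input(tokens, subj_span, obj_span, mask="[MASK]"):
    """Intervene on the context X: mask every token outside the entity
    spans while keeping the entity mentions E unchanged."""
    keep = set(range(*subj_span)) | set(range(*obj_span))
    return [t if i in keep else mask for i, t in enumerate(tokens)]

def has_shortcut(predict, tokens, subj_span, obj_span, gold):
    """An instance has a shortcut if the entity-only counterfactual
    prediction Y_{x,e} already equals the ground-truth relation."""
    return predict(counterfactual_input(tokens, subj_span, obj_span)) == gold

tokens = ["Bill", "Gates", "founded", "Microsoft", "."]
masked = counterfactual_input(tokens, (0, 2), (3, 4))
# masked keeps only the entity mentions; all context tokens are "[MASK]"
```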

Shortcuts to the Ground-Truth Relations
Existing work has found that the test sets of popular RE benchmarks provide abundant shortcuts from entity names to ground-truth relations (Wang et al., 2022; Peng et al., 2020). In other words, on many instances, the model need not "extract" the relation from the textual context but can infer the correct prediction directly through shortcuts from the entities.
To verify these observations, we conduct a preliminary study of the shortcuts using the strong RE model LUKE (Yamada et al., 2020) on the TACRED dataset. We first compute the instance-wise relation extraction results on TACRED's test set. Then, we analyze the shortcuts from entity names to relations based on causal inference (see details in Sec. 2.3). We find that a large portion of instances have shortcuts from entity names to the ground-truth relations. We visualize the ratio of instances that present shortcuts for different relations in Fig. 6. Last but not least, we observe similar phenomena on other models and on the TACREV and Re-TACRED datasets as well.
These analyses suggest that, because of the shortcuts from entity names, the benchmarks do not accurately evaluate the "extraction" capability of RE models. In other words, the standard RE benchmarks are not challenging enough to evaluate whether RE models can extract the correct relations from the textual context. In our work, we replace the entity names to reduce the shortcuts, mitigating the possibility that RE models rely on entity bias to achieve over-optimistically high RE performance. Our ENTRED better simulates real-world scenarios with fewer shortcuts and higher entity diversity, providing a better evaluation of the generalization of RE models.

Entity Replacement for RE
We present ENTRE: a simple yet effective procedure to generate high-quality RE instances with entity replacements. ENTRE replaces entity names in the RE instances in a random and type-constrained manner. We apply ENTRE to the test set of TACRED to evaluate the state-of-the-art RE models' robustness under entity replacements.

Targeting the Instances for Replacements
We desire entity replacements that do not affect the soundness of the language. As analyzed in Sec. 2.1, there exists a significant amount of incorrect entity annotations. To handle them, we use a BERT-based NER model (Devlin et al., 2019) to re-annotate the entities in the TACRED dataset. Then, we conduct a further manual investigation of the entity annotations. We filter out the instances with incorrect entity annotations and only replace the tokens that belong to named entities. This ensures that our entity name replacements do not alter the ground-truth relation labels.
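The filtering step can be sketched as follows. Here `ner_fn` stands in for the BERT-based tagger, and the record fields and the toy tagger are illustrative assumptions: an instance is kept only if its annotated subject/object spans agree with the tagger's output.

```python
def filter_mismatched(instances, ner_fn):
    """Split instances into those whose annotated subject/object agree
    with an external NER tagger's output and those flagged as suspect."""
    clean, flagged = [], []
    for inst in instances:
        predicted = set(ner_fn(inst["text"]))  # {(mention, type), ...}
        annotated = {(inst["subj"], inst["subj_type"]),
                     (inst["obj"], inst["obj_type"])}
        (clean if annotated <= predicted else flagged).append(inst)
    return clean, flagged

# Toy tagger: marks capitalized unigrams as PERSON (illustrative only).
toy_ner = lambda text: {(w, "PERSON") for w in text.split() if w.istitle()}
good = {"text": "Alice met Bob", "subj": "Alice", "subj_type": "PERSON",
        "obj": "Bob", "obj_type": "PERSON"}
bad = {"text": "Alice met Bob", "subj": "met", "subj_type": "PERSON",
      "obj": "Bob", "obj_type": "PERSON"}
clean, flagged = filter_mismatched([good, bad], toy_ner)
```

The flagged set then goes to the manual investigation described above.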
Besides the incorrect entity annotations, there are also some entities for which replacement may inevitably cause noise. For example, some entities belong to the [MISC] (miscellaneous) class. If we replace a [MISC] entity with another [MISC] one, we are likely to break the semantics of the original sentence. In contrast, replacing [PERSON] and [ORGANIZATION] entities with others of the same type generally does not affect the ground-truth relations. We notice that all the instances in TACRED have a [PERSON] or [ORGANIZATION] entity as the subject or object. Therefore, in our work, we focus on replacing the [PERSON] and [ORGANIZATION] entities.

A Large Lexicon of Entities
We set two standards for the new entity names selected for replacement: 1. The new entity belongs to the same type as the replaced one.
2. The new entity names are more diverse.
These two principles contribute to making the resulting instances natural, i.e., containing real, valid entities that are of the same class as the original entities and are linguistically sound; challenging, i.e., the new entities may not offer shortcuts to the model, which cannot easily get the correct extraction result by seeing only the entity names; and comprehensive, i.e., the robustness of RE is evaluated on a more diverse set of entities.
To satisfy the above principles, we first build a large entity name lexicon to provide the new entity names for replacement. The size of this lexicon determines the diversity of entity names in our new RE benchmark ENTRED. Also, a larger lexicon helps us evaluate the generalization of RE models on more out-of-domain entity names at test time. Therefore, in addition to the entity names appearing in TACRED, we collect entity names from the Wikipedia categories of person and organization to enrich the lexicon. Overall, we collect 24,933 organization and 902,007 person names from Wikipedia.2

Entity Replacements
Based on the constructed entity lexicon, we propose ENTRE: a type-constrained and random entity replacement method. Type-constrained means we replace a named entity of type [PERSON] or [ORGANIZATION] with a new entity belonging to the same type as the original. Random means we randomly select the new entity names from our entity lexicon, which consists of 24,933 organization and 902,007 person entities. These two principles guarantee that entity replacement produces valid RE instances. We iterate over the TACRED instances and replace the entity names. We summarize ENTRE as the following pipeline: 1. Collect the instances whose predictions are the same as the ground-truth relation. 2. Replace the entity names in the instances collected in Step 1, then repeat Step 1.
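The pipeline above can be sketched as follows. The tiny lexicon, the record fields, and the `predict` callable are illustrative placeholders (the real lexicon holds 24,933 organizations and 902,007 persons), so this is a sketch of the procedure rather than the released implementation.

```python
import random

LEXICON = {  # tiny stand-in for the Wikipedia-derived lexicon
    "PERSON": ["Ada Lovelace", "Grace Hopper"],
    "ORGANIZATION": ["Acme Corp", "Globex"],
}

def replace_entities(instance, rng):
    """One type-constrained random replacement of subject and object."""
    new = dict(instance)
    for role in ("subj", "obj"):
        etype = instance[f"{role}_type"]
        candidates = [e for e in LEXICON.get(etype, []) if e != instance[role]]
        if candidates:
            repl = rng.choice(candidates)
            new["text"] = new["text"].replace(new[role], repl)
            new[role] = repl
    return new

def entre(instance, predict, rng, max_rounds=200):
    """Steps 1-2: keep replacing entities while the model's prediction
    still matches the (unchanged) ground-truth relation."""
    cur = instance
    for _ in range(max_rounds):
        if predict(cur) != cur["relation"]:
            break
        cur = replace_entities(cur, rng)
    return cur

inst = {"text": "Grace Hopper joined Globex.", "subj": "Grace Hopper",
        "subj_type": "PERSON", "obj": "Globex", "obj_type": "ORGANIZATION",
        "relation": "per:employee_of"}
replaced = replace_entities(inst, random.Random(0))
```

Note that the ground-truth label is never touched; only the entity mentions (and their occurrences in the text) change.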
The above steps can be repeated many times, and more repetitions lead to a stronger adversary. We could keep repeating until all the entities in the lexicon have been used, but that would induce too long a running time. Therefore, in our work, we set the maximum number of repetitions to 200.
Step 1 requires inference on many test instances, which is time-consuming. Considering that the F1 calculation for RE takes "no_relation" as the background class, we can alternatively collect only the instances not belonging to the "no_relation" class in Step 1. We denote this alternative as ENTRE-fast, which saves 90% of the evaluation time in our experiments.
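The ENTRE-fast filter is a one-liner; the `relation` field name is an illustrative assumption:

```python
def entre_fast_targets(instances):
    """Since RE's F1 treats "no_relation" as the background class, it
    suffices to attack only the positive instances, skipping the rest."""
    return [i for i in instances if i["relation"] != "no_relation"]

pool = [{"relation": "no_relation"}, {"relation": "per:employee_of"},
        {"relation": "no_relation"}]
targets = entre_fast_targets(pool)
```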
We create the challenging RE benchmark ENTRED by applying ENTRE to the test set of the public benchmark TACRED. The overall statistics of ENTRED are shown in Table 1, alongside those of the original TACRED dataset. The number of sentences in ENTRED is slightly smaller than that in TACRED because we filter out the instances with incorrect entity annotations. We showcase ENTRE using TACRED in this paper because of its popularity for evaluating RE models and its comprehensive relation-type coverage. However, ENTRE can be applied to other RE datasets.

Experiments
In this section, we investigate ENTRE and use it to evaluate the robustness of the strong RE models LUKE (Yamada et al., 2020) and IRE (Zhou and Chen, 2021), as well as other methods that can improve the robustness of RE. Our experimental settings closely follow those of previous work (Zhang et al., 2017; Zhou and Chen, 2021; Nan et al., 2021) to ensure a fair comparison. We organize our results and analysis around four main research questions and their answers. Q1: Does ENTRE reduce shortcuts?
ENTRE leads to fewer shortcuts from entity names to ground-truth relations. We perform causal inference over ENTRED to analyze how many instances still have shortcuts from entity names to the ground-truth relations after the entity replacements.
We present a comparison of the shortcut ratios of ENTRED and TACRED on different relations in Fig. 7. We observe that ENTRED greatly reduces the shortcuts, for more than 50% of the instances on most relations. As a result, when evaluated on ENTRED, RE models have to extract the informative signals describing the ground-truth relations from the textual context, rather than rely on shortcuts from the entity names.
Q2: Does ENTRE improve entity diversity?

Comparison between ENTRED and existing benchmarks. As analyzed in Sec. 2.1, the diversity of entity names in the existing benchmarks TACRED, TACREV, and Re-TACRED is rather limited. This limitation hinders the evaluation of the generalization of RE. In our work, thanks to our large lexicon built from Wikipedia entity names, ENTRED has much higher diversity than TACRED and Re-TACRED, as shown in Fig. 8. With these diverse entity names, ENTRED can evaluate the performance of RE models on a larger scale of diverse entities, which better imitates real-world scenarios.
Q3: How robust is RE under entity replacements?

Main Results
We evaluate the robustness of the state-of-the-art RE models LUKE (Yamada et al., 2020) and IRE (Zhou and Chen, 2021) under entity replacements. Our experimental settings closely follow those of previous work (Zhang et al., 2017; Zhou and Chen, 2021; Nan et al., 2021) to ensure a fair comparison. We visualize the empirical results in Fig. 1. We observe that 30%-50% drops in F1 score occur for the state-of-the-art RE models after entity replacements. These results suggest that a large gap remains between current research and truly effective RE models that are robust to entity replacements.
We show the F1 scores on ENTRED in Table 2. We can see that the state-of-the-art LUKE has a significant performance drop on our challenging ENTRED: a 44% relative decrease in F1 compared to its result before entity replacements.

Case Study
We conduct case studies to empirically examine the effects of the entity replacements of ENTRE. Table 3 gives a qualitative comparison between the RE results on TACRED and our ENTRED. The results show that our ENTRE misleads the strong RE model LUKE into predicting incorrect relations. For example, given the TACRED instance "Finance Ministry spokesperson Chileshe Kandeta who confirmed this on Sunday said Magande signed a loan agreement of 31 million dollars with the ADF for the country's Poverty Reduction Budget Support.", there is no relation between the subject and object in the text. After the entity replacement, LUKE believes that the relation between them is "members". The entity bias can account for this result: given only the entity mentions American Association of University Women and Willingboro Chapter, the RE model returns the relation "members" without any textual context. This implies that the model makes its prediction for the original input by relying on the entity mentions, which leads to the wrong RE prediction. In our work, we replace the original entities with new ones that convey an entity bias different from the ground-truth label to test the generalization of RE models under entity bias.

Memorizing or Reasoning?
We propose ENTRE to test the ability to use the textual context to infer relations. As the entity replacements of ENTRE do not affect the ground-truth relations, RE models should be robust against entity name changes. However, we observe large performance drops under our entity replacements.
Therefore, we conclude that the strong RE model LUKE is apt to memorize entity name patterns for predicting relations and is brittle when the entities in the input text convey biases that differ from the ground-truth relations. To make RE models more robust, we believe an important future direction is to develop context-based reasoning approaches that take advantage of inductive biases toward the textual context that determines the relations.

Q4: How to improve the generalization?
Methods. In our work, we consider the following methods to improve the generalization of RE: (1) Focal (Lin et al., 2017) adaptively reweights the losses of different instances so as to focus on the hard ones. (2) Resample (Burnaev et al., 2015) up-samples rare categories by the inverse sample fraction during training. (3) Entity Mask (Zhang et al., 2017) masks the entity mentions with special tokens to reduce over-fitting on entities. (4) CoRE (Wang et al., 2022) is a causal inference based method that mitigates entity bias.
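As an illustration of how Focal reweighting focuses on hard instances, here is a single-example sketch of the focal loss from Lin et al. (2017); the probability values are made up for illustration.

```python
import math

def focal_loss(p_true, gamma=2.0):
    """Focal loss for one instance: cross-entropy -log(p) scaled by the
    modulating factor (1 - p)**gamma, which shrinks for easy examples."""
    return -((1.0 - p_true) ** gamma) * math.log(p_true)

easy = focal_loss(0.9)   # confident, correct -> heavily down-weighted
hard = focal_loss(0.3)   # uncertain -> keeps most of its loss
```

With gamma = 0 this reduces to ordinary cross-entropy; larger gamma shifts more of the training signal onto the hard instances.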
Among these methods, CoRE enhances LUKE's entity-level generalization ability the most and makes RE models focus more on the textual context for inference, resulting in better generalization under entity name replacements. The other methods lead to smaller improvements for LUKE, potentially because they cannot effectively capture the biased patterns between relations and entity names.
Related Work

Relation extraction is a sub-task of information extraction that aims to identify semantic relations between entities in natural language text (Zhang et al., 2017). It is an effective way to automatically acquire important knowledge and plays a vital role in Natural Language Processing (NLP). Relation extraction is a key component for building relation knowledge graphs, and it is of crucial significance to NLP applications such as structured search, sentiment analysis, question answering, and summarization (Huang and Wang, 2017).
Early research efforts (Nguyen and Grishman, 2015; Wang et al., 2016; Zhang et al., 2017) train RE models from scratch based on lexicon-level features. Recent RE work fine-tunes pretrained language models (PLMs; Devlin et al. 2019; Liu et al. 2019). For example, K-Adapter (Wang et al., 2020) fixes the parameters of the PLM and uses feature adapters to infuse factual and linguistic knowledge. Recent work focuses on utilizing entity information for RE (Zhou and Chen, 2021; Yamada et al., 2020), but this leaks superficial and spurious clues about the relations (Zhang et al., 2018). Despite the biases in existing RE models, scarce work has discussed the spurious correlation between entity mentions and relations that causes such biases. Our work builds an automated pipeline to generate natural instances with fewer shortcuts and large coverage at scale to reflect the serious effects of entity bias on RE models.
There is also work in other domains aiming to evaluate models' generalization to perturbed inputs. For example, Jia and Liang (2017) attack reading comprehension models by adding word sequences to the input. Gan and Ng (2019) and Iyyer et al. (2018) paraphrase the input to test models' oversensitivity. Jones et al. (2020) target adversarial typos. Si et al. (2021) propose a benchmark for reading comprehension with diverse types of test-time perturbation. These works focus on domains different from ours and do not consider the composition of RE examples. Little attention is drawn to the entities in the sentences, and many attacks (e.g., character swapping, word injection) may make the perturbed sentences invalid. To the best of our knowledge, this work is among the first to propose a straightforward, dedicated pipeline for generating natural adversarial examples for the RE task, one that takes into account the serious effects of entity bias in RE models.

Conclusion
Our contributions in this paper are three-fold. 1) Methodology-wise: we propose ENTRE, an end-to-end entity replacement method that reduces the shortcuts from entity names to ground-truth relations. 2) Resource-wise: by applying ENTRE's natural and counterfactual entity replacements to TACRED, we produce ENTRED, a benchmark for auditing the generalization of RE models under entity bias.
3) Evaluation-wise: our experimental results and analysis provide answers to four main research questions on the generalization of RE. We believe ENTRED and the entity replacement method ENTRE can benefit the community working to increase the RE models' generalization under entity bias.

Figure 1 :
Figure 1: The performance of state-of-the-art RE models drops substantially under entity replacements.

Figure 2 :
Figure 2: TACRED offers many shortcuts from entity names to ground-truth relations in the test set, where the model predicts the correct relation even when given only the entity names, with all textual context removed. As a result, it is not challenging enough to measure generalization under entity bias.

Figure 3 :
Figure 3: Two examples of incorrect entity annotations in TACRED.

Figure 4 :
Figure 4: The number of distinct subject entity names (red) is much lower than the number of instances (blue) in the test sets of the TACRED, TACREV, and Re-TACRED datasets. In other words, the diversity of entity names in these datasets' test sets is limited.

Figure 5 :
Figure 5: The original causal graph of RE models (left) together with its counterfactual alternative for the entity bias (right). The shading indicates the masking of the corresponding variables.
Figure 6: The ratio of instances with shortcuts (where the entity bias is the same as the ground-truth relation) in the TACRED test set.

Figure 7:
Figure 7: ENTRED significantly reduces the ratio of instances with shortcuts (where the entity bias is the same as the ground-truth relation) compared with TACRED.

Figure 8:
Figure 8: The diversity of entity names in ENTRED compared with the existing benchmarks.

Table 1 :
Statistics of the TACRED and ENTRED benchmarks.

Table 2 :
F1 scores (%) and performance drops of RE models on the test sets of TACRED and our ENTRED. The best results in each column are highlighted in bold. We additionally report the performance drop (%) relative to the performance on the original TACRED dataset.

Table 3 :
A case study for LUKE on the relation extraction benchmark TACRED and our ENTRED. Underlines and ::::