Separating Retention from Extraction in the Evaluation of End-to-end Relation Extraction

State-of-the-art NLP models can adopt shallow heuristics that limit their generalization capability (McCoy et al., 2019). Such heuristics include lexical overlap with the training set in Named-Entity Recognition (Taillé et al., 2020) and Event or Type heuristics in Relation Extraction (Rosenman et al., 2020). In the more realistic end-to-end RE setting, we can expect yet another heuristic: the mere retention of training relation triples. In this paper we propose two experiments confirming that retention of known facts is a key factor of performance on standard benchmarks. Furthermore, one experiment suggests that a pipeline model able to use intermediate type representations is less prone to over-rely on retention.


Introduction
Information Extraction (IE) aims at converting the information expressed in a text into a predefined structured format of knowledge. This global goal has been divided into subtasks that are easier to perform automatically and to evaluate. Hence, Named Entity Recognition (NER) and Relation Extraction (RE) are two key IE tasks among others such as Coreference Resolution (CR), Entity Linking or Event Extraction. Traditionally performed as a pipeline (Bach and Badaskar, 2007), these two tasks can be tackled jointly in order to model their interdependency, alleviate error propagation and obtain a more realistic evaluation setting (Roth and Yih, 2002; Li and Ji, 2014).
Following the general trend in Natural Language Processing (NLP), the recent quantitative improvements reported on Entity and Relation Extraction benchmarks are at least partly explained by the use of larger and larger pretrained Language Models (LMs) such as BERT (Devlin et al., 2019) to obtain contextual word representations. Concurrently, there is a growing realization that new evaluation protocols are necessary to better understand the strengths and shortcomings of the obtained neural network models, beyond a single holistic metric on a held-out test set (Ribeiro et al., 2020). Code for reproducing our evaluation settings is available at github.com/btaille/retex.
In particular, generalization to unseen data is a key factor in the evaluation of deep neural networks. It is all the more important in IE tasks that revolve around the extraction of mentions: small spans of words that are likely to occur in both the evaluation and training datasets. This lexical overlap has been shown to correlate with the performance of neural networks in NER (Augenstein et al., 2017; Taillé et al., 2020a). For pipeline RE, Rosenman et al. (2020) and Peng et al. (2020) expose shallow heuristics in neural models: relying too much on the type of the candidate arguments or on the presence of specific triggers in their contexts.
In end-to-end Relation Extraction, we can expect these NER and RE heuristics to combine. In this work, we argue that current evaluation benchmarks measure not only the desired ability to extract information contained in a text but also the capacity of the model to simply retain labeled (head, predicate, tail) triples during training. When the model is evaluated on a sentence expressing a relation seen during training, it is hard to disentangle which of these two behaviours is predominant: we can hypothesize that the model simply retrieves previously seen information, acting like a mere compressed form of knowledge base probed with a relevant query. Thus, testing on too many examples with seen triples can lead to overestimating the generalizability of a model.
Even without labeled data, LMs are able to learn some relations between words that can be probed with cloze sentences where an argument is masked (Petroni et al., 2019). This raises the additional question of lexical overlap with the orders of magnitude larger unlabeled LM pretraining corpora that will remain out of scope of this paper.
Models

PURE (Zhong and Chen, 2021) follows the pipeline approach. The NER model is a classical span-based model (Sohrab and Miwa, 2018). Special tokens corresponding to each predicted entity span are added and used as representations for Relation Classification. For a fairer comparison with the other models, we study the approximation model that only requires one pass in each encoder and limits prediction to the sentence level. However, it still requires finetuning and storing two pretrained LMs instead of a single one for the following models.
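The typed-marker mechanism can be sketched as follows. This is a minimal illustration: the marker strings and function name are assumptions, and PURE actually learns embeddings for dedicated special tokens rather than manipulating literal strings.

```python
def insert_typed_markers(tokens, head_span, head_type, tail_span, tail_type):
    """Wrap the two candidate argument spans with typed marker tokens,
    in the spirit of PURE's relation model. Spans are (start, end) token
    offsets, end exclusive, and are assumed not to overlap."""
    marks = [(head_span[0], f"<S:{head_type}>"), (head_span[1], f"</S:{head_type}>"),
             (tail_span[0], f"<O:{tail_type}>"), (tail_span[1], f"</O:{tail_type}>")]
    out = list(tokens)
    # Insert from right to left so that earlier offsets stay valid.
    for pos, marker in sorted(marks, key=lambda m: m[0], reverse=True):
        out.insert(pos, marker)
    return out
```

The relation classifier then reads the contextual representations at the marker positions, which carry the predicted entity types rather than the argument surface forms.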
SpERT (Eberts and Ulges, 2020) uses a similar span-based NER module. RE is performed based on the filtered representations of candidate arguments as well as a max-pooled representation of their middle context. While Entity Filtering is close to the pipeline approach, the NER and RE modules share a common entity representation and are trained jointly. We also study the ablation of the max-pooled context representation, which we denote Ent-SpERT.
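The Ent-SpERT ablation can be roughly illustrated on the relation classifier input. This is a sketch with plain lists standing in for embedding vectors; the names, the concatenation order and the width-embedding detail are assumptions, not SpERT's exact implementation.

```python
def spert_re_input(head_repr, middle_context, tail_repr, width_embs, use_context=True):
    """Concatenated input to a SpERT-style relation classifier: the two
    span representations, a max-pooled representation of the tokens
    between them (dropped in the Ent-SpERT ablation) and span width
    embeddings. Vectors are plain lists of floats."""
    if use_context and middle_context:
        # Element-wise max over the middle-context token vectors.
        context = [max(col) for col in zip(*middle_context)]
    else:
        context = []
    return head_repr + context + tail_repr + width_embs
```

With `use_context=False`, the classifier only sees the argument representations, which is exactly the condition under which retention of seen pairs becomes the dominant signal.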
Two are better than one (TABTO) (Wang and Lu, 2020) intertwines a sequence encoder and a table encoder in a Table Filling approach (Miwa and Sasaki, 2014). Contrary to the previous models, the pretrained LM is frozen and both its final hidden states and attention weights are used by the encoders. The prediction is finally performed by a Multi-Dimensional RNN (MD-RNN). Because it is not based on span-level predictions, this model cannot detect nested entities, e.g. on SciERC.
Partitioning by Lexical Overlap

Following Augenstein et al. (2017) and Taillé et al. (2020a), we partition the entity mentions in the test set based on lexical overlap with the training set. We distinguish Seen and Unseen mentions and also extend this partition to relations. We denote a relation as an Exact Match if the same (head, predicate, tail) triple appears in the train set; as a Partial Match if one of its arguments appears in the same position in a training relation of the same type; and as New otherwise. (More implementation details in Appendix A.)
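Under these definitions, the relation partition can be sketched as follows; function and variable names are illustrative, not taken from our released code.

```python
def partition_relation(triple, train_triples):
    """Classify a test (head, predicate, tail) triple by lexical overlap
    with the training set: 'exact', 'partial' or 'new'."""
    head, predicate, tail = triple
    if triple in train_triples:
        return "exact"
    for h, p, t in train_triples:
        # Partial match: same relation type and one argument
        # in the same position as in a training relation.
        if p == predicate and (h == head or t == tail):
            return "partial"
    return "new"
```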
We implement a naive Retention Heuristic that tags an entity mention or a relation appearing verbatim in the training set with its majority label. We report micro-averaged Precision, Recall and F1 scores for both NER and RE in Table 1.
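A minimal sketch of this heuristic for entity mentions is given below (names are illustrative; the relation variant works analogously on triples):

```python
from collections import Counter, defaultdict

def build_retention_tagger(train_mentions):
    """train_mentions: iterable of (surface_form, label) pairs.
    Returns a tagger that predicts the majority training label for a
    seen surface form and abstains (None) on unseen ones."""
    counts = defaultdict(Counter)
    for surface, label in train_mentions:
        counts[surface][label] += 1

    def tag(surface):
        if surface not in counts:
            return None
        return counts[surface].most_common(1)[0][0]

    return tag
```

Despite using no context at all, this baseline sets a non-trivial floor for benchmark scores, which is precisely the retention effect we want to isolate.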
An entity mention is considered correct if both its boundaries and its type have been correctly predicted. For RE, we report scores in the Boundaries and Strict settings (Bekoulis et al., 2018; Taillé et al., 2020b). In the Boundaries setting, a relation is correct if its type is correct and the boundaries of its arguments are correct, without considering the detection of their types. The Strict setting adds the requirement that the entity type of both arguments is correct.
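The two criteria can be sketched as a simplified per-relation check (illustrative names, not the official scorer):

```python
def relation_correct(pred, gold, setting="strict"):
    """pred and gold are (head_span, head_type, rel_type, tail_span, tail_type)
    tuples, with spans as (start, end) token offsets."""
    h_span, h_type, rel, t_span, t_type = pred
    gh_span, gh_type, g_rel, gt_span, gt_type = gold
    boundaries_ok = (h_span, rel, t_span) == (gh_span, g_rel, gt_span)
    if setting == "boundaries":
        return boundaries_ok
    # Strict additionally requires both argument entity types to be correct.
    return boundaries_ok and (h_type, t_type) == (gh_type, gt_type)
```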

Dataset Specificities
We first observe very different statistics of Mention and Relation Lexical Overlap in the three datasets, which can be explained by the singularities of their entities and relations. In CoNLL04, mentions are mainly Named Entities denoted with proper names, while in ACE05 the surface forms are very often common nouns or even pronouns, which explains the occurrence of training entity mentions such as "it", "which" or "people" in test examples. This also leads to a weaker entity label consistency (Fu et al., 2020a): "it" is labeled with every possible entity type and appears mostly unlabeled, whereas a mention such as "President Kennedy" is always labeled as a person in CoNLL04. Similarly, mentions in SciERC are common nouns which can be tagged with different labels, and they can also be nested. Both the poor label consistency and the nested nature of entities hurt the performance of the retention heuristic.
For RE, while SciERC has almost no exact overlap between test and train relations, ACE05 and CoNLL04 have similar levels of exact match. The larger proportion of partial match in ACE05 is explained by the pronouns that are more likely to co-occur in several instances. The difference in performance of the heuristic is also explained by a poor relation label consistency.

Lexical Overlap Bias
As expected, this first evaluation setting exposes, in end-to-end Relation Extraction, an important lexical overlap bias already discussed for NER. On every dataset and for every model, micro F1 scores are highest for Exact Match relations, then Partial Match, and finally totally unseen relations. This is a first confirmation that retention plays an important role in the measured overall performance of end-to-end RE models.

Model Comparisons
While we cannot evaluate TABTO on SciERC because it cannot extract nested entities, we notice different hierarchies of models on every dataset, suggesting that there is no one-size-fits-all best model, at least in current evaluation settings. The most obvious comparison is between SpERT and Ent-SpERT, where the explicit representation of context is ablated. This ablation results in a loss of RE performance, especially on partially matching or new relations for which the pairs of entity representations have not been seen. Ent-SpERT is particularly effective on Exact Matches on CoNLL04, suggesting its retention capability.
Other comparisons are more difficult, given the numerous variations between the very structure of each model as well as the training procedures. However, the PURE pipeline setting only seems more effective on ACE05, where its NER performance is significantly better, probably because learning separate NER and RE encoders enables each to capture more specific information for its task. Even then, TABTO yields better Boundaries performance and is only penalized in the Strict setting by entity type confusions. On the contrary, on CoNLL04, TABTO significantly outperforms its counterparts, especially on unseen relations. This indicates that it incorporates contextual information more effectively in this case, where relation and argument types are mapped bijectively.
On SciERC, performance of all models is already compromised at the NER level before the RE step, which makes further distinction between model performance even more difficult.

Swapping Relation Heads and Tails
A second experiment to validate that retention is used as a heuristic in models' predictions is to modify their input sentences in a controlled manner, similarly to what is proposed by Ribeiro et al. (2020). We propose a very focused experiment that consists in selecting asymmetric relations occurring between entities of the same type and swapping the head with the tail in the input. If the model predicts the original triple, then it over-relies on the retention heuristic, whereas finding the swapped triple is evidence of broader context incorporation. We show an example in Table 2.

Table 2: Example of an original and a swapped sentence with their ground truth relations.
Original: "John Wilkes Booth, who assassinated President Lincoln, was an actor." → (John Wilkes Booth, Kill, President Lincoln)
Swapped: "President Lincoln, who assassinated John Wilkes Booth, was an actor." → (President Lincoln, Kill, John Wilkes Booth)
Because of the requirements of this experiment, we have to limit ourselves to two relations in CoNLL04: "Kill" between people and "Located in" between locations. Indeed, CoNLL04 is the only dataset with a bijective mapping between the type of a relation and the types of its arguments, and its consistent proper noun mentions make the swaps mostly grammatically correct. For each relation type, we only consider sentences with exactly one instance of the corresponding relation and swap its arguments. We only consider this relation in the RE scores reported in Table 3. We use the strict RE score as well as revRE, which measures the extraction of the reverse relation, not expressed in the sentence.
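The swap itself can be sketched as a simple span exchange at the token level; this is a minimal sketch with illustrative names, assuming non-overlapping argument spans.

```python
def swap_arguments(tokens, head_span, tail_span):
    """Swap the head and tail mentions of a relation inside a tokenized
    sentence. Spans are (start, end) token offsets, end exclusive, and
    are assumed not to overlap."""
    (first_s, first_e), (second_s, second_e) = sorted([head_span, tail_span])
    return (tokens[:first_s] + tokens[second_s:second_e]
            + tokens[first_e:second_s] + tokens[first_s:first_e]
            + tokens[second_e:])
```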
For each relation, the hierarchy of models corresponds to the overall one on CoNLL04. Swapping arguments has a limited effect on NER, mostly for the "Located in" relation. However, it leads to a drop in RE for every model, and the revRE score indicates that SpERT and TABTO predict the reverse relation more often than the newly expressed one. This is further evidence of the retention heuristic in end-to-end models, although it might also partly be attributed to the pretrained language model. In particular, for the "Located in" relation, swapped heads and tails are not exactly equivalent since the former are mainly cities and the latter countries.
On the contrary, the PURE model is less prone to information retention, as shown by its revRE scores, which are significantly smaller than its standard RE scores on swapped sentences. Hence, it outperforms SpERT and TABTO on swapped sentences despite being the least effective on the original dataset. This important discrepancy in results can be explained by the different types of representations used by these models. The pipeline approach allows the use of argument type representations in the Relation Classifier, whereas most end-to-end models use lexical features in a shared entity representation used for both NER and RE.
These conclusions from quantitative results are validated qualitatively. We can observe that the four predominant patterns are intuitive behaviours on sentences with swapped relations: retention of the incorrect original triple, prediction of the correct swapped triple and prediction of none or both triples. We report some examples in Table 9 and Table 10 in the Appendix.

Related Work
Several works on the generalization of NER models mention lexical overlap with the training set as a key indicator of performance. Augenstein et al. (2017) separate mentions in the test set into seen and unseen during training and measure out-of-domain generalization in an extensive study of two CRF-based models and SENNA, which combines a Convolutional Neural Network with a CRF (Collobert et al., 2011). Fu et al. (2020a) measure the impact of properties such as lexical overlap, label consistency and entity length on the performance of state-of-the-art models. They model these properties as continuous scores associated to each mention and bucketized for evaluation. Lexical overlap has also been mentioned in Coreference Resolution (Moosavi and Strube, 2017), where coreferent mentions tend to co-occur in the test and train sets. In this line of work, the impact of lexical overlap is measured either by separating performance depending on the property of mentions (seen or unseen) or with out-of-domain evaluation on a test set from a different dataset with lower lexical overlap with the train set.
Another recently proposed method for fine-grained evaluation of NLP models beyond a single benchmark score is to modify the test sentences in a controlled manner. McCoy et al. (2019) expose lexical overlap as a shallow heuristic adopted by state-of-the-art Natural Language Inference models, in particular by swapping the subject and object of verbs in the hypothesis of some examples where the premise entails the hypothesis. While such a modification changes the label of these examples to non-entailment, all tested models show a spectacular drop of accuracy on them. Ribeiro et al. (2020) propose a broader set of test set modifications to individually test the robustness of NLP models to several patterns, such as the introduction of negation, swapping words with synonyms or changing tense.
In pipeline RE, where ground truth candidate arguments are given, models often use intermediate representations based on entity types that reduce lexical overlap issues. However, Rosenman et al. (2020) show that they still tend to adopt shallow heuristics based on the type of the arguments and the presence of triggers indicative of a relation. They propose hard cases with several mentions of the same types, on which Relation Classifiers struggle to connect the correct pair. Concurrently, Peng et al. (2020) confirm that RE benchmarks present shallow cues, such as the types of the candidate arguments, that can alone be used to infer the relation.
We propose to extend previous work on NER and RE to the more realistic end-to-end RE setting with two of the previously described approaches: 1) separating performance by lexical overlap of mentions or argument pairs and 2) modifying some CoNLL04 test examples by swapping relation heads and tails.

Conclusion
In this paper, we study three state-of-the-art end-to-end Relation Extraction models in order to highlight their tendency to retain seen relations. We confirm that retention of seen mentions and relations plays an important role in overall RE performance and can explain the relatively higher scores on CoNLL04 and ACE05 compared to SciERC. Furthermore, our experiment on swapping relation heads and tails tends to show that the intermediate manipulation of type representations instead of lexical features, enabled in the pipeline PURE model, makes it less prone to over-rely on retention.
While the limited extent of our swapping experiment is an obvious limitation of this work, it exposes shortcomings of both current benchmarks and models. It is an encouragement to propose new benchmarks designed to be easily modified in order to probe such lexical overlap heuristics. Contextual information could for example be contained in templates that would be filled with different (head, tail) pairs either seen or unseen during training.
Furthermore, pretrained Language Models can already capture relational information between phrases (Petroni et al., 2019) and further experiments could help distinguish their role in the retention behaviour of RE models.

A Implementation Details
For every model, we use the original code associated with the papers with the default best performing hyperparameters unless stated otherwise. We perform 5 runs for each model on each dataset on a single NVIDIA 2080Ti GPU. For CoNLL04 and ACE05, we train each model with both the cased and uncased versions of BERT-Base and only keep the best performing setting.
PURE (Zhong and Chen, 2021) 1 We use the approximation model and a context window of 0 to only use the current sentence for prediction and be able to compare with the other models. For ACE05, we use the standard bert-base-uncased LM but use the bert-base-cased version on CoNLL04, which results in a significant +2.4 absolute improvement in RE Strict micro F1 score.
SpERT (Eberts and Ulges, 2020) 2 We use the original implementation as is with bert-base-cased for both ACE05 and CoNLL04 since the uncased version is not beneficial, even on ACE05 where there are fewer proper nouns. For the Ent-SpERT ablation, we simply remove the max-pooled context representation from the final concatenation in the RE module. This modifies the RE classifier's input dimension from the original 2354 to 1586.
Two are better than one (TABTO) (Wang and Lu, 2020) 3 We use the original implementation with bert-base-uncased for both ACE05 and CoNLL04 since the cased version is not beneficial on CoNLL04.

B Datasets Statistics
We present general datasets statistics in Table 4. We also compute average values of some entity and relation attributes inspired by (Fu et al., 2020a) and reported in Table 5.
We report two of their entity attributes: entity length in number of tokens (eLen) and entity label consistency (eCon). Given a test entity mention, its label consistency is the number of its occurrences in the training set with the same type divided by its total number of occurrences. It is zero for unseen mentions. Because eCon reflects both the ambiguity of labels for seen entities and the proportion of unseen entities, we propose to introduce the eCon* score, which only averages label consistency over seen mentions, and eLex, the proportion of entities with lexical overlap with the train set. We introduce similar scores for relations. Relation label consistency (rCon) extends label consistency to triples. Argument type label consistency (aCon) considers the labels of every pair of mentions of corresponding types in the training set. Because all pairs of types are seen during training, we do not decompose aCon into aCon* and aLex. Argument length (aLen) is the sum of the lengths of the head and tail mentions. Argument distance (aDist) is the number of tokens between the head and the tail of a relation.

1 github.com/princeton-nlp/PURE
2 github.com/lavis-nlp/spert
3 github.com/LorrinWWW/two-are-better-than-one
We present a more complete report of overall Precision, Recall and F1 scores, which can be interpreted in light of these statistics, in Table 6.
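The entity-level attribute definitions above (eCon, eCon* and eLex) can be sketched as follows; function and variable names are illustrative, not taken from our released code.

```python
from collections import Counter, defaultdict

def entity_consistency_scores(train_mentions, test_mentions):
    """train_mentions and test_mentions are (surface_form, label) pairs.
    Returns (eCon, eCon*, eLex): label consistency averaged over all test
    mentions (zero for unseen ones), label consistency averaged over seen
    mentions only, and the proportion of test mentions seen in training."""
    counts = defaultdict(Counter)
    for surface, label in train_mentions:
        counts[surface][label] += 1

    all_cons, seen_cons = [], []
    for surface, label in test_mentions:
        if surface in counts:
            consistency = counts[surface][label] / sum(counts[surface].values())
            all_cons.append(consistency)
            seen_cons.append(consistency)
        else:
            all_cons.append(0.0)

    econ = sum(all_cons) / len(all_cons)
    econ_star = sum(seen_cons) / len(seen_cons) if seen_cons else 0.0
    elex = len(seen_cons) / len(all_cons)
    return econ, econ_star, elex
```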