A Frustratingly Easy Approach for Entity and Relation Extraction

End-to-end relation extraction aims to identify named entities and extract relations between them. Most recent work models these two subtasks jointly, either by casting them in one structured prediction framework, or performing multi-task learning through shared representations. In this work, we present a simple pipelined approach for entity and relation extraction, and establish the new state-of-the-art on standard benchmarks (ACE04, ACE05 and SciERC), obtaining a 1.7%-2.8% absolute improvement in relation F1 over previous joint models with the same pre-trained encoders. Our approach essentially builds on two independent encoders and merely uses the entity model to construct the input for the relation model. Through a series of careful examinations, we validate the importance of learning distinct contextual representations for entities and relations, fusing entity information early in the relation model, and incorporating global context. Finally, we also present an efficient approximation to our approach which requires only one pass of both entity and relation encoders at inference time, achieving an 8-16× speedup with a slight reduction in accuracy.


Introduction
Extracting entities and their relations from unstructured text is a fundamental problem in information extraction. This problem can be decomposed into two subtasks: named entity recognition (Sang and De Meulder, 2003; Ratinov and Roth, 2009) and relation extraction (Zelenko et al., 2002; Bunescu and Mooney, 2005). Early work employed a pipelined approach, training one model to extract entities (Florian et al., 2004, 2006) and another model to classify relations between them (Zhou et al., 2005; Kambhatla, 2004; Chan and Roth, 2011). More recently, however, end-to-end evaluations have been dominated by systems that model these two tasks jointly (Li and Ji, 2014; Miwa and Bansal, 2016; Katiyar and Cardie, 2017; Zhang et al., 2017a; Luan et al., 2018; Lin et al., 2020; Wang and Lu, 2020). There has been a long-held belief that joint models can better capture the interactions between entities and relations and help mitigate error propagation issues.
In this work, we re-examine this problem and present a simple approach which learns two encoders built on top of deep pre-trained language models (Devlin et al., 2019; Beltagy et al., 2019; Lan et al., 2020). The two models, which we refer to as the entity model and the relation model throughout the paper, are trained independently, and the relation model only relies on the entity model to provide input features. Our entity model builds on span-level representations and our relation model builds on contextual representations specific to a given pair of spans. Despite its simplicity, we find this pipelined approach to be extremely effective: using the same pre-trained encoders, our model outperforms all previous joint models on three standard benchmarks (ACE04, ACE05, and SciERC), advancing the previous state-of-the-art by 1.7%-2.8% absolute in relation F1.
To better understand the effectiveness of this approach, we carry out a series of careful analyses. We observe that, (1) the contextual representations for the entity and relation models essentially capture distinct information, so sharing their representations hurts performance; (2) it is crucial to fuse the entity information (both boundary and type) at the input layer of the relation model; (3) leveraging cross-sentence information is useful in both tasks. Hence, we expect that this simple model will serve as a very strong baseline in end-to-end relation extraction and make us rethink the value of joint modeling of entities and relations.
Finally, one possible shortcoming of our approach is that we need to run our relation model once for every pair of entities. To alleviate this issue, we present a novel and efficient alternative by approximating and batching the computations for different groups of entity pairs at inference time. This approximation achieves an 8-16× speedup with only a slight reduction in accuracy (e.g., a 1.0% F1 drop on ACE05), which makes our model fast and accurate to use in practice. Our final system is called PURE (the Princeton University Relation Extraction system) and we make our code and models publicly available for the research community.

Figure 1: An example from the SciERC dataset (Luan et al., 2018). Given an input sentence MORPA is a fully implemented parser for a text-to-speech system, an end-to-end relation extraction system is expected to extract that MORPA and PARSER are entities of type METHOD, TEXT-TO-SPEECH is a TASK, as well as that MORPA is a hyponym of PARSER and MORPA is used for TEXT-TO-SPEECH.

We summarize our contributions as follows:
• We present a simple and effective approach for end-to-end relation extraction, which learns two independent encoders for entity recognition and relation extraction. Our model establishes the new state-of-the-art on three standard benchmarks and surpasses all previous joint models.
• We conduct careful analyses to understand why our approach performs so well and how different factors impact the final performance. We conclude that it is more effective to learn distinct contextual representations for entities and relations than to learn them jointly.
• To speed up the inference time of our model, we also propose a novel and efficient approximation, which achieves a large runtime improvement with only a small accuracy drop.

Related Work
Traditionally, extracting relations between entities in text has been studied as two separate tasks: named entity recognition and relation extraction. In the last several years, there has been a surge of interest in developing models for the joint extraction of entities and relations (Li and Ji, 2014; Miwa and Sasaki, 2014; Miwa and Bansal, 2016). We group existing joint models into two categories: structured prediction and multi-task learning.

Structured prediction Structured prediction approaches cast the two tasks into one unified framework, which can be formulated in various ways. Li and Ji (2014) propose an action-based system which identifies new entities as well as links to previous entities; Zhang et al. (2017a) and Wang and Lu (2020) adopt the table-filling approach proposed in Miwa and Sasaki (2014); Katiyar and Cardie (2017) and Zheng et al. (2017) employ sequence tagging-based approaches; Fu et al. (2019) and others propose graph-based approaches to jointly predict entity and relation types; and still other work converts the task into a multi-turn question answering problem. All of these approaches need to tackle a global optimization problem and perform joint decoding at inference time, using beam search or reinforcement learning.
Multi-task learning This family of models essentially builds two separate models for entity recognition and relation extraction and optimizes them together through parameter sharing. Miwa and Bansal (2016) propose to use a sequence tagging model for entity prediction and a tree-based LSTM model for relation extraction. The two models share one LSTM layer for contextualized word representations and they find sharing parameters improves performance (slightly) for both models. The approach of Bekoulis et al. (2018) is similar except that they model relation classification as a multi-label head selection problem. Note that these approaches still perform pipelined decoding: entities are first extracted and the relation model is applied on the predicted entities.
The closest work to ours is DYGIE and DYGIE++, which build on recent span-based models for coreference resolution (Lee et al., 2017) and semantic role labeling. The key idea of their approaches is to learn shared span representations between the two tasks and to update span representations through dynamic graph propagation layers. More recent work (Lin et al., 2020) further extends DYGIE++ by incorporating global features based on cross-subtask and cross-instance constraints. Our approach is much simpler; we detail the differences in Section 3.2 and explain why our model performs better.

Method
In this section, we first formally define the problem of end-to-end relation extraction in Section 3.1 and then detail our approach in Section 3.2. Finally, we present our approximation solution in Section 3.3, which considerably improves the efficiency of our approach during inference.

Problem Definition
The input of the problem is a sentence X consisting of n tokens x_1, x_2, ..., x_n. Let S = {s_1, s_2, ..., s_m} be all the possible spans in X of up to length L, and let START(i) and END(i) denote the start and end indices of s_i. Optionally, we can incorporate cross-sentence context to build better contextual representations (Section 3.2). The problem can be decomposed into two sub-tasks:

Named entity recognition Let E denote a set of pre-defined entity types. The named entity recognition task is, for each span s_i ∈ S, to predict an entity type y_e(s_i) ∈ E, or y_e(s_i) = ∅, representing that span s_i is not an entity. The output of the task is Y_e = {(s_i, e) : s_i ∈ S, e ∈ E}.
Relation extraction Let R denote a set of pre-defined relation types. The task is, for every pair of spans s_i ∈ S, s_j ∈ S, to predict a relation type y_r(s_i, s_j) ∈ R, or y_r(s_i, s_j) = ∅ if there is no relation between them. The output of the task is Y_r = {(s_i, s_j, r) : s_i, s_j ∈ S, r ∈ R}.
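The candidate set S above can be made concrete with a few lines of code. The sketch below is illustrative (the function name and list-based representation are our own, not from the paper): it enumerates every span of width up to L and shows that relation extraction then amounts to classifying ordered pairs of these spans.

```python
def enumerate_spans(n_tokens, max_len):
    """Return all candidate spans (start, end), inclusive indices,
    of width up to max_len -- the set S in Section 3.1."""
    spans = []
    for start in range(n_tokens):
        for end in range(start, min(start + max_len, n_tokens)):
            spans.append((start, end))
    return spans

# For an 11-token sentence with L = 8, every span of width <= 8 is a
# candidate for entity typing; relation extraction is a classification
# over ordered pairs of these candidate spans.
spans = enumerate_spans(11, 8)
pairs = [(si, sj) for si in spans for sj in spans if si != sj]
```

Note that the number of candidate spans grows linearly in sentence length (roughly n·L), while the number of span pairs grows quadratically, which foreshadows the efficiency concern addressed in Section 3.3.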

Our Approach
As shown in Figure 1, our approach consists of an entity model and a relation model. The entity model first takes the input sentence and predicts an entity type (or ∅) for each candidate span. We then process every pair of candidate entities independently in the relation model by inserting extra marker tokens to highlight the subject and object and their types. We will detail each component below, and finally summarize the differences between our approach and DYGIE++.
Entity model Our entity model is a standard span-based model following prior work (Lee et al., 2017; Luan et al., 2018). We first use a pre-trained language model (e.g., BERT) to obtain a contextualized representation x_t for each input token x_t. Given a span s_i ∈ S, its span representation is defined as

h_e(s_i) = [x_START(i); x_END(i); φ(s_i)],

where φ(s_i) ∈ R^{d_F} represents the learned embedding of the span-width feature. The span representation h_e(s_i) is then fed into a feedforward network to predict a probability distribution P_e(e | s_i) over the entity types e ∈ E ∪ {∅}.
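The span representation is just a concatenation of three vectors. A minimal sketch, using plain Python lists in place of real encoder outputs (the `width_emb` table is an illustrative stand-in for the learned φ(s_i)):

```python
def span_representation(token_reprs, start, end, width_emb):
    """h_e(s_i) = [x_START(i); x_END(i); phi(s_i)]: the concatenation of
    the start-token vector, the end-token vector, and a learned
    embedding of the span width."""
    width = end - start + 1
    return token_reprs[start] + token_reprs[end] + width_emb[width]

# Toy inputs: 2-dim "contextual" vectors and 1-dim width embeddings.
token_reprs = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
width_emb = {1: [0.5], 2: [0.7], 3: [0.9]}
h = span_representation(token_reprs, 0, 1, width_emb)
```

In the actual model these would be d-dimensional tensors from a pre-trained encoder, but the structure of the feature is exactly this concatenation.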
Relation model The relation model takes a pair of spans s_i, s_j (a subject and an object) as input and predicts a relation type or ∅. Previous approaches (e.g., Luan et al., 2018) re-use the span representations h_e(s_i), h_e(s_j) to predict the relationship between s_i and s_j. We hypothesize that these representations only capture contextual information around each individual entity and might fail to capture the dependencies between the pair of spans. We also argue that sharing contextual representations between different pairs of spans may be suboptimal. For instance, the words is a in Figure 1 are crucial in understanding the relationship between MORPA and PARSER, but not that between MORPA and TEXT-TO-SPEECH.
Our relation model instead processes each pair of spans independently and inserts typed markers at the input layer to highlight the subject and object and their types. Specifically, given an input sentence X and a pair of subject-object spans s_i, s_j with types e_i, e_j ∈ E ∪ {∅} respectively, we define text markers ⟨S:e_i⟩, ⟨/S:e_i⟩, ⟨O:e_j⟩, and ⟨/O:e_j⟩, and insert them into the input sentence before and after the subject and object spans (Figure 1(b)). Let X̂ denote this modified sequence with text markers inserted. We apply a second pre-trained encoder on X̂ and denote its output representations by x̂_t. We then concatenate the output representations at the two start positions to obtain the span-pair representation

h_r(s_i, s_j) = [x̂_START(i); x̂_START(j)],

where START(i) and START(j) are the indices of ⟨S:e_i⟩ and ⟨O:e_j⟩ in X̂. Finally, h_r(s_i, s_j) is fed into a feedforward network to predict a probability distribution P_r(r | s_i, s_j) over the relation types r ∈ R ∪ {∅}.
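The marker-insertion step can be sketched as a small tokenizer-level transformation. The function below is a simplified illustration (marker strings and the function name are our own; real implementations would also map markers to reserved vocabulary ids). Spans are inclusive (start, end) token indices:

```python
def insert_typed_markers(tokens, subj, obj, subj_type, obj_type):
    """Return a copy of `tokens` with <S:type>...</S:type> around the
    subject span and <O:type>...</O:type> around the object span.
    Assumes the two spans do not overlap and are not adjacent."""
    inserts = [
        (subj[0], f"<S:{subj_type}>"), (subj[1] + 1, f"</S:{subj_type}>"),
        (obj[0], f"<O:{obj_type}>"), (obj[1] + 1, f"</O:{obj_type}>"),
    ]
    out = list(tokens)
    # Insert from the rightmost position first so earlier indices stay valid.
    for pos, marker in sorted(inserts, key=lambda x: -x[0]):
        out.insert(pos, marker)
    return out
```

Because markers are re-inserted per span pair, the encoder sees a different input sequence for each candidate pair, which is exactly why the resulting contextual representations are pair-specific.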
The idea of using additional markers to highlight the subject and object is not entirely new, as it has been studied recently in relation classification (Zhang et al., 2019; Soares et al., 2019; Peters et al., 2019). However, most relation classification tasks (e.g., TACRED (Zhang et al., 2017b)) focus on a single given pair of subject and object in an input sentence, and the effectiveness of markers has not been evaluated in the end-to-end setting, in which we need to classify the relationships between multiple entity mentions. We observe a large improvement in our experiments (Section 5.1), and this strengthens our hypothesis that modeling the relationships between different entity pairs in one sentence requires different contextual representations. Furthermore, Zhang et al. (2019) and Soares et al. (2019) only consider untyped markers (e.g., ⟨S⟩, ⟨/S⟩), and previous end-to-end models only inject entity type information into the relation model through auxiliary losses. We find that injecting type information at the input layer is very helpful in distinguishing entity types (for example, whether "Disney" refers to a person or an organization) before trying to understand the relations.
Cross-sentence context Cross-sentence information can be used to help predict entity types and relations, especially for pronominal mentions. Wadden et al. (2019) employ a propagation mechanism to incorporate cross-sentence context and also add a 3-sentence context window, which is shown to improve performance. We also evaluate the importance of leveraging cross-sentence context in our approach. As we expect pre-trained language models to be able to capture long-range dependencies, we simply incorporate cross-sentence context by extending the input to a fixed window size W for both the entity and relation models. Specifically, given an input sentence with n words, we augment it with (W − n)/2 words from the left context and (W − n)/2 words from the right context.
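The window extension above can be sketched as simple index arithmetic over a flat document of tokens. This is our own minimal sketch (the paper does not specify how document boundaries are handled; here we simply clamp at the edges, so sentences near a boundary receive less context):

```python
def extend_context(doc, sent_start, sent_end, window):
    """Extend the sentence doc[sent_start:sent_end] toward a total of
    `window` tokens, taking roughly (window - n) / 2 tokens from each
    side and clamping at the document boundaries."""
    n = sent_end - sent_start
    extra = max(window - n, 0)
    left = max(sent_start - extra // 2, 0)
    right = min(sent_end + (extra - extra // 2), len(doc))
    return doc[left:right]
```

With W = 300 for the entity model and W = 100 for the relation model (the defaults in Section 4), a typical 20-30 word sentence ends up surrounded by several neighboring sentences on each side.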
Training & inference For both the entity model and the relation model, we fine-tune the pre-trained language models using task-specific cross-entropy losses:

L_e = −Σ_{s_i ∈ S} log P_e(e*_i | s_i),
L_r = −Σ_{s_i, s_j ∈ S_G} log P_r(r*_{i,j} | s_i, s_j),

where e*_i represents the gold entity type of s_i and r*_{i,j} represents the gold relation type of the span pair s_i, s_j in the training data. For training the relation model, we only consider the gold entities S_G ⊂ S in the training set and use the gold entity labels as the input of the relation model. We considered training on predicted entities as well as on all spans S (with pruning), but neither led to meaningful improvements over this simple pipelined training (see more discussion in Section 5.3). During inference, we first predict the entities by taking y_e(s_i) = arg max_{e ∈ E ∪ {∅}} P_e(e | s_i). Denoting S_pred = {s_i : y_e(s_i) ≠ ∅}, we enumerate all span pairs s_i, s_j ∈ S_pred and use y_e(s_i), y_e(s_j) to construct the input for the relation model P_r(r | s_i, s_j).
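The inference procedure above can be summarized as a short pipeline. The sketch below is schematic: `entity_model` and `relation_model` are hypothetical callables standing in for the two fine-tuned encoders, each returning a label or `None` for the null class ∅.

```python
NULL = None  # stands in for the null label (no entity / no relation)

def predict(sentence, spans, entity_model, relation_model):
    """Pipelined inference: type every span, keep non-null spans as
    predicted entities, then classify every ordered pair of them."""
    typed = {s: entity_model(sentence, s) for s in spans}
    pred = [s for s in spans if typed[s] is not NULL]
    relations = []
    for si in pred:
        for sj in pred:
            if si == sj:
                continue
            r = relation_model(sentence, si, typed[si], sj, typed[sj])
            if r is not NULL:
                relations.append((si, sj, r))
    return [(s, typed[s]) for s in pred], relations
```

Note the key design point: the predicted entity types flow directly into the relation model's input (as typed markers), rather than being shared through a joint representation.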
Differences from DYGIE++ Our approach differs from DYGIE++ in the following ways: (1) we use separate encoders for the entity and relation models, without any multi-task learning, and the predicted entity types are used directly to construct the input for the relation model; (2) the contextual representations in the relation model are specific to each pair of spans, thanks to the text markers; (3) we only incorporate cross-sentence information by extending the input with additional context (as they do), and we do not employ any graph propagation layers or beam search. As a result, our model is much simpler. As we will show in the experiments (Section 4), it also achieves large gains on all the benchmarks, using the same pre-trained encoders.

Efficient Batch Computations
One possible shortcoming of our approach is that we need to run our relation model once for every pair of entities. To alleviate this issue, we propose a novel and efficient alternative to our relation model.
The key problem is that we would like to re-use computations for different pairs of spans in the same sentence. This is impossible in our original model because we must insert the entity markers for each pair of spans independently. We therefore propose an approximation model by making two major changes to the original relation model. First, instead of directly inserting entity markers into the original sentence, we tie the position embeddings of the markers to the start and end tokens of the corresponding spans:

P(⟨S:e_i⟩), P(⟨/S:e_i⟩) := P(x_START(i)), P(x_END(i)),
P(⟨O:e_j⟩), P(⟨/O:e_j⟩) := P(x_START(j)), P(x_END(j)),

where P(·) denotes the position id of a token. In the example shown in Figure 1, if we want to classify the relationship between MORPA and PARSER, the first entity marker ⟨S:METHOD⟩ will share its position embedding with the token MOR. By doing this, the position embeddings of the original tokens are unchanged.
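The position-tying trick can be illustrated concretely. The sketch below (our own simplification, with markers appended after the text tokens as in Figure 1(c)) builds the position-id sequence for a sentence plus a batch of span pairs: each pair contributes four markers whose position ids are copied from the start/end tokens of the spans they mark, so the original token positions stay untouched.

```python
def tied_position_ids(n_tokens, span_pairs):
    """Position ids for [tokens..., markers...]; each pair
    ((s_start, s_end), (o_start, o_end)) contributes 4 markers,
    <S:e>, </S:e>, <O:e>, </O:e>, in that order."""
    pos = list(range(n_tokens))
    for (s_start, s_end), (o_start, o_end) in span_pairs:
        pos += [s_start, s_end, o_start, o_end]
    return pos
```

Since position embeddings are the only way a Transformer distinguishes token order, sharing position ids makes each marker "look like" it sits at its span boundary, even though it is physically appended at the end of the sequence.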
Second, we add a constraint to the attention layers: text tokens only attend to text tokens and do not attend to marker tokens, while an entity marker token can attend to all the text tokens and to the 4 marker tokens associated with the same span pair. These two modifications allow us to re-use the computations of all text tokens, because the representations of text tokens are independent of the entity marker tokens. Thus, we can batch multiple pairs of spans from the same sentence in one run of the relation model. In practice, we add all marker tokens to the end of the sentence to form an input that batches a set of span pairs (Figure 1(c)). This leads to a large speedup at inference time with only a small drop in performance (Section 4.3).
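The attention constraint amounts to a block-structured attention mask. A minimal sketch, assuming the layout from the previous paragraph (text tokens first, then 4 markers per span pair); `mask[i][j] == True` means position i may attend to position j:

```python
def build_attention_mask(n_tokens, n_pairs):
    """Text tokens attend only to text tokens; each marker attends to
    all text tokens plus the 4 markers of its own span pair."""
    total = n_tokens + 4 * n_pairs
    mask = [[False] * total for _ in range(total)]
    for i in range(n_tokens):               # text -> text only
        for j in range(n_tokens):
            mask[i][j] = True
    for p in range(n_pairs):                # markers -> text + own 4 markers
        group = range(n_tokens + 4 * p, n_tokens + 4 * (p + 1))
        for i in group:
            for j in range(n_tokens):
                mask[i][j] = True
            for j in group:
                mask[i][j] = True
    return mask
```

Because no text token ever attends to a marker, the text-token representations are identical regardless of how many pairs are batched, which is precisely what makes the computation reusable.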

Setup
Datasets We evaluate our approach on three popular end-to-end relation extraction datasets: ACE05, ACE04, and SciERC (Luan et al., 2018). Table 2 shows the data statistics of each dataset. The ACE05 and ACE04 datasets are collected from a variety of domains, such as newswire and online forums. The SciERC dataset is collected from 500 AI paper abstracts and defines scientific terms and relations specifically for scientific knowledge graph construction. We follow previous work and use the same preprocessing procedure and splits for all datasets. See Appendix A for more details.
Evaluation metrics We follow the standard evaluation protocol and use micro F1 measure as the evaluation metric. For named entity recognition, a predicted entity is considered as a correct prediction if its span boundaries and the predicted entity type are both correct. For relation extraction, we adopt two evaluation metrics: (1) boundaries evaluation (Rel): a predicted relation is considered as a correct prediction if the boundaries of two spans are correct and the predicted relation type is correct; (2) strict evaluation (Rel+): in addition to what is required in the boundaries evaluation, predicted entity types also must be correct. More discussion of the evaluation settings can be found in Bekoulis et al. (2018); Taillé et al. (2020).
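The difference between the boundaries (Rel) and strict (Rel+) evaluations can be made precise with a few lines of code. The sketch below is our own illustration (the tuple layout is an assumption): predictions are tuples of (subject span, subject type, object span, object type, relation), and the boundaries metric simply drops the entity types before comparing.

```python
def micro_f1(gold, pred):
    """Micro F1 over two sets of prediction tuples."""
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    p, r = tp / len(pred), tp / len(gold)
    return 2 * p * r / (p + r)

def rel_f1(gold, pred, strict):
    """Rel (strict=False): span boundaries + relation type must match.
    Rel+ (strict=True): entity types must also match."""
    key = (lambda t: t) if strict else (lambda t: (t[0], t[2], t[4]))
    return micro_f1({key(t) for t in gold}, {key(t) for t in pred})
```

A prediction with correct spans and relation type but a wrong entity type thus counts as correct under Rel and incorrect under Rel+, which is why Rel+ scores are always lower or equal.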
Implementation details We use bert-base-uncased (Devlin et al., 2019) and albert-xxlarge-v1 (Lan et al., 2020) as the base encoders for ACE05 (catalog.ldc.upenn.edu/LDC2006T06) and ACE04 (catalog.ldc.upenn.edu/LDC2005T09), for a fair comparison with previous work and an investigation of small vs. large pre-trained models. As detailed in Table 1, some previous work used BERT-large models; we are not able to do a comprehensive study of all pre-trained models, and our BERT-base results are generally higher than most published results using larger models. We also use scibert-scivocab-uncased (Beltagy et al., 2019) as the base encoder for SciERC, as this in-domain pre-trained model is shown to be more effective than BERT. In our default setting using cross-sentence context, we use a context window size of W = 300 for the entity model and W = 100 for the relation model; the effect of different context sizes is studied in Section 5.4. We consider spans of up to L = 8 words. For all experiments, we report F1 scores averaged over 5 runs. More implementation details can be found in Appendix B.

Table 1: Test F1 scores on ACE04, ACE05, and SciERC. We evaluate our approach in two settings, single-sentence and cross-sentence, depending on whether cross-sentence context is used. ♣: these models leverage cross-sentence information. †: these models are trained with additional data (e.g., coreference). Encoders used in different models: L = LSTM, L+E = LSTM + ELMo, Bb = BERT-base, Bl = BERT-large, SciB = SciBERT (same size as BERT-base), ALB = ALBERT-xxlarge-v1. Rel denotes the boundaries evaluation (the entity boundaries must be correct) and Rel+ denotes the strict evaluation (both the entity boundaries and types must be correct).

Main Results
Table 1 compares our approach PURE to all the previous results. We report the F1 scores in both single-sentence and cross-sentence settings. As is shown, our single-sentence models achieve strong performance and incorporating cross-sentence con-text further improves the results considerably. Our BERT-base (or SciBERT) models achieve similar or better results compared to all the previous work including models built on top of larger pre-trained LMs, and our results are further improved by using a larger encoder ALBERT.
For entity recognition, our best model achieves an absolute F1 improvement of +1.4%, +1.7%, and +1.4% on ACE05, ACE04, and SciERC respectively. This shows that cross-sentence information is useful for the entity model and that pre-trained Transformer encoders are able to capture long-range dependencies from a large context. For relation extraction, our approach outperforms the best previous methods by an absolute F1 of +1.8%, +2.8%, and +1.7% on ACE05, ACE04, and SciERC respectively. We also obtain a 4.3% higher relation F1 on ACE05 than DYGIE++ using the same BERT-base pre-trained model. Compared to the previous best approaches using either global features (Lin et al., 2020) or complex neural models (e.g., MT-RNNs) (Wang and Lu, 2020), our approach is much simpler and achieves large improvements on all the datasets. Such improvements demonstrate the effectiveness of learning distinct representations for entities and for relations of different entity pairs, as well as of fusing entity information early in the relation model. We also notice that, compared to the previous state-of-the-art model (Wang and Lu, 2020) based on ALBERT, our model achieves a similar entity F1 (89.5 vs 89.7) but a substantially better relation F1 (67.6 vs 69.0) without using cross-sentence context. This clearly demonstrates the superiority of our relation model. Finally, we also compare our model to a joint model (similar to DYGIE++) at different data sizes to test the generality of our results. As shown in Appendix C, our findings are robust to data size.

Table 3: Comparison of our full relation model and the approximation model in both accuracy and speed. Accuracy is measured as relation F1 (boundaries) on the test set. These results are obtained using BERT-base for ACE05 and SciBERT for SciERC, in both single-sentence and cross-sentence settings. Speed is measured on a single NVIDIA GeForce 2080 Ti GPU with a batch size of 32.

Batch Computations and Speedup
In Section 3.3, we proposed an efficient approximation for the relation model, which enables us to re-use the computations of text tokens and batch multiple span pairs from one input sentence. We evaluate this approximation model on ACE05 and SciERC. Table 3 shows the relation F1 scores and the inference speed of the full relation model and the approximation model. On both datasets, our approximation model significantly improves the efficiency of inference: for example, we obtain an 11.9× speedup on ACE05 and an 8.7× speedup on SciERC in the single-sentence setting. By re-using a large part of the computations, we are able to make predictions on the full ACE05 test set (2k sentences) in less than 10 seconds on a single GPU. On the other hand, the approximation only leads to a small performance drop: relation F1 decreases by only 1.0% and 1.2% on ACE05 and SciERC respectively in the single-sentence setting. Note that we only apply this batch computation trick at inference time, because we observed that training with batch computation leads to slightly (and consistently) worse results; we hypothesize that this is due to the effect of the increased batch sizes. We still modify the position embeddings and attention masks during training (without batching the instances). Considering the accuracy and efficiency of this approximation model, we expect it to be very effective to use in practice.

Analysis
Despite its simple design and training paradigm, we have shown that our approach outperforms all previous joint models. In this section, we aim to take a deeper look and understand what contributes to its final performance.

Importance of Typed Text Markers
Our key observation is that it is crucial to build different contextual representations for different pairs of spans, and that an early fusion of entity type information can further improve performance. To validate this, we experiment with the following variants on both ACE05 and SciERC:

TEXT: We use the span representations defined in the entity model (Section 3.2) and concatenate the hidden representations of the subject and the object, as well as their element-wise multiplication: [h_e(s_i), h_e(s_j), h_e(s_i) ⊙ h_e(s_j)]. This is similar to the relation model in Luan et al. (2018).

TEXTETYPE:
We concatenate the span-pair representations from TEXT with entity type embeddings ψ(e i ), ψ(e j ) ∈ R d E (d E = 150).

MARKERS:
We use untyped entity markers (⟨S⟩, ⟨/S⟩, ⟨O⟩, ⟨/O⟩) at the input layer and concatenate the representations of the two spans' starting points.

MARKERSETYPE:
We concatenate the span-pair representations from MARKERS with entity type embeddings ψ(e i ), ψ(e j ) ∈ R d E (d E = 150).

MARKERSELOSS:
We also consider a variant which uses untyped markers but adds another FFNN to predict the entity types of the subject and object through auxiliary losses. This is similar to how entity information is used in multi-task learning.
TYPEDMARKERS: This is our final model described in Section 3.2, with typed entity markers.

Table 4 summarizes the results of all the variants, using either gold entities or entities predicted by the entity model. As is shown, the choice of input representation makes a clear difference, and the variants using marker tokens are significantly better than standard text representations; this suggests the importance of learning different representations with respect to different pairs of spans. Compared to TEXT, TYPEDMARKERS improves the F1 scores dramatically, by +5.0% and +7.4% absolute, when gold entities are given. With predicted entities, the improvement is reduced as expected but remains large. Finally, entity type information is useful for relation performance, and an early fusion of entity information is particularly effective (TYPEDMARKERS vs MARKERSETYPE and MARKERSELOSS). We also find that MARKERSETYPE performs even better than MARKERSELOSS, which suggests that using entity types directly as features is better than using them to provide training signals through auxiliary losses.

Table 4: Relation F1 (boundaries) on the development set of ACE05 and SciERC with different input features. e2e: the entities are predicted by our entity model; gold: the gold entities are given. The results are obtained using BERT-base with single-sentence context for ACE05 and SciBERT with cross-sentence context for SciERC. For both ACE05 and SciERC, we use the same entity models with cross-sentence context to compute the e2e scores with different input features.

Modeling Entity-Relation Interactions
One main argument for joint models is that modeling the interactions between the two tasks can benefit each task. In this section, we aim to validate whether this is the case in our approach. We first study whether sharing the two representation encoders can improve performance. We train the entity and relation models together by jointly optimizing L_e + L_r (Table 5). We find that simply sharing the encoders hurts both entity and relation F1. We think this is because the two tasks have different input formats and require different features for predicting entity types and relations, so using separate encoders indeed learns better task-specific features. We also explore whether relation information can improve entity performance. To do so, we add an auxiliary loss to our entity model, which concatenates the two span representations as well as their element-wise multiplication (see the TEXT variant in Section 5.1) and predicts the relation type between the two spans (r ∈ R or ∅). Through joint training with this auxiliary relation loss, we observe a negligible improvement (< 0.1%) in averaged entity F1 over 5 runs on the ACE05 development set. To summarize: (1) entity information is clearly important for predicting relations (Section 5.1), but we do not find that relation information improves our entity model substantially; (2) simply sharing the encoders does not benefit our approach.

Mitigating Error Propagation
A well-known drawback of pipelined training is the error propagation issue. In our final model, we use gold entities (and their types) to train the relation model but predicted entities during inference, and this may lead to a discrepancy between training and testing. In the following, we describe several attempts we made to address this issue.
We first study whether using predicted entities, instead of gold entities, during training can mitigate this issue. We adopt a 10-way jackknifing method, a standard technique in many NLP tasks such as dependency parsing (Agić and Schluter, 2017). Specifically, we divide the data into 10 folds and predict the entities in the k-th fold using an entity model trained on the remainder. As shown in Table 6, we find that, surprisingly, this jackknifing strategy hurts the final relation performance. We hypothesize that this is because it introduces additional noise during training. Second, we consider using more pairs of spans for the relation model at both training and testing time. The main motivation is that in the current pipelined approach, if a gold entity is missed by the entity model during inference, the relation model cannot predict any relations involving that entity. Following the beam search strategy used in previous work, we consider the λn top spans scored by the entity model (λ = 0.4, where n is the sentence length). We explored several different strategies for encoding the top-scoring spans for the relation model: (1) typed markers: the same as our main model, except that we now have markers such as ⟨S:∅⟩ and ⟨/S:∅⟩ as input tokens; (2) untyped markers: in this case, the relation model is unaware of whether a span is an entity or not; (3) untyped markers trained with an auxiliary entity loss (e ∈ E or ∅). As Table 6 shows, none of these changes led to significant improvements, and using untyped markers is especially bad because the relation model struggles to identify whether a span is an entity.
In sum, we do not find that any of these attempts significantly improved performance, and our simple pipelined training turns out to be a surprisingly effective strategy. We do not argue that the error propagation issue does not exist or cannot be solved; rather, better solutions will be needed to address it.

Effect of Cross-sentence Context
In Table 1, we demonstrated the improvements from using cross-sentence context on both entity and relation performance. We explore the effect of different context sizes W in Figure 2. We find that using cross-sentence context clearly improves both entity and relation F1. However, relation performance does not further increase from W = 100 to W = 300. In our final models, we use W = 300 for the entity model and W = 100 for the relation model.

Conclusion
In this paper, we present a simple and effective approach for end-to-end relation extraction. Our model learns two independent encoders for entity recognition and relation extraction, and our experiments show that it outperforms the previous state-of-the-art on three standard benchmarks considerably. We conduct extensive analyses to understand the superior performance of our approach and validate the importance of learning distinct contextual representations for entities and relations and of using entity information as input features for the relation model. We also propose an efficient approximation, obtaining a large speedup at inference time with a small reduction in accuracy. We hope that this simple model will serve as a very strong baseline and make us rethink the value of joint training in end-to-end relation extraction.

A Datasets
We use ACE04, ACE05, and SciERC datasets in our experiments. Table 2 shows the data statistics of each dataset.
The ACE04 and ACE05 datasets are collected from a variety of domains, such as newswire and online forums. We follow prior preprocessing steps (using the script provided by Luan et al.) and split ACE04 into 5 folds and ACE05 into train, development, and test sets.
The SciERC dataset is collected from 12 AI conference/workshop proceedings in four AI communities (Luan et al., 2018). SciERC includes annotations for scientific entities, their relations, and coreference clusters; we ignore the coreference annotations in our experiments. We use the processed dataset downloaded from the project website of Luan et al. (2018).

B Implementation Details
We implement our models based on HuggingFace's Transformers library (Wolf et al., 2019). For the entity model, we follow prior work and set the span-width embedding size to d_F = 150, and use a 2-layer FFNN with 150 hidden units and ReLU activations to predict the probability distribution of entity types: P_e(e | s_i) = softmax(W_e FFNN(h_e(s_i))).
For the relation model, we use a linear classifier on top of the span-pair representation to predict the probability distribution of relation types: P_r(r | s_i, s_j) = softmax(W_r h_r(s_i, s_j)). For our approximation model (Section 4.3), we batch candidate pairs by adding 4 markers for each pair to the end of the sentence, until the total number of tokens exceeds 250. We train our models with the Adam optimizer and a linear learning-rate scheduler with a warmup ratio of 0.1. For all experiments, we train the entity model for 100 epochs with a learning rate of 1e-5 for weights in the pre-trained LMs, 5e-4 for the others, and a batch size of 16. We train the relation model for 10 epochs with a learning rate of 2e-5 and a batch size of 32.
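The two classification heads above are small enough to sketch in plain Python. This is an illustrative stand-in (weights here are toy lists, not the trained parameters): a 2-layer ReLU FFNN followed by softmax for entity types, and a single linear layer followed by softmax for relation types.

```python
import math

def relu(v):
    return [max(x, 0.0) for x in v]

def linear(W, b, v):
    # W is a list of rows; returns W v + b.
    return [sum(wi * xi for wi, xi in zip(row, v)) + bi
            for row, bi in zip(W, b)]

def softmax(v):
    m = max(v)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in v]
    z = sum(exps)
    return [e / z for e in exps]

def entity_head(h, W1, b1, W2, b2):
    # P_e(e | s_i) = softmax(W_e FFNN(h_e(s_i)))
    return softmax(linear(W2, b2, relu(linear(W1, b1, h))))

def relation_head(h, Wr, br):
    # P_r(r | s_i, s_j) = softmax(W_r h_r(s_i, s_j))
    return softmax(linear(Wr, br, h))
```

In the real implementation these heads sit on top of the encoder outputs and are trained with the cross-entropy losses of Section 3.2; the sketch only fixes the shapes and the order of operations.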

C Performance with Varying Data Sizes
We compare our pipelined model to a joint model with 10%, 25%, 50%, and 100% of the training data on the ACE05 dataset. Here, our goal is to understand whether our finding still holds when the training data is smaller (and hence entity predictions are expected to contain more errors).
Our joint baseline is our re-implementation of DYGIE++ without propagation layers (the encoders are shared between the entity and relation models, no input markers are used, and the top-scoring 0.4n entities are kept in beam pruning). As shown in Table 7, we find that our model achieves even larger gains in relation F1 over the joint model when the number of training examples is reduced. This further highlights the importance of explicitly encoding entity boundary and type features in data-scarce scenarios.