ERICA: Improving Entity and Relation Understanding for Pre-trained Language Models via Contrastive Learning

Pre-trained Language Models (PLMs) have shown superior performance on various downstream Natural Language Processing (NLP) tasks. However, conventional pre-training objectives do not explicitly model relational facts in text, which are crucial for textual understanding. To address this issue, we propose a novel contrastive learning framework ERICA to obtain a deep understanding of the entities and their relations in text. Specifically, we define two novel pre-training tasks to better understand entities and relations: (1) the entity discrimination task to distinguish which tail entity can be inferred by the given head entity and relation; (2) the relation discrimination task to distinguish whether two relations are close or not semantically, which involves complex relational reasoning. Experimental results demonstrate that ERICA can improve typical PLMs (BERT and RoBERTa) on several language understanding tasks, including relation extraction, entity typing and question answering, especially under low-resource settings.


Introduction
Pre-trained Language Models (PLMs) (Devlin et al., 2018) have shown superior performance on various Natural Language Processing (NLP) tasks such as text classification (Wang et al., 2018), named entity recognition (Sang and De Meulder, 2003), and question answering (Talmor and Berant, 2019). Benefiting from various effective self-supervised learning objectives, such as masked language modeling (Devlin et al., 2018), PLMs can effectively capture the syntax and semantics in text to generate informative language representations for downstream NLP tasks. However, conventional pre-training objectives do not explicitly model relational facts, which are frequently distributed in text and are crucial for understanding the whole text. To address this issue, some recent studies attempt to improve PLMs to better understand relations between entities (Soares et al., 2019; Peng et al., 2020). However, they mainly focus on within-sentence relations in isolation, ignoring both the understanding of entities and the interactions among multiple entities at the document level, whose relation understanding involves complex reasoning patterns. According to statistics on a human-annotated corpus sampled from Wikipedia documents by Yao et al. (2019), at least 40.7% of relational facts can only be extracted from multiple sentences. We show an example in Figure 1: to understand that "Guadalajara is located in Mexico", we need to consider the following clues jointly: (i) "Mexico" is the country of "Culiacán", from sentence 1; (ii) "Culiacán" is a rail junction located on the "Panamerican Highway", from sentence 6; (iii) the "Panamerican Highway" connects to "Guadalajara", from sentence 6. From this example, we can see two main challenges in capturing in-text relational facts: 1. To understand an entity, we should consider its relations to other entities comprehensively.
In the example, the entity "Culiacán", occurring in sentence 1, 2, 3, 5, 6 and 7, plays an important role in finding out the answer. To understand "Culiacán", we should consider all its connected entities and diverse relations among them.
2. To understand a relation, we should consider the complex reasoning patterns in text. For example, to understand the complex inference chain in the example, we need to perform multi-hop reasoning, i.e., inferring that "Panamerican Highway" is located in "Mexico" through the first two clues.
In this paper, we propose ERICA, a novel framework to improve PLMs' capability of Entity and RelatIon understanding via ContrAstive learning, aiming to better capture in-text relational facts by comprehensively considering the interactions among entities and relations. Specifically, we define two novel pre-training tasks: (1) the entity discrimination task, which distinguishes which tail entity can be inferred from a given head entity and relation; it improves the understanding of each entity by considering its relations to other entities in text; (2) the relation discrimination task, which distinguishes whether two relations are semantically close. By constructing entity pairs with document-level distant supervision, the latter implicitly takes complex relational reasoning chains into consideration and thus improves relation understanding.
We conduct experiments on a suite of language understanding tasks, including relation extraction, entity typing and question answering. The experimental results show that ERICA improves the performance of typical PLMs (BERT and RoBERTa) and outperforms baselines, especially under low-resource settings, which demonstrates that ERICA effectively improves PLMs' entity and relation understanding and captures the in-text relational facts.

Related Work
Dai and Le (2015) and Howard and Ruder (2018) propose to pre-train universal language representations on unlabeled text and perform task-specific fine-tuning. With the advance of computing power, PLMs such as OpenAI GPT (Radford et al., 2018), BERT (Devlin et al., 2018) and XLNet, based on the deep Transformer (Vaswani et al., 2017) architecture, demonstrate their superiority in various downstream NLP tasks. Since then, numerous PLM extensions have been proposed to further explore the impacts of various model architectures (Song et al., 2019; Raffel et al., 2020), larger model sizes (Raffel et al., 2020; Lan et al., 2020; Fedus et al., 2021), more pre-training corpora, etc., to obtain better general language understanding ability. Although achieving great success, these PLMs usually regard words as the basic units in textual understanding, ignoring the informative entities and their relations, which are crucial for understanding the whole text.
To improve the entity and relation understanding of PLMs, a typical line of work is knowledge-guided PLMs, which incorporate external knowledge such as Knowledge Graphs (KGs) into PLMs to enhance entity and relation understanding. Some enforce PLMs to memorize information about real-world entities through novel pre-training objectives (Xiong et al., 2019; Yamada et al., 2020). Others modify the internal structures of PLMs to fuse both textual and KG information (Peters et al., 2019; He et al., 2020). Although knowledge-guided PLMs introduce extra factual knowledge from KGs, these methods ignore the intrinsic relational facts in text, making it hard to understand out-of-KG entities or knowledge in downstream tasks, let alone cope with the errors and incompleteness of KGs. This verifies the necessity of teaching PLMs to understand relational facts from contexts.
Another line of work directly models entities or relations in text during the pre-training stage to break the limitations of individual token representations. Some focus on obtaining better span representations, including entity mentions, via span-based pre-training (Joshi et al., 2020; Kong et al., 2020; Ye et al., 2020). Others learn to extract relation-aware semantics from text by comparing sentences that share the same entity pair or the same distantly supervised relation in KGs (Soares et al., 2019; Peng et al., 2020). However, these methods only consider either individual entities or within-sentence relations, which limits their performance on multiple entities and relations at the document level. In contrast, our ERICA considers the interactions among multiple entities and relations comprehensively, achieving a better understanding of in-text relational facts.

Methodology
In this section, we introduce the details of ERICA. We first describe the notations and how to represent entities and relations in documents. Then we detail the two novel pre-training tasks: Entity Discrimination (ED) task and Relation Discrimination (RD) task, followed by the overall training objective.

Notations
ERICA is trained on a large-scale unlabeled corpus leveraging distant supervision from an external KG $\mathcal{K}$. Formally, let $\mathcal{D} = \{d_i\}_{i=1}^{|\mathcal{D}|}$ be a batch of documents and $\mathcal{E}_i = \{e_{ij}\}_{j=1}^{|\mathcal{E}_i|}$ be all named entities in $d_i$, where $e_{ij}$ is the $j$-th entity in $d_i$. For each document $d_i$, we enumerate all entity pairs $(e_{ij}, e_{ik})$ and link them to their corresponding relation $r^i_{jk}$ in $\mathcal{K}$ (if possible), obtaining a tuple set $\mathcal{T}_i = \{t^i_{jk} = (d_i, e_{ij}, r^i_{jk}, e_{ik}) \mid j \neq k\}$. We assign no_relation to those entity pairs without relation annotation in $\mathcal{K}$. Then we obtain the overall tuple set $\mathcal{T} = \mathcal{T}_1 \cup \mathcal{T}_2 \cup \ldots \cup \mathcal{T}_{|\mathcal{D}|}$ for this batch. The positive tuple set $\mathcal{T}^+$ is constructed by removing all tuples with no_relation from $\mathcal{T}$. Benefiting from document-level distant supervision, $\mathcal{T}^+$ includes both intra-sentence entity pairs (relatively simple cases) and inter-sentence entity pairs (hard cases), whose relation understanding involves cross-sentence, multi-hop, or coreferential reasoning, i.e., $\mathcal{T}^+ = \mathcal{T}^+_{\text{single}} \cup \mathcal{T}^+_{\text{cross}}$.
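The tuple-set construction above can be sketched as follows. This is an illustrative pure-Python stand-in (function and field names are hypothetical, and the KG is abstracted as a dictionary of distantly supervised facts), not the paper's actual data pipeline:

```python
# Illustrative sketch: build the tuple sets T and T+ for one document by
# distant supervision against a KG mapping (head_id, tail_id) -> relation.

def build_tuples(doc_id, entities, kg):
    """entities: list of KG ids of the named entities in the document.
    kg: dict mapping (head_id, tail_id) -> relation name."""
    tuples = []
    for j, head in enumerate(entities):
        for k, tail in enumerate(entities):
            if j == k:
                continue
            # entity pairs without an annotation in the KG get no_relation
            rel = kg.get((head, tail), "no_relation")
            tuples.append((doc_id, head, rel, tail))
    return tuples

def positive_tuples(tuples):
    # T+ drops every tuple labeled no_relation
    return [t for t in tuples if t[2] != "no_relation"]
```

In the real setting this runs over every document in the batch and the resulting positives are further split into intra-sentence and inter-sentence pairs.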

Entity & Relation Representation
For each document $d_i$, we first use a PLM to encode it and obtain a series of hidden states $\{\mathbf{h}_1, \mathbf{h}_2, \ldots, \mathbf{h}_{|d_i|}\}$; then we apply a mean pooling operation over the consecutive tokens that mention $e_{ij}$ to obtain local entity representations. Note that $e_{ij}$ may appear multiple times in $d_i$; the $k$-th occurrence of $e_{ij}$, which spans the tokens from index $n^k_{\text{start}}$ to $n^k_{\text{end}}$, is represented as $\mathbf{m}^k_{e_{ij}} = \text{MeanPool}(\mathbf{h}_{n^k_{\text{start}}}, \ldots, \mathbf{h}_{n^k_{\text{end}}})$. To aggregate all information about $e_{ij}$, we average all representations of each occurrence $\mathbf{m}^k_{e_{ij}}$ as the global entity representation $\mathbf{e}_{ij}$. Following Soares et al. (2019), we concatenate the final representations of two entities $e_{ij_1}$ and $e_{ij_2}$ as their relation representation, i.e., $\mathbf{r}^i_{j_1 j_2} = [\mathbf{e}_{ij_1}; \mathbf{e}_{ij_2}]$.

Entity Discrimination Task
The Entity Discrimination (ED) task aims at inferring the tail entity in a document given a head entity and a relation. By distinguishing the ground-truth tail entity from other entities in the text, it teaches PLMs to understand an entity via its relations with other entities. As shown in Figure 2, in practice we first sample a tuple $t^i_{jk} = (d_i, e_{ij}, r^i_{jk}, e_{ik})$ from $\mathcal{T}^+$; PLMs are then asked to distinguish the ground-truth tail entity $e_{ik}$ from the other entities in the document $d_i$. To inform PLMs of which head entity and relation to condition on, we concatenate the relation name of $r^i_{jk}$, the mention of the head entity $e_{ij}$ and a separation token [SEP] in front of $d_i$, i.e., $d^*_i$ = "relation_name entity_mention [SEP] $d_i$". The goal of the entity discrimination task is equivalent to maximizing the posterior $P(e_{ik} \mid e_{ij}, r^i_{jk}) = \text{softmax}(f(\mathbf{e}_{ik}))$, where $f(\cdot)$ denotes an entity classifier. However, we empirically find that directly optimizing this posterior does not adequately model the relations among entities. Hence, we borrow the idea of contrastive learning (Hadsell et al., 2006) and push the representations of the positive pair $(e_{ij}, e_{ik})$ closer than those of negative pairs. The loss function of the ED task can be formulated as:

$\mathcal{L}_{ED} = -\sum_{t^i_{jk} \in \mathcal{T}^+} \log \frac{\exp(\cos(\mathbf{e}_{ij}, \mathbf{e}_{ik}) / \tau)}{\sum_{l=1, l \neq j}^{|\mathcal{E}_i|} \exp(\cos(\mathbf{e}_{ij}, \mathbf{e}_{il}) / \tau)},$

where $\cos(\cdot, \cdot)$ denotes the cosine similarity between two entity representations and $\tau$ (temperature) is a hyper-parameter.
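The ED objective for a single sampled tuple can be sketched in pure Python (function names are illustrative; real training batches this over tensors):

```python
import math

def cos_sim(u, v):
    # cosine similarity between two entity representations
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def ed_loss_one(head, tail, candidates, tau=0.05):
    """Contrastive ED loss for one tuple: the ground-truth tail should score
    higher against the head than every other entity in the document.
    `candidates` holds the representations of all entities except the head
    (the ground-truth tail must be among them)."""
    pos = math.exp(cos_sim(head, tail) / tau)
    denom = sum(math.exp(cos_sim(head, e) / tau) for e in candidates)
    return -math.log(pos / denom)
```

The loss is small when the tail is the head's nearest candidate and grows when some other entity scores higher, which is exactly the discrimination the task asks for.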

Relation Discrimination Task
The Relation Discrimination (RD) task aims at distinguishing whether two relations are semantically close. Compared with existing relation-enhanced PLMs, we employ document-level rather than sentence-level distant supervision, so that PLMs must comprehend the complex reasoning chains found in real-world scenarios, which further improves their relation understanding. As depicted in Figure 3, we train the text-based relation representations of entity pairs that share the same relation to be closer in the semantic space. In practice, we linearly sample a tuple pair $t_A = (d_A, e_{A_1}, r_A, e_{A_2})$ and $t_B = (d_B, e_{B_1}, r_B, e_{B_2})$ with $r_A = r_B$ from $\mathcal{T}^+$, where the sampling rate of each relation is proportional to its total number of occurrences in the current batch. Using the method described in Sec. 3.2, we obtain the positive relation representations $\mathbf{r}_{t_A}$ and $\mathbf{r}_{t_B}$ for $t_A$ and $t_B$. To discriminate positive examples from negative ones, we similarly adopt contrastive learning and define the loss function of the RD task as follows:

$\mathcal{L}_{RD}^{\mathcal{T}_1, \mathcal{T}_2} = -\sum_{t_A \in \mathcal{T}_1, t_B \in \mathcal{T}_2, r_A = r_B} \log \frac{\exp(\cos(\mathbf{r}_{t_A}, \mathbf{r}_{t_B}) / \tau)}{\sum_{t_C \in \mathcal{Z}} \exp(\cos(\mathbf{r}_{t_A}, \mathbf{r}_{t_C}) / \tau)}, \quad (2)$

$\mathcal{L}_{RD} = \mathcal{L}_{RD}^{\mathcal{T}^+_{\text{single}}, \mathcal{T}^+_{\text{single}}} + \mathcal{L}_{RD}^{\mathcal{T}^+_{\text{single}}, \mathcal{T}^+_{\text{cross}}} + \mathcal{L}_{RD}^{\mathcal{T}^+_{\text{cross}}, \mathcal{T}^+_{\text{single}}} + \mathcal{L}_{RD}^{\mathcal{T}^+_{\text{cross}}, \mathcal{T}^+_{\text{cross}}}, \quad (3)$

where $\mathcal{Z}$ is a set of $N$ sampled tuples containing the positive $t_B$.
Here $N$ is a hyper-parameter. We ensure that $t_B$ is included in $\mathcal{Z}$ and construct the $N - 1$ negative examples by sampling $t_C$ $(r_C \neq r_A)$ from $\mathcal{T}$ instead of $\mathcal{T}^+$. By additionally considering the last three terms of $\mathcal{L}_{RD}$ in Eq. 3, which require the model to distinguish complex inter-sentence relations from other relations in the text, our model achieves better coverage and generality over reasoning chains: PLMs are trained to perform reasoning implicitly in order to understand the "hard" inter-sentence cases.
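One term of the RD loss, for a single sampled tuple pair, can be sketched as below (a pure-Python illustration under the sampling scheme described above; the full loss sums such terms over the four combinations of intra- and inter-sentence tuple sets):

```python
import math

def cos_sim(u, v):
    # cosine similarity between two relation representations
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def rd_loss_term(r_a, r_b, negatives, tau=0.05):
    """One RD term for a tuple pair sharing a KG relation: r_b is the positive
    relation representation, `negatives` holds the N-1 representations of
    tuples with different relations, so the candidate set Z = {r_b} + negatives."""
    pool = [r_b] + negatives
    pos = math.exp(cos_sim(r_a, r_b) / tau)
    denom = sum(math.exp(cos_sim(r_a, r_c) / tau) for r_c in pool)
    return -math.log(pos / denom)
```

As with the ED loss, minimizing this term pulls representations of the same relation together while pushing representations of different relations apart.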

Overall Objective
Now we present the overall training objective of ERICA. To avoid catastrophic forgetting (McCloskey and Cohen, 1989) of general language understanding ability, we train the masked language modeling task ($\mathcal{L}_{MLM}$) together with the ED and RD tasks. Hence, the overall learning objective is formulated as follows:

$\mathcal{L} = \mathcal{L}_{ED} + \mathcal{L}_{RD} + \mathcal{L}_{MLM}.$

It is worth mentioning that we also tried masking entities, as suggested by Soares et al. (2019) and Peng et al. (2020), to avoid simply relearning an entity linking system. However, we do not observe a performance gain from such a masking strategy. We conjecture that in our document-level setting, it is hard for PLMs to overfit by memorizing entity mentions, thanks to the better coverage and generality of document-level distant supervision. Besides, masking entities creates a gap between pre-training and fine-tuning, which may be a shortcoming of previous relation-enhanced PLMs.

Experiments
In this section, we first describe how we construct the distantly supervised dataset and the pre-training details of ERICA. Then we introduce the experiments we conduct on several language understanding tasks, including relation extraction (RE), entity typing (ET) and question answering (QA). We test ERICA on two typical PLMs, BERT and RoBERTa (denoted as ERICA BERT and ERICA RoBERTa). We leave the training details for downstream tasks and the experiments on the GLUE benchmark (Wang et al., 2018) to the appendix.

Distantly Supervised Dataset Construction
Following Yao et al. (2019), we construct our pre-training dataset leveraging distant supervision from the English Wikipedia and Wikidata. First, we use spaCy to perform Named Entity Recognition, and then link these entity mentions, as well as Wikipedia's hyper-linked mentions, to Wikidata items, thus obtaining the Wikidata ID for each entity. The relations between different entities are annotated distantly by querying Wikidata. We keep the documents containing at least 128 words, 4 entities and 4 relational triples. In addition, we exclude entity pairs appearing in the test sets of the RE and QA tasks to avoid test set leakage. In the end, we collect 1,000,000 documents (about 1 GB of storage) in total, with more than 4,000 relations annotated distantly. On average, each document contains 186.9 tokens, 12.9 entities and 7.2 relational triples, and each entity appears 1.3 times per document. Based on a human evaluation of a random sample of the dataset, we find that it achieves an F1 score of 84.7% for named entity recognition and an F1 score of 25.4% for relation extraction.
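The document filter described above is a simple conjunction of thresholds; a minimal sketch (field names are hypothetical):

```python
def keep_document(doc):
    """Keep a document only if it has at least 128 words, 4 entities and
    4 relational triples, matching the filtering rule described above."""
    return (len(doc["words"]) >= 128
            and len(doc["entities"]) >= 4
            and len(doc["triples"]) >= 4)
```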

Pre-training Details
We initialize ERICA BERT and ERICA RoBERTa with the bert-base-uncased and roberta-base checkpoints released by Google and Huggingface. We adopt AdamW (Loshchilov and Hutter, 2017) as the optimizer, warm up the learning rate for the first 20% of steps and then linearly decay it. We set the learning rate to $3 \times 10^{-5}$, the weight decay to $1 \times 10^{-5}$, the batch size to 2,048 and the temperature $\tau$ to $5 \times 10^{-2}$. For $\mathcal{L}_{RD}$, we randomly select up to 64 negative samples per document. We train both models on 8 NVIDIA Tesla P40 GPUs for 2,500 steps.
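The warmup-then-linear-decay schedule can be written as a small pure function (an illustrative version of the schedule with the hyper-parameters above; real training would typically use a library scheduler):

```python
def lr_at(step, total_steps=2500, base_lr=3e-5, warmup_frac=0.2):
    """Linear warmup over the first 20% of steps, then linear decay to zero."""
    warmup = int(total_steps * warmup_frac)
    if step < warmup:
        return base_lr * step / warmup          # ramp up from 0 to base_lr
    return base_lr * (total_steps - step) / (total_steps - warmup)  # decay to 0
```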

Relation Extraction
Relation extraction aims to extract the relation between two recognized entities from a pre-defined relation set. We conduct experiments on both sentence-level and document-level RE, using three partitions of the training set (1%, 10% and 100%), and report results on the test sets.
Document-level RE For document-level RE, we choose DocRED (Yao et al., 2019), which requires reading multiple sentences in a document and synthesizing all the information to identify the relation between two entities. We encode all entities in the same way as in the pre-training phase. The relation representations are obtained by adding a bilinear layer on top of the two entity representations. We choose the following baselines: (1) CNN (Zeng et al., 2014), BILSTM (Hochreiter and Schmidhuber, 1997), BERT (Devlin et al., 2018) and RoBERTa, which are widely used as text encoders for relation extraction tasks; (2) HINBERT (Tang et al., 2020), which employs a hierarchical inference network to leverage the abundant information from different sources; (3) CorefBERT (Ye et al., 2020), which proposes a pre-training method to help BERT capture coreferential relations in context; (4) SpanBERT (Joshi et al., 2020), which masks contiguous spans during pre-training to obtain better span representations; (5) MTB (Soares et al., 2019) and CP (Peng et al., 2020), which introduce sentence-level relation contrastive learning for BERT via distant supervision. For fair comparison, we pre-train these baselines on our constructed pre-training data based on the implementation released by Peng et al. (2020).
From the results shown in Table 1, we can see that: (1) ERICA outperforms all baselines significantly at each supervised data size, which demonstrates that ERICA could better understand the relations among entities in a document by implicitly considering their complex reasoning patterns during pre-training; (2) both MTB and CP achieve worse results than BERT, which means sentence-level pre-training, lacking consideration of complex reasoning patterns, hurts PLMs' performance on document-level RE tasks to some extent; (3) ERICA outperforms baselines by a larger margin on smaller training sets, which means ERICA acquires strong document-level relation reasoning ability through contrastive learning and thus obtains larger improvements under low-resource settings. In addition, the sentence-level RE results show that ERICA does not impair PLMs' performance on sentence-level relation understanding.
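The document-level RE head described above, a bilinear layer on top of two entity representations, can be sketched as follows (a pure-Python illustration; shapes and names are hypothetical):

```python
def bilinear_scores(e1, e2, W, b):
    """Relation scores from a bilinear layer on two entity representations:
    score[r] = e1^T W[r] e2 + b[r], one score per candidate relation.
    W: list of [d x d] matrices, one per relation; b: list of biases."""
    scores = []
    for Wr, br in zip(W, b):
        s = sum(e1[i] * Wr[i][j] * e2[j]
                for i in range(len(e1)) for j in range(len(e2)))
        scores.append(s + br)
    return scores
```

During fine-tuning, the entity representations come from the same mean-pooling scheme as in pre-training, and the scores feed a standard classification loss.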

Entity Typing
Entity typing aims at classifying entity mentions into pre-defined entity types. We choose FIGER, a sentence-level entity typing dataset labeled with distant supervision. BERT, RoBERTa, MTB, CP and ERNIE are chosen as baselines. From the results listed in Table 3, we observe that ERICA outperforms all baselines, which demonstrates that ERICA could better represent entities and distinguish them in text via both entity-level and relation-level contrastive learning.

Question Answering
Question answering aims to extract a specific answer span in text given a question. We conduct experiments on both multi-choice and extractive QA. We test multiple partitions of the training set.
Multi-choice QA For Multi-choice QA, we choose WikiHop (Welbl et al., 2018), which requires models to answer specific properties of an entity after reading multiple documents and conducting multi-hop reasoning. It has both standard and masked settings, where the latter setting masks all entities with random IDs to avoid information leakage. We first concatenate the question and documents into a long sequence, then we find all the occurrences of an entity in the documents, encode them into hidden representations and obtain the global entity representation by applying mean pooling on these hidden representations. Finally, we use a classifier on top of the entity representation for prediction. We choose the following baselines: (1) FastQA (Weissenborn et al., 2017) and BiDAF (Seo et al., 2016), which are widely used question answering systems; (2) BERT, RoBERTa, CorefBERT, SpanBERT, MTB and CP, which are introduced in previous sections. From the results listed in Table 4, we observe that ERICA outperforms baselines in both settings, indicating that ERICA can better understand entities and their relations in the documents and extract the true answer according to queries. The significant improvements in the masked setting also indicate that ERICA can better perform multi-hop reasoning to synthesize and analyze information from contexts, instead of relying on entity mention "shortcuts" (Jiang and Bansal, 2019).
Extractive QA For extractive QA, we adopt three widely-used datasets: SQuAD (Rajpurkar et al., 2016), TriviaQA (Joshi et al., 2017) and NaturalQA in MRQA (Fisch et al., 2019) to evaluate ERICA in various domains. Since MRQA does not provide a test set for each dataset, we randomly split the original dev set into two halves to obtain new dev/test sets. We follow the QA setting of BERT (Devlin et al., 2018): we concatenate the given question and passage into one long sequence, encode the sequence with PLMs, and adopt two classifiers to predict the start and end indices of the answer. We choose BERT, RoBERTa, MTB and CP as baselines. From the results listed in Table 5, we observe that ERICA outperforms all baselines, indicating that, through enhanced entity and relation understanding, ERICA is more capable of capturing in-text relational facts and synthesizing information about entities. This ability further improves PLMs for question answering.
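The start/end prediction described above can be sketched as follows. This is a simplified pure-Python illustration of the two linear heads (real decoding is done on tensors and usually scores all valid spans jointly):

```python
def span_predict(hidden, w_start, w_end):
    """Score each token with two linear heads and return the argmax start
    position and the best end position at or after it."""
    start_scores = [sum(h[d] * w_start[d] for d in range(len(h))) for h in hidden]
    end_scores = [sum(h[d] * w_end[d] for d in range(len(h))) for h in hidden]
    start = max(range(len(hidden)), key=lambda i: start_scores[i])
    end = max(range(start, len(hidden)), key=lambda i: end_scores[i])
    return start, end
```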

Analysis
In this section, we first conduct a suite of ablation studies to explore how $\mathcal{L}_{ED}$ and $\mathcal{L}_{RD}$ contribute to ERICA. Then we give a thorough analysis of how the pre-training data's domain and size, as well as the methods for entity encoding, impact performance. Lastly, we visualize the entity and relation embeddings learned by ERICA.

Ablation Study
To demonstrate that the superior performance of ERICA is not owing to its longer pre-training (2,500 steps) on masked language modeling, we include a baseline optimized with $\mathcal{L}_{MLM}$ only (removing the Next Sentence Prediction (-NSP) loss (Devlin et al., 2018)). In addition, to explore how $\mathcal{L}_{ED}$ and $\mathcal{L}_{RD}$ impact the performance, we keep only one of these two losses and compare the results. Lastly, to evaluate how intra-sentence and inter-sentence entity pairs contribute to the RD task, we compare the performance of sampling only intra-sentence entity pairs with that of sampling only inter-sentence entity pairs. From the results shown in Table 6, we can see that: (1) extra pre-training (-NSP) contributes only a little to the overall improvement. (2) For DocRED and FIGER, either $\mathcal{L}_{ED}$ or $\mathcal{L}_{RD}$ is beneficial, and combining them further improves the performance. For WikiHop, $\mathcal{L}_{ED}$ dominates the improvement while $\mathcal{L}_{RD}$ hurts the performance slightly, possibly because question answering more closely resembles the tail entity discrimination process, with which the relation discrimination process may conflict. (3) For $\mathcal{L}_{RD}$, both intra-sentence and inter-sentence entity pairs contribute, which demonstrates that incorporating both is necessary for PLMs to comprehensively understand relations between entities in text. We also find empirically that when these two auxiliary objectives are added only in the fine-tuning stage, the model shows no performance gain. The reason is that the size and diversity of entities and relations in downstream training data are limited; pre-training with distant supervision on a large corpus, in contrast, increases the diversity and quantity of training examples.

Effects of Domain Shifting
We investigate two domain shifting factors: entity distribution and relation distribution, to explore how they impact ERICA's performance.

Entity Distribution Shifting
The entities in the supervised DocRED dataset are recognized by human annotators, while our pre-training data is processed by spaCy; hence there may exist an entity distribution gap between pre-training and fine-tuning. To study the impact of entity distribution shifting, we fine-tune a BERT model on the training set of DocRED for NER tagging and re-tag the entities in our pre-training dataset. Then we pre-train ERICA on the newly-labeled training corpus (denoted as ERICA DocRED BERT). From the results shown in Table 7, we observe that it performs better than the original ERICA, indicating that pre-training on a dataset that shares similar entity distributions with downstream tasks is beneficial.
Relation Distribution Shifting Our pre-training data contains over 4,000 Wikidata relations. To investigate whether training on a more diverse relation domain benefits ERICA, we train it on pre-training corpora that randomly keep only 30%, 50% and 70% of the original relations, and compare their performances. From the results in Figure 4, we observe that ERICA's performance improves steadily as the diversity of the relation domain increases, which reveals the importance of diverse training data for relation-related tasks. Through detailed analysis, we further find that ERICA is less competent at handling relations unseen in the corpus. This may result from the construction of our pre-training dataset: all relations are annotated distantly through an existing KG with a pre-defined relation set. It would be promising to introduce more diverse relation domains during data preparation in the future.
Table 8: Results (IgF1) on how the entity encoding strategy influences ERICA's performance on DocRED, also showing the impact of entity distribution shifting (ERICA DocRED BERT) as mentioned in the main paper.

Effects of Pre-training Data's Size
To explore the effects of the pre-training data's size, we train ERICA on 10%, 30%, 50% and 70% of the original pre-training dataset, respectively. We report the results in Figure 5, from which we observe that ERICA performs better as the scale of the pre-training data grows.

Effects of Methods for Entity Encoding
For all the experiments mentioned above, we encode each occurrence of an entity by mean pooling over all its tokens, in both pre-training and downstream tasks. Ideally, ERICA should yield consistent improvements with other methods for entity encoding. To demonstrate this, we try another entity encoding method, mentioned by Soares et al. (2019), on three splits of DocRED (1%, 10% and 100%). Specifically, we insert a special start token [S] in front of an entity and an end token [E] after it. The representation of this entity is calculated by averaging the representations of all its start tokens in the document. To help PLMs discriminate different entities, we randomly assign different marker pairs ([S1], [E1]; [S2], [E2]; ...) to each entity in a document, in both pre-training and downstream tasks (in practice, we randomly initialize 100 entity marker pairs). All occurrences of one entity in a document share the same marker pair. We show in Table 8 that ERICA achieves consistent performance improvements with both methods (denoted as Mean Pool and Entity Marker), indicating that ERICA is applicable to different methods for entity encoding. Specifically, Entity Marker achieves better performance when the scale of training data is large, while Mean Pool is more powerful under low-resource settings. We also notice that training on a dataset that shares similar entity distributions is more helpful for Mean Pool, where ERICA DocRED BERT achieves 60.8 (F1) and 58.4 (IgF1) on 100% training data.
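The Entity Marker preprocessing can be sketched as a token-level rewrite (an illustrative implementation; span format and marker strings are assumptions):

```python
def insert_markers(tokens, entities):
    """Wrap each entity occurrence with a per-entity marker pair [Sk]/[Ek];
    all occurrences of the same entity share one pair.
    `entities`: list of (entity_id, start, end) non-overlapping token spans."""
    pair_of = {}          # entity_id -> marker pair index
    out, prev = [], 0
    for eid, s, e in sorted(entities, key=lambda x: x[1]):
        k = pair_of.setdefault(eid, len(pair_of) + 1)
        out.extend(tokens[prev:s])
        out.append(f"[S{k}]")
        out.extend(tokens[s:e])
        out.append(f"[E{k}]")
        prev = e
    out.extend(tokens[prev:])
    return out
```

The entity representation is then the average of the hidden states at that entity's [Sk] positions.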

Embedding Visualization
In Figure 6, we show the learned entity and relation embeddings of BERT and ERICA BERT on DocRED's dev set using t-distributed stochastic neighbor embedding (t-SNE) (Hinton and Roweis, 2002). We label points with different colors to indicate their corresponding categories of entities or relations in Wikidata and only visualize the 10 most frequent relations. The (key, value) pairs of the visualized relations, as defined in Wikidata, are: (P176, manufacturer); (P150, contains administrative territorial entity); (P17, country); (P131, located in the administrative territorial entity); (P175, performer); (P27, country of citizenship); (P569, date of birth); (P1001, applies to jurisdiction); (P57, director); (P179, part of the series). From the figure, we can see that jointly training $\mathcal{L}_{MLM}$ with $\mathcal{L}_{ED}$ and $\mathcal{L}_{RD}$ leads to a more compact clustering of both entities and relations belonging to the same category, whereas training only $\mathcal{L}_{MLM}$ yields a random-looking distribution. This verifies that ERICA could better understand and represent both entities and relations in text.

Conclusions
In this paper, we present ERICA, a general framework for PLMs to improve entity and relation understanding via contrastive learning. We demonstrate the effectiveness of our method on several language understanding tasks, including relation extraction, entity typing and question answering. The experimental results show that ERICA outperforms all baselines, especially under low-resource settings, which means ERICA helps PLMs better capture the in-text relational facts and synthesize information about entities and their relations.

Appendices

A Training Details for Downstream Tasks
In this section, we introduce the training details for the downstream tasks (relation extraction, entity typing and question answering). We implement all models based on Huggingface transformers.

A.1 Relation Extraction
Document-level Relation Extraction For document-level relation extraction, we conduct experiments on DocRED (Yao et al., 2019), modifying the official code for our implementation. For the experiments on three partitions of the original training set (1%, 10% and 100%), we adopt batch sizes of 10, 32 and 32 and train for 400, 400 and 200 epochs, respectively. We use Adam (Kingma and Ba, 2014) as the optimizer with the learning rate set to $4 \times 10^{-5}$. We evaluate on the dev set every 20/20/5 epochs and then submit the best checkpoint to the official evaluation server for test set results.

Sentence-level Relation Extraction For sentence-level relation extraction, we conduct experiments on TACRED (Zhang et al., 2017) and SemEval-2010 Task 8 (Hendrickx et al., 2019) based on the implementation of Peng et al. (2020), using three partitions (1%, 10% and 100%) of the original training set. The relation representation for each entity pair is obtained in the same way as in the pre-training phase. Other settings are kept the same as in Peng et al. (2020) for fair comparison.

A.2 Entity Typing

For entity typing, we fine-tune the models for three epochs; other hyper-parameters are kept the same as those of ERNIE.

A.3 Question Answering
Multi-choice QA For multi-choice question answering, we choose WikiHop (Welbl et al., 2018). Since the standard setting of WikiHop does not provide an index for each candidate, we locate candidates by exact string matching in the documents. We experiment on three partitions of the original training data (1%, 10% and 100%). We set the batch size to 8 and the learning rate to $5 \times 10^{-5}$, and train for two epochs.
Extractive QA For extractive question answering, we adopt MRQA (Fisch et al., 2019) as the testbed and choose three datasets: SQuAD (Rajpurkar et al., 2016), TriviaQA (Joshi et al., 2017) and NaturalQA. We adopt Adam as the optimizer, set the learning rate to $3 \times 10^{-5}$ and train for two epochs. In the main paper, we report results on two splits (10% and 100%); results on 1% are listed in Table 11.

B Generalized Language Understanding (GLUE)
The General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018) provides several natural language understanding tasks and is often used to evaluate PLMs. To test whether $\mathcal{L}_{ED}$ and $\mathcal{L}_{RD}$ impair PLMs' performance on these tasks, we compare BERT, ERICA BERT, RoBERTa and ERICA RoBERTa. We follow the widely used setting and use the [CLS] token as the representation of the whole sentence or sentence pair for classification or regression. Table 9 shows the results on the dev sets of the GLUE benchmark. Both ERICA BERT and ERICA RoBERTa achieve performance comparable to the original models, which suggests that jointly training $\mathcal{L}_{ED}$ and $\mathcal{L}_{RD}$ with $\mathcal{L}_{MLM}$ does not hurt PLMs' general language understanding ability.

C Full results of ablation study
Full results of ablation study (DocRED, WikiHop and FIGER) are listed in Table 10.

D Joint Named Entity Recognition and Relation Extraction
Joint Named Entity Recognition (NER) and Relation Extraction (RE) aims at identifying entities in text and the relations between them. We