A Simple but Effective Pluggable Entity Lookup Table for Pre-trained Language Models

Pre-trained language models (PLMs) struggle to recall the rich factual knowledge about entities found in large-scale corpora, especially for rare entities. In this paper, we propose to build a simple but effective Pluggable Entity Lookup Table (PELT) on demand by aggregating an entity's output representations across its multiple occurrences in the corpora. PELT can be compatibly plugged in as input to infuse supplemental entity knowledge into PLMs. Compared to previous knowledge-enhanced PLMs, PELT requires only 0.2%-5% of the pre-computation and can acquire knowledge from out-of-domain corpora for domain adaptation scenarios. Experiments on knowledge-related tasks demonstrate that PELT can flexibly and effectively transfer entity knowledge from related corpora into PLMs with different architectures. Our code and models are publicly available at https://github.com/thunlp/PELT


Introduction
Recent advances in pre-trained language models (PLMs) have brought promising improvements on various downstream tasks (Devlin et al., 2019). Some recent works reveal that PLMs can automatically acquire knowledge from large-scale corpora via self-supervised pre-training and then encode the learned knowledge into their model parameters (Tenney et al., 2019; Petroni et al., 2019; Roberts et al., 2020). However, due to the limited capacity of their vocabulary, existing PLMs struggle to recall factual knowledge from their parameters, especially for rare entities (Gao et al., 2019a; Wang et al., 2021a).
To improve PLMs' capability of entity understanding, a straightforward solution is to exploit an external entity embedding acquired from the knowledge graph (KG) (Zhang et al., 2019), the entity description (Peters et al., 2019), or the corpora (Pörner et al., 2020). In order to make use of the external knowledge, these models usually learn to align the external entity embedding (Bordes et al., 2013; Yamada et al., 2016) to their original word embedding. However, previous works neglect to derive the entity embedding from the PLM itself, so their learned embedding mapping is not available for domain adaptation. Other recent works attempt to infuse knowledge into PLMs' parameters by extra pre-training, such as learning to build an additional entity vocabulary from the corpora (Yamada et al., 2020; Févry et al., 2020), or adopting entity-related pre-training tasks to strengthen the entity representation (Xiong et al., 2020; Sun et al., 2020; Wang et al., 2021b). However, their huge pre-computation increases the cost of extending or updating the customized vocabulary for various downstream tasks.
* Corresponding author: M. Sun (sms@tsinghua.edu
In this paper, we introduce a simple but effective Pluggable Entity Lookup Table (PELT) to infuse knowledge into PLMs. Specifically, we first revisit the connection between PLMs' input features and output representations for masked language modeling. Based on this, given a new corpus, we aggregate the output representations of masked tokens from the entity's occurrences to recover an elaborate entity embedding from a well-trained PLM. Benefiting from the compatibility and flexibility of the constructed embeddings, we can directly insert them into the corresponding positions of the input sequence to provide supplemental entity knowledge. As shown in Table 1, our method consumes merely 0.2%∼5% of the pre-computation of previous works, and it also supports vocabularies from different domains simultaneously.
We conduct experiments on two knowledge-related tasks, knowledge probe and relation classification, across two domains (Wikipedia and biomedical publications). Experimental results show that PLMs with PELT consistently and significantly outperform the corresponding vanilla models. In addition, the entity embeddings obtained from multiple domains are compatible with the original word embeddings and can be applied and transferred swiftly.

Methodology
In this section, we first revisit the masked language modeling pre-training objective. After that, we introduce the pluggable entity lookup table and explain how to apply it to incorporate knowledge into PLMs.

Revisit Masked Language Modeling
PLMs conduct self-supervised pre-training tasks, such as masked language modeling (MLM) (Devlin et al., 2019), to learn semantic and syntactic knowledge from large-scale unlabeled corpora (Rogers et al., 2020). MLM can be regarded as a kind of cloze task, which requires the model to predict the missing tokens based on their contextual representations. Formally, given a sequence of tokens $X = (x_1, x_2, \ldots, x_n)$, with $x_i$ substituted by [MASK], PLMs such as BERT first take the tokens' word embeddings and position embeddings as input and obtain the contextual representation:

$$H = \mathrm{Enc}(\mathrm{LayerNorm}(E(X) + P)), \tag{1}$$

where $\mathrm{Enc}(\cdot)$ denotes a deep bidirectional Transformer encoder, $\mathrm{LayerNorm}(\cdot)$ denotes layer normalization (Ba et al., 2016), $E \in \mathbb{R}^{|V| \times D}$ is the word embedding matrix, $V$ is the word vocabulary, $P$ is the absolute position embedding and $H = (h_1, h_2, \ldots, h_n)$ is the contextual representation.

Figure 1: An illustration of our PELT. Panel labels: COVID-19 Occurring Sentence (e.g., "WTO regards [MASK] has become a global epidemic."; "[MASK] is the disease caused by severe acute respiratory."), PLM Encoding, Masked Token's Output Rep., COVID-19.

After that, BERT applies a feed-forward network (FFN) and layer normalization on the contextual representation to compute the output representation of $x_i$:

$$r_{x_i} = \mathrm{LayerNorm}(\mathrm{FFN}(h_i)). \tag{2}$$

Since the weights of the softmax layer and the word embeddings are tied in BERT, the model calculates the product of $r_{x_i}$ and the input word embedding matrix to further compute $x_i$'s cross-entropy loss over all the words:

$$\mathcal{L}(x_i) = -\log \frac{\exp(E(x_i)^\top r_{x_i})}{\sum_{w_j \in V} \exp(E(w_j)^\top r_{x_i})}. \tag{3}$$
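The tied-weight MLM loss described above can be illustrated with a minimal sketch. It assumes toy two-dimensional embeddings and a hand-picked output representation; the names `mlm_loss`, `toy_embeddings`, and `r_masked` are hypothetical and are not taken from the released code.

```python
import math

def mlm_loss(embeddings, r, target):
    """Cross-entropy of `target` given the masked position's output representation `r`.

    embeddings: dict mapping each word to its input embedding (list of floats).
    Because softmax weights and word embeddings are tied, each word's logit is
    the dot product of its input embedding with r.
    """
    logits = {w: sum(a * b for a, b in zip(vec, r)) for w, vec in embeddings.items()}
    # Log of the softmax normalizer, computed in a numerically stable way.
    top = max(logits.values())
    log_z = top + math.log(sum(math.exp(v - top) for v in logits.values()))
    return log_z - logits[target]

# Toy vocabulary (illustrative values only).
toy_embeddings = {"cough": [1.0, 0.0], "fever": [0.0, 1.0], "table": [-1.0, -1.0]}
r_masked = [2.0, 0.5]

loss_cough = mlm_loss(toy_embeddings, r_masked, "cough")
loss_table = mlm_loss(toy_embeddings, r_masked, "table")
```

A word whose embedding is better aligned with the output representation receives a lower loss, which is the property the next section relies on.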

Construct Pluggable Entity Embedding
For the sake of training efficiency, the vocabulary sizes of existing PLMs typically range from 30K to 60K subword units, and thus PLMs have to disperse the information of a massive number of entities into their subword embeddings. Revisiting the MLM loss in Eq. 3, we can intuitively observe that the word embeddings and the output representations of BERT lie in the same vector space. Hence, we are able to recover entity embeddings from BERT's output representations to infuse their contextualized knowledge into the model. Specifically, given a general or domain-specific corpus, we build the lookup table on demand for the entities that occur in the downstream tasks. For an entity $e$, such as a Wikidata entity or a proper noun entity, we construct its embedding $E(e)$ as follows.

Direction. A feasible method to add entity $e$ to the vocabulary of a PLM is to optimize its embedding $E(e)$ for the MLM loss with the other parameters frozen. We collect the sentences $S_e$ that contain entity $e$ and substitute it with [MASK]. The total influence of $E(e)$ on the MLM loss in $S_e$ can be formulated as:

$$\mathcal{L}(e) = -\sum_{x_i \in S_e} \log \frac{\exp(E(e)^\top r_{x_i})}{Z_{x_i}} = -E(e)^\top \sum_{x_i \in S_e} r_{x_i} + \sum_{x_i \in S_e} \log Z_{x_i}, \tag{4}$$

where $Z_{x_i} = \sum_{w_j \in V \cup \{e\}} \exp(E(w_j)^\top r_{x_i})$, $x_i$ is the masked token that replaces entity $e$, and $r_{x_i}$ is the PLM's output representation of $x_i$.

Compared with the total impact of the entire vocabulary on $Z_{x_i}$, $E(e)$ has a much smaller impact. If we ignore the minor effect of $E(e)$ on $Z_{x_i}$, the optimal solution of $E(e)$ for $\mathcal{L}(e)$ is proportional to $\sum_{x_i \in S_e} r_{x_i}$. Hence, we set $E(e)$ as:

$$E(e) = C \cdot \sum_{x_i \in S_e} r_{x_i}, \tag{5}$$

where $C$ denotes the scaling factor. In practice, $E(e)$ also appears as a negative term in the log-likelihood of other words' MLM loss (Kong et al., 2020). However, Gao et al. (2019a) indicate that the gradient from such a negative log-likelihood term pushes all words in a uniformly negative direction, which weakens the quality of rare words' representations. Here, we ignore this negative term and obtain the informative entity embedding from Eq. 5.
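A short worked step may make the proportionality claim easier to follow. The sketch below assumes that $\mathcal{L}(e)$ is the MLM negative log-likelihood of $e$ summed over the masked positions in $S_e$, and it applies the same approximation as the text, namely that $Z_{x_i}$ is treated as independent of $E(e)$:

$$\frac{\partial \mathcal{L}(e)}{\partial E(e)} = -\sum_{x_i \in S_e} r_{x_i} + \sum_{x_i \in S_e} \frac{\exp(E(e)^\top r_{x_i})}{Z_{x_i}}\, r_{x_i} \approx -\sum_{x_i \in S_e} r_{x_i}.$$

The second sum weights each $r_{x_i}$ by the predicted probability of $e$, which is the contribution the approximation drops. What remains is a constant gradient, so the loss is linear in $E(e)$, and for any fixed norm it is smallest when $E(e)$ points along $\sum_{x_i \in S_e} r_{x_i}$ (by the Cauchy-Schwarz inequality). The norm itself is left undetermined by this argument.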
Norm. We define $p(e)$ as the position embedding for entity $e$. Since the layer normalization in Eq. 1 rescales the norm $|E(e) + p(e)|$ to $D^{1/2}$, we find that the norm $|E(e)|$ has little effect on the input feature of the encoder in use. Therefore, we set the norm of all the entity embeddings to a constant $L$. Then, we evaluate the model with different values of $L$ on the unsupervised knowledge probe task and choose the best $L$ for the fine-tuning tasks.
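The two steps above (direction from the summed output representations, then a fixed norm L) can be sketched in a few lines. This is a minimal illustration with toy vectors rather than real PLM outputs, and `build_entity_embedding` is a hypothetical helper name.

```python
import math

def build_entity_embedding(output_reps, norm_l):
    """Sum the masked-position output representations and rescale to norm L.

    output_reps: list of vectors, one per occurrence of the entity.
    norm_l: the constant norm L shared by all entity embeddings.
    """
    dim = len(output_reps[0])
    summed = [sum(rep[d] for rep in output_reps) for d in range(dim)]
    length = math.sqrt(sum(v * v for v in summed))
    if length == 0.0:
        raise ValueError("output representations sum to the zero vector")
    # Rescaling absorbs the scaling factor C: only the direction of the sum is kept.
    return [norm_l * v / length for v in summed]

# Two toy output representations (illustrative values only).
reps = [[3.0, 0.0], [0.0, 4.0]]
entity_vec = build_entity_embedding(reps, norm_l=7.0)
```

Because every embedding is rescaled to the same norm, the number of occurrences affects only the direction of the result, not its magnitude.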

Infuse Entity Knowledge into PLMs
Since the entity embedding we construct and the original word embedding are both obtained from the masked language modeling objective, the entity can be regarded as a special input token. To infuse entity knowledge into PLMs, we enclose the constructed entity embedding in a pair of brackets and insert it after the original entity's subwords. For example, the original input

    Most people with COVID-19 have a dry [MASK] they can feel in their chest.

becomes

    Most people with COVID-19 (COVID-19) have a dry [MASK] they can feel in their chest.

Here, the entity COVID-19 inside the brackets adopts our constructed entity embedding and the other words use their original embeddings. We simply feed the modified input to the PLM for encoding without any additional structures or parameters, to help the model predict [MASK] as cough.
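The input modification can be sketched as a simple list operation. The tokenization below is invented for illustration (a real subword tokenizer would split the sentence differently), and `insert_entity` and the placeholder string `<ent:COVID-19>` are hypothetical names. In an actual model, the placeholder position would take its input embedding from the entity lookup table rather than from the word embedding matrix.

```python
def insert_entity(tokens, span_end, entity_id):
    """Insert '(', an entity placeholder, and ')' after the entity's last subword.

    span_end: index one past the entity's last subword in `tokens`.
    Returns the new token list and the index of the placeholder.
    """
    new_tokens = tokens[:span_end] + ["(", entity_id, ")"] + tokens[span_end:]
    return new_tokens, span_end + 1

# Hypothetical tokenization of the example sentence (truncated for brevity).
tokens = ["Most", "people", "with", "COVID", "-", "19", "have", "a", "dry", "[MASK]"]
new_tokens, entity_pos = insert_entity(tokens, span_end=6, entity_id="<ent:COVID-19>")
```

The original subwords are left in place, so the model still sees the surface form and receives the entity embedding as additional input.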
A note on entity links. In the previous section, we assume that the entity linking annotations for the involved entity names are known. In practice, we can use the gold entity links provided by some datasets, such as FewRel 1.0. For datasets where linking annotations are not available, we employ heuristic string matching for entity linking (see Appendix A).

Implementation Details
We choose RoBERTa-Base, a well-optimized PLM, as our baseline model and equip it with our constructed entity embeddings to obtain the PELT model. For the knowledge probe task, we further experiment with another encoder-architecture model, uncased BERT-Base (Devlin et al., 2019), and an encoder-decoder-architecture model, BART-Base (Lewis et al., 2020).
We adopt Wikipedia and biomedical S2ORC (Lo et al., 2020) as the domain-specific corpora and split them into sentences with NLTK (Xue, 2011). For Wikipedia, we adopt a heuristic entity linking strategy with the help of hyperlink annotations. For the FewRel 1.0 and Wiki80 datasets, we directly use the annotated linking information. For other datasets, we link the given entity name through a simple string match. For each required entity, we first extract up to 256 sentences containing the entity from the corpora. We adopt Wikipedia as the domain-specific corpus for FewRel 1.0, Wiki80 and LAMA, and we adopt S2ORC as the domain-specific corpus for FewRel 2.0. After that, we construct the entity embedding according to Section 2.2.
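The sentence collection step for the string-match case can be sketched as follows. The cap of 256 sentences comes from the text; everything else is an assumption of this sketch, including the helper name `collect_masked_sentences`, the exact substring match, and the choice to mask only the first mention in each sentence.

```python
def collect_masked_sentences(sentences, entity_name, max_sentences=256, mask_token="[MASK]"):
    """Return up to `max_sentences` sentences mentioning the entity, with it masked."""
    masked = []
    for sentence in sentences:
        if entity_name in sentence:
            # Mask only the first mention (an assumption of this sketch).
            masked.append(sentence.replace(entity_name, mask_token, 1))
            if len(masked) >= max_sentences:
                break
    return masked

# Toy corpus built from the example sentences in Figure 1.
corpus = [
    "WTO regards COVID-19 has become a global epidemic.",
    "The weather was mild this week.",
    "COVID-19 is the disease caused by severe acute respiratory.",
]
masked_sentences = collect_masked_sentences(corpus, "COVID-19")
```

Each masked sentence would then be encoded by the PLM, and the output representation at the [MASK] position collected for aggregation.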
We search the norm of the entity embedding L over the range 1-10 on the knowledge probe task. We find that L = 7, 10, and 3 perform slightly better for RoBERTa, BERT and BART, respectively. In the fine-tuning process, we freeze the constructed embeddings as a lookup table with the corresponding norm. After that, we run all the fine-tuning experiments with 5 different seeds and report the average score.

Relation Classification
Relation Classification (RC) aims to predict the relationship between two entities in a given text. We evaluate the models in two scenarios, the few-shot setting and the full-data setting. The few-shot setting focuses on long-tail relations without sufficient training instances. We evaluate models on FewRel 1.0 (Han et al., 2018) and FewRel 2.0 (Gao et al., 2019b). FewRel 1.0 contains instances with Wikidata facts, and FewRel 2.0 adds a biomedical-domain test set to examine the ability of domain adaptation. In the N-way K-shot setting, models are required to categorize the query as one of the N given relations, each of which has K supporting samples. We choose the state-of-the-art few-shot framework Proto (Snell et al., 2017) with different PLM encoders for evaluation. For the full-data setting, we evaluate models on Wiki80, which contains 80 relation types from Wikidata. We also add 1% and 10% settings, which use only 1% and 10% of the training data, respectively.
As shown in Table 2 and Table 3, on FewRel 1.0 and Wiki80 in the Wikipedia domain, RoBERTa with PELT beats the RoBERTa model by a large margin (e.g., +3.3% on 10-way 1-shot), and it even achieves performance comparable to ERNIE, which has access to the knowledge graph. Our model also gains large improvements on FewRel 2.0 in the biomedical domain (e.g., +7.1% on 10-way 1-shot), while the entity-aware baselines show little improvement in most settings. Compared with most existing entity-aware PLMs, which merely obtain domain-specific knowledge in the pre-training phase, our proposed pluggable entity lookup table can dynamically update the models' knowledge from the out-of-domain corpus on demand.
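For readers unfamiliar with the N-way K-shot protocol, the classification rule of a prototypical network can be sketched briefly. The vectors below are toy stand-ins for PLM-encoded instances, and the relation names and helper functions are hypothetical. The rule itself (mean of the support vectors as prototype, nearest prototype by Euclidean distance) follows Snell et al. (2017).

```python
import math

def prototypes(support):
    """Average the K support vectors of each relation into one prototype."""
    return {
        rel: [sum(col) / len(vecs) for col in zip(*vecs)]
        for rel, vecs in support.items()
    }

def classify(query, protos):
    """Return the relation whose prototype is nearest to the query."""
    return min(protos, key=lambda rel: math.dist(query, protos[rel]))

# A toy 2-way 2-shot episode (illustrative values only).
support = {
    "capital_of": [[1.0, 0.0], [0.8, 0.2]],
    "born_in": [[0.0, 1.0], [0.2, 0.8]],
}
protos = prototypes(support)
prediction = classify([0.9, 0.1], protos)
```

With K = 1 each prototype is a single instance vector, which is why the 1-shot settings are the most sensitive to the quality of the encoder's entity representation.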

Knowledge Probe
We conduct experiments on a widely used knowledge probe dataset, LAMA (Petroni et al., 2019). It applies cloze-style questions to examine PLMs' ability to recall facts from their parameters. For example, given the question template "Paris is the capital of [MASK]", PLMs are required to predict the masked token correctly. In this paper, we not only use Google-RE and T-REx (ElSahar et al., 2018), which focus on factual knowledge, but also evaluate models on LAMA-UHN (Pörner et al., 2020), which filters out easy-to-guess questions.
As shown in Table 4, without any pre-training, the PELT model can directly absorb the entity knowledge from the extended input sequence to recall more factual knowledge, which demonstrates that the entity embeddings we construct are compatible with the original word embeddings. We also find that our method brings large improvements to both BERT and BART on the knowledge probe task, which shows that our method generalizes to PLMs with different architectures. Table 5 shows the P@1 results with respect to entity frequency. While RoBERTa performs worse on rare entities than on frequent entities, PELT brings a substantial improvement on rare entities, i.e., a gain of nearly 3.8 mean P@1 on entities that occur fewer than 50 times.
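The P@1 metric reported here can be sketched as follows: a query counts as a hit when the top-ranked prediction for [MASK] equals the gold answer. The queries and rankings below are invented for illustration, and `precision_at_1` is a hypothetical helper name.

```python
def precision_at_1(ranked_predictions, gold_answers):
    """Percentage of queries whose top-ranked prediction equals the gold answer."""
    hits = sum(
        1 for ranked, gold in zip(ranked_predictions, gold_answers) if ranked[0] == gold
    )
    return 100.0 * hits / len(gold_answers)

# Toy rankings for four cloze queries (illustrative values only).
ranked_predictions = [
    ["France", "Italy"],
    ["Berlin", "Germany"],
    ["Japan", "Tokyo"],
    ["Rome", "Italy"],
]
gold_answers = ["France", "Germany", "Japan", "Italy"]
p_at_1 = precision_at_1(ranked_predictions, gold_answers)
```

In LAMA the score is typically averaged over relations, so a mean P@1 would apply this function per relation and then average the results.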

Conclusion
In this paper, we propose PELT, a flexible entity lookup table, to incorporate up-to-date knowledge into PLMs. By constructing entity embeddings on demand, PLMs with PELT can recall rich factual knowledge to help downstream tasks.

A Heuristic String Matching for Entity Linking
For Wikipedia, we first create a mapping from the anchor texts with hyperlinks to their referent Wikipedia pages. After that, we employ heuristic string matching to link other potential entities to their pages. For preparation, we collect the aliases of each entity from the redirect pages of Wikipedia and the relations between entities from the hyperlinks. Then, we apply spaCy (https://spacy.io/) to recognize the entity names in the text. An entity name in the text may refer to multiple entities with the same alias. We utilize the relations of the linked entity pages to maintain a set of available entity pages for entity disambiguation.

Algorithm 1: Heuristic string matching for entity disambiguation
    S ⇐ {the linked entity pages in anchor text}
    E ⇐ {potential entity names in text}
    repeat
        S ⇐ {the neighbor entity pages that have a hyperlink or Wikidata relation with pages in S}
        E′ ⇐ {e | e ∈ E and e can be uniquely linked to an entity page in S by string matching}
        E ⇐ E − E′
        S ⇐ E′
    until S = ∅

Details of the heuristic string matching are shown in Algorithm 1. We match each entity name to the entity pages surrounding the current page, as close as possible. We will release all the source code and models with the pre-processed Wikipedia dataset.
For other datasets, we adopt simple string matching for entity linking.
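One possible reading of Algorithm 1 is sketched below. The data structures are assumptions of this sketch rather than details given in the paper: `neighbors` maps a page to the pages it is related to, `aliases` maps a page to its known names, and string matching is taken to mean exact alias lookup. The function name `heuristic_link` and the toy pages are likewise hypothetical.

```python
def heuristic_link(anchor_pages, mentions, neighbors, aliases):
    """Iteratively link mentions to pages near the already linked pages."""
    linked = {}
    frontier = set(anchor_pages)   # S in Algorithm 1
    remaining = set(mentions)      # E in Algorithm 1
    while frontier:
        # Expand to the neighbor pages of the current frontier.
        candidates = set()
        for page in frontier:
            candidates |= neighbors.get(page, set())
        # Keep only mentions that match exactly one candidate page.
        newly_linked = {}
        for name in remaining:
            matches = [p for p in candidates if name in aliases.get(p, set())]
            if len(matches) == 1:
                newly_linked[name] = matches[0]
        linked.update(newly_linked)
        remaining -= set(newly_linked)
        # Continue from the pages linked in this round; stop when none were found.
        frontier = set(newly_linked.values())
    return linked

# Toy link graph (illustrative values only).
neighbors = {
    "COVID-19": {"SARS-CoV-2", "World Health Organization"},
    "World Health Organization": {"Geneva"},
}
aliases = {
    "SARS-CoV-2": {"SARS-CoV-2", "coronavirus"},
    "World Health Organization": {"WHO", "World Health Organization"},
    "Geneva": {"Geneva"},
}
links = heuristic_link({"COVID-19"}, {"WHO", "Geneva", "Paris"}, neighbors, aliases)
```

The loop terminates because each round either links at least one remaining mention or produces an empty frontier. Mentions with no nearby matching page, such as "Paris" in the toy example, stay unlinked.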

B Training Configuration
We train all the models with the Adam optimizer (Kingma and Ba, 2015), 10% warm-up steps and a maximum of 128 input tokens. Detailed training hyper-parameters are shown in Table 6.
We run all the experiments with 5 different seeds (42, 43, 44, 45, 46) and report the average score with the standard deviation. In the 1% and 10% settings for Wiki80, we train the model for 10-25 times as many epochs as in the 100% setting.
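Aggregating results over the five seeds can be sketched with the standard library. The scores below are invented placeholders, not results from the paper, and the use of the sample standard deviation is an assumption, since the text does not say which variant is reported.

```python
import statistics

seeds = [42, 43, 44, 45, 46]
# Hypothetical per-seed accuracies, for illustration only.
scores = {42: 70.0, 43: 71.0, 44: 72.0, 45: 73.0, 46: 74.0}

values = [scores[seed] for seed in seeds]
mean_score = statistics.mean(values)
# Sample standard deviation (divides by n - 1); an assumption of this sketch.
std_score = statistics.stdev(values)
summary = f"{mean_score:.1f} ± {std_score:.1f}"
```

With only five runs, the sample and population standard deviations differ noticeably, so it helps to state which one a table reports.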
For FewRel, we search the batch size among [4, 8, 32] and the number of training steps among [1500, 2000, 2500]. We evaluate models every 250 steps on the validation set and save the model with the best performance for testing. With our hyper-parameter tuning, the results of the baselines on FewRel significantly outperform those reported by KEPLER (Wang et al., 2021b