Simultaneously Self-Attending to Text and Entities for Knowledge-Informed Text Representations

Pre-trained language models have emerged as highly successful methods for learning good text representations. However, the amount of structured knowledge retained in such models, and how (if at all) it can be extracted, remains an open question. In this work, we aim to directly learn text representations that leverage structured knowledge about entities mentioned in the text. This can be particularly beneficial for knowledge-intensive downstream tasks. Our approach uses self-attention between words in the text and knowledge graph (KG) entities mentioned in the text. While existing methods require entity-linked data for pre-training, we train using a mention-span masking objective and a candidate ranking objective, which require no entity links and assume only access to an alias table for retrieving candidates, enabling large-scale pre-training. We show that the proposed model learns knowledge-informed text representations that yield improvements over existing methods on downstream tasks.


Introduction
Self-supervised representation learning on large text corpora using language modeling objectives has been shown to yield generalizable representations that improve performance for many downstream tasks. Examples of such approaches include BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019b), XLNET (Yang et al., 2019), GPT-2 (Radford et al., 2019), T5 (Raffel et al., 2019), etc. However, whether such models retain structured knowledge in their representation is still an open question (Petroni et al., 2019; Poerner et al., 2019; Roberts et al., 2020), which has led to active research on knowledge-informed representations (Soares et al., 2019).

* Equal Contribution
Models that learn knowledge-informed representations can be broadly classified into two categories. The first approach augments language model pre-training with the aim of storing structured knowledge in the model parameters. This is typically done by augmenting the pre-training task, for example by masking entity mentions or by enforcing representational similarity in sentences containing the same entities (Soares et al., 2019). While this makes minimal assumptions, it requires memorizing all facts encountered during training in the model parameters, necessitating larger models. The second approach directly conditions the representation on structured knowledge, for example by fusing mention token representations with the mentioned entity's representation.
In this paper we consider the latter approach to learning knowledge-informed representations. Conditioning on relevant knowledge removes the burden on the model parameters to memorize all facts, and allows the model to encode novel facts not seen during training. However, existing methods typically assume access to entity-linked data for training, which is scarce and expensive to annotate, preventing large-scale pre-training. Moreover, these methods do not allow bi-directional attention between the text and the KG when representing text.
We propose a simple approach to incorporate structured knowledge into text representations. This is done using self-attention (Vaswani et al., 2017) to simultaneously attend to tokens in text and candidate KG entities mentioned in the text, in order to learn knowledge-informed representations after multiple layers of self-attention. The model is trained using a combination of a mention-masking objective and a weakly-supervised entity selection objective, which only requires access to an alias table to generate candidate entities and doesn't assume any entity-linked data for training. We show that this objective allows the model to appropriately attend to relevant entities without explicit supervision for the linked entity and learn representations that perform competitively to models trained with entity-linked data.
We make the following contributions: (1) we propose KNowledge-Informed Transformers (KNIT), an approach to learn knowledge-informed text representations which does not require entity-linked data for training, (2) we train KNIT on a large corpus curated from the web with Wikidata as the knowledge graph, (3) we evaluate the approach on multiple entity typing and entity linking tasks and show that it performs competitively with or better than existing methods, yielding large improvements even while using < 1% of task-specific data for fine-tuning.

Related Works
BERT (Devlin et al., 2019) proposed a pre-training approach, called masked language modeling (MLM), which randomly replaces words in a sentence with a special [MASK] token and predicts the original masked tokens. RoBERTa (Liu et al., 2019b) trained a more robust BERT model on larger data. While MLM has been shown to learn general-purpose representations, the amount of factual knowledge stored in such models is limited (Petroni et al., 2019; Poerner et al., 2019). Prior work proposes a mention-masking objective, which masks mentions of entities in a sentence, as opposed to random words, as a way of incorporating entity information into such models. Other approaches use entity-linked data and infuse representations of the linked entity into the representations of the corresponding entity mention at the final layer of the model. KnowBERT (Peters et al., 2019) learns an integrated entity linker that infuses entity representations into the word embedding input of the model and also relies on entity-linked data for training. K-BERT (Liu et al., 2019a) uses linked triples about entities in a sentence to inject knowledge. KGLM proposed a fact-aware language model that selects and copies facts from a KG for generation. Recently, Févry et al. (2020) introduced Entities as Experts (EAE), a masked language model coupled with an entity memory network. EAE learns to predict entity spans, retrieves relevant entity memories, and integrates them back into the Transformer layers. It also assumes entity-linked data for training.

Knowledge-Informed Transformers (KNIT)
In this section, we describe the KNIT model as well as its training procedure. KNIT makes use of the mention-masking objective for training and conditions the encoder on both text as well as mentioned entities but does not assume any entity-linked data for training. Figure 1 shows the overall model.

Text and Entity Encoder
The input consists of a sentence along with candidate entities for the sentence. We first run a named entity extraction model on the sentence to extract mentions and then generate candidate entities based on cross-wikis (Ganea and Hofmann, 2017). We use a Wikipedia alias table for generating candidates, taken from Raiman and Raiman (2018). The start and end of mentions are demarcated using special tokens [m] and [/m]. Given the text sequence {x_1, ..., x_n} and the set of associated candidate entities for the sequence {e_1, ..., e_m}, we first embed the words and entities as vector embeddings. For entities, we use pre-trained KG embeddings (Lerer et al., 2019) and add a projection layer to upscale the entity embedding to the word embedding size. We use Transformer self-attention (Vaswani et al., 2017) to encode both the text and the entities. Since self-attention has no notion of position in the sequence, it is common to add a position embedding (Devlin et al., 2019) to the word embeddings; we follow this approach for the words. However, since the entities in the candidate set need to be encoded in a position-independent manner, we do not add any position embeddings to them. The entire sequence, position-dependent word embeddings followed by position-independent candidate embeddings, is passed through multiple layers of self-attention. The end result is contextualized token embeddings conditioned on the entities, {x̃_1, ..., x̃_n}, as well as candidate entity embeddings conditioned on the text, {ẽ_1, ..., ẽ_m}.
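The input construction above can be sketched as follows. This is a minimal NumPy sketch with made-up dimensions; all variable names and sizes are hypothetical, not the paper's actual implementation. Position embeddings are added to the word embeddings only, candidate entity embeddings are projected up to the word embedding size, and the two are concatenated into a single sequence for the self-attention stack.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 8                 # shared embedding size (hypothetical)
n_words, n_ents = 6, 3
d_ent = 4             # raw KG entity embedding size (200 in the paper)

word_emb = rng.normal(size=(n_words, d))      # token embeddings
pos_emb = rng.normal(size=(n_words, d))       # learned position embeddings
ent_emb = rng.normal(size=(n_ents, d_ent))    # pre-trained KG embeddings
W_proj = rng.normal(size=(d_ent, d))          # projection up to word size

# Words are position-dependent; candidate entities must stay
# position-independent, so they get no position embedding.
words = word_emb + pos_emb
ents = ent_emb @ W_proj

# A single sequence of length n + m is fed through the self-attention stack.
seq = np.concatenate([words, ents], axis=0)
assert seq.shape == (n_words + n_ents, d)
```

After several self-attention layers over this joint sequence, the first n positions yield the entity-conditioned token embeddings and the last m positions the text-conditioned candidate embeddings.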

Training
Mention-masking While the approach described above has the potential to learn knowledge-conditioned text representations, it needs the right pre-training objective to learn to use the extra information from the entities. Since large Transformer models (Devlin et al., 2019) have many parameters, they can be highly accurate at predicting random word tokens, so directly using an MLM objective for training will not work: the model can simply ignore the entity embeddings. However, we find that, due to a lack of factual knowledge, these models are not very good at predicting tokens of entity mentions. Table 1 shows this for the RoBERTa (Liu et al., 2019b) model. Thus mention-masking, i.e., predicting tokens of masked entity mentions, provides a better objective for learning to use the candidate entities and thereby learning knowledge-informed representations. Note that in Table 1, even when RoBERTa is trained with mention-masking (+MM), it is unable to achieve high accuracy on predicting mention tokens. Thus, including entity embeddings should provide enough context for the model to make correct predictions by using the entities, as reflected by the KNIT score in Table 1.
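The data preparation for mention-masking can be sketched as below. This is an illustrative helper, not the paper's actual tokenization or span-detection code: instead of masking random tokens as in MLM, every token inside a detected mention span is replaced by the mask token, and the model is trained to predict the originals.

```python
def mask_mentions(tokens, mention_spans, mask_token="[MASK]"):
    """Mask all tokens inside (start, end) mention spans (end exclusive).

    Returns the masked token list and a map from masked position to the
    original token, which serves as the prediction target.
    """
    masked = list(tokens)
    targets = {}
    for start, end in mention_spans:
        for i in range(start, end):
            targets[i] = masked[i]
            masked[i] = mask_token
    return masked, targets

tokens = ["the", "capital", "of", "France", "is", "Paris"]
spans = [(3, 4), (5, 6)]  # mentions: "France", "Paris"
masked, targets = mask_mentions(tokens, spans)
# masked  -> ["the", "capital", "of", "[MASK]", "is", "[MASK]"]
# targets -> {3: "France", 5: "Paris"}
```

Predicting "Paris" here requires factual knowledge rather than syntax, which is why mention-masking pressures the model to attend to the candidate entity embeddings.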

Candidate Ranking
To further enable the model to use the correct entities for a mention, we use a weak entity linking objective that forces the model to rank one of the entities from the candidate set of a mention higher than all other entities for the sentence. Consider the i-th mention in a sentence, with (m_i1, m_i2) as the start and end indices of the mention in the sentence and a candidate set of entities C_i for this mention. We create a mention representation m̃_i by concatenating x̃_{m_i1} and x̃_{m_i2}. Given the representations, we score every entity j for the mention i:

    s_ij = m̃_i^T W ẽ_j,    (1)

where W is a learnable weight matrix. To enforce that the model selects one entity from the mention's candidates, we find the highest-scoring candidate, ê_i = argmax_{j ∈ C_i} s_ij, and use it as the target in a cross-entropy loss:

    L_CR = Σ_i CE(softmax_j(s_ij), I_{ê_i}),

where the softmax is over all entities in the sentence (not just the candidates of mention i) and I_{ê_i} is a one-hot vector with 1 for the entity ê_i and 0 everywhere else. This objective enforces that the model ranks one candidate higher than the other candidates for the same mention as well as the candidates of other mentions. A similar objective has been explored for dealing with noise in entity typing models (Xu and Barbosa, 2018; Abhishek et al., 2017). The overall objective is a combination of BERT-style MLM, mention-masking (MM), and candidate ranking:

    L = L_MLM + α L_MM + β L_CR.    (2)
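The scoring and loss above can be sketched as follows. This is a NumPy sketch under assumed shapes: the bilinear score s_ij = m̃_i^T W ẽ_j follows the definition in the text, while the dimensions, candidate set, and variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_ents = 8, 5                      # token size and entity count (hypothetical)

m_i = rng.normal(size=(2 * d,))       # mention rep: [x_start ; x_end] concatenated
E = rng.normal(size=(n_ents, d))      # contextualized reps of ALL entities in sentence
W = rng.normal(size=(2 * d, d))       # learnable weight matrix

# s_ij = m_i^T W e_j, computed for every entity j in the sentence.
scores = E @ (W.T @ m_i)

# The mention's own candidate set C_i (indices into the entity list).
cand_idx = [0, 1, 2]

# Weak supervision: the highest-scoring candidate becomes the target.
target = cand_idx[int(np.argmax(scores[cand_idx]))]

# Cross-entropy over all entities in the sentence, one-hot at the target.
log_probs = scores - np.log(np.sum(np.exp(scores)))
loss = -log_probs[target]
```

Because the softmax normalizes over every entity in the sentence, minimizing this loss pushes the chosen candidate above both its sibling candidates and the candidates of other mentions.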

Experiments
Implementation details are in the Supplementary Material. Code for our models is available here¹.

[Table 3: F1 score on entity typing when using only a fraction of the task-specific training data (0.05%-4%).]

Results on Entity Typing
Entity typing is the task of identifying the semantic type of a given mention. We evaluate on two entity typing datasets: OpenEntity (Choi et al., 2018) and FIGER (Ling et al., 2015). OpenEntity is a crowdsourced dataset comprising 9 general types and 121 fine-grained types. We follow the evaluation setup of prior work. KNIT performs competitively with the state of the art without utilizing any entity-linked data for pre-training, unlike existing methods.

[Table: Entity linking accuracy. (2018): 94.88; Radhakrishnan et al. (2018): 93.00; Le and Titov (2018): 93.07; Ganea and Hofmann (2017): 92.22; KNIT: 92.87.]
To further evaluate the effectiveness of KNIT, we consider the scenario where only a fraction of the data is used for task-specific fine-tuning. For this, we sample an equal number of examples per type to create the fine-tuning data. The models are fine-tuned using the sampled data but are evaluated on the entire test set. Table 3 shows that KNIT significantly outperforms RoBERTa (Liu et al., 2019b) and RoBERTa+MM in the data-constrained cases.

Results on Entity Linking
We demonstrate that our pre-trained model can capture entity linking information. For this, we use the AIDA-CoNLL (Hoffart et al., 2011) dataset and evaluate the linking performance of the model without any dataset-specific fine-tuning. We also compare with a model that uses Wikipedia hyperlinks for supervision during pre-training (KNIT+Wikilinks). As shown in Table 4, KNIT improves upon candidate ranking by 12.05%, and by 19.66% when partial entity linking supervision from Wiki-linked text data is available. Even without Wiki-linked data, it outperforms the best pre-trained model that considers mention context (RELIC) by 0.81%. To further explore the entity linking capacity of our model, we fine-tune the model and show that it achieves competitive performance even when using only 10% of the training data. When trained on the entire dataset, we find that RELIC performs better, potentially due to its use of entity-linked data during pre-training.

Conclusion
We propose a simple approach to learn knowledge-informed text representations using self-attention between text and mentioned entities. Our approach does not rely on any entity-linked data for training, enabling large-scale pre-training. We show that the method learns better representations than competing approaches and also learns entity linking without explicit linking supervision. In the future, it will be interesting to explore how such methods can be used to condition the text encoder on structured KG facts about entities.

Acknowledgments
We thank members of the UMass IESL and NLP groups for helpful discussion and feedback. We also thank DiffBot for their support in collecting the linked-text data. This work is funded in part by the Center for Data Science and the Center for Intelligent Information Retrieval. The work reported here was performed in part using high performance computing equipment obtained under a grant from the Collaborative R&D Fund managed by the Massachusetts Technology Collaborative. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor.
Abhishek Abhishek, Ashish Anand, and Amit Awekar. 2017. Fine-grained entity type classification by jointly learning representations and label embeddings.

A.1 Implementation Details (Pre-training)
To train KNIT, we collect 16M sentences from Wikipedia. We also collect 28M sentences from news articles and tag them using the DiffBot Entity Linker². We further reduce the size of the entity vocabulary to 595K and remove examples that have no entity mentions. We limit each context sentence to 512 tokens, with no more than 5 mentions per sentence and at least 2 and at most 10 candidate entities per mention span. We use pre-trained entity embeddings with dimension d = 200 from Lerer et al. (2019) and keep them fixed during the course of KNIT training. We use the Adam optimizer with learning rate 1e-4, a polynomial decay scheduler with warm-up, and gradient norm clipping at 10. We also tune the hyper-parameters in Equation (2) and choose α = 1 and β = 10. The code will be made available on GitHub³.
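A polynomial-decay schedule with linear warm-up can be sketched as below. This is an illustrative sketch, not the authors' exact scheduler; the warm-up length, total step count, and decay power are made-up defaults.

```python
def poly_decay_lr(step, base_lr=1e-4, warmup=1000, total=100000, power=1.0):
    """Linear warm-up to base_lr, then polynomial decay to zero at `total`."""
    if step < warmup:
        return base_lr * step / warmup
    frac = (total - step) / (total - warmup)
    return base_lr * max(frac, 0.0) ** power
```

With power = 1.0 this reduces to the common linear-decay-with-warmup schedule used for BERT-style training.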

A.2 Implementation Details (Entity Typing)
All results in Tables 2-3 are obtained by tuning a few hyperparameters: batch size, learning rate, dropout, and attention dropout. Batch size was tuned in the range 16-64. Learning rate was tuned in the range 0.00001-0.0005. All dropouts were tuned sparsely in the range 0.1-0.3. During fine-tuning, we restrict the maximum number of candidates per mention to 10. Unlike pre-training, the entity embeddings were also fine-tuned during the entity typing experiments, and the best-performing validation set checkpoint was used to generate test set results. Sample dataset creation for the experiments of Table 3 was done using random seeds. Three different sample datasets were collected for each of OpenEntity (4%), FIGER (0.5%), and FIGER (0.05%). Each sample comprised an equal number of examples per entity type, randomized across the three runs. Numbers reported in Table 3 correspond to the mean and standard deviation of the test set performance of the models trained on the three sample datasets.

A.2.1 Datasets
The sizes of sample and original datasets are shown in Table 5.