Robust Retrieval Augmented Generation for Zero-shot Slot Filling

Automatically inducing high quality knowledge graphs from a given collection of documents still remains a challenging problem in AI. One way to make headway on this problem is through advancements in a related task known as slot filling. In this task, given an entity query in the form of [Entity, Slot, ?], a system is asked to ‘fill’ the slot by generating or extracting the missing value, exploiting evidence extracted from relevant passage(s) in the given document collection. Recent works try to solve this task in an end-to-end fashion using retrieval-based language models. In this paper, we present a novel approach to zero-shot slot filling that extends dense passage retrieval with hard negatives and robust training procedures for retrieval augmented generation models. Our model reports large improvements on both the T-REx and zsRE slot filling datasets, improving both passage retrieval and slot value generation, and ranking at the top-1 position on the KILT leaderboard. Moreover, we demonstrate the robustness of our system, showing its domain adaptation capability on a new variant of the TACRED dataset for slot filling, through a combination of zero/few-shot learning. We release the source code and pre-trained models.


Introduction
Slot filling is a sub-task of Knowledge Base Population (KBP), where the goal is to recognize a pre-determined set of relations for a given entity and use them to populate infobox-like structures. This can be done by exploring the occurrences of the input entity in the corpus and gathering information about its slot fillers from the context in which it is located. A slot filling system processes and indexes a corpus of documents. Then, when prompted with an entity and a number of relations, it fills out an infobox for the entity. Some slot filling systems provide evidence text to explain the predictions. Figure 1 illustrates the slot filling task.
Our source code is available at: https://github.com/IBM/kgi-slot-filling
Many KBP systems described in the literature commonly involve complex pipelines for named entity recognition, entity co-reference resolution and relation extraction (Ellis et al., 2015). In particular, the task of extracting relations between entities from text has been shown to be the weakest component of the chain. The community proposed different solutions to improve relation extraction performance, such as rule-based (Angeli et al., 2015), supervised (Zhang et al., 2017), or distantly supervised (Glass et al., 2018). However, all these approaches require a considerable human effort in creating hand-crafted rules, annotating training data, or building well-curated datasets for bootstrapping relation classifiers.
Recently, pre-trained language models have been used for slot filling, opening a new research direction that might provide an effective solution to the aforementioned problems. In particular, the KILT benchmark standardizes two zero-shot slot filling tasks, zsRE (Levy et al., 2017) and T-REx (Elsahar et al., 2018), providing a competitive evaluation framework to drive advancements in slot filling. However, the performance achieved by current retrieval-based models on the two slot filling tasks in KILT is still not satisfactory. This is mainly due to poor retrieval performance, which affects the generation of the filler as well.
In this work, we propose KGI (Knowledge Graph Induction), a robust system for slot filling based on advanced training strategies for both Dense Passage Retrieval (DPR) and Retrieval Augmented Generation (RAG). It shows large gains on both the T-REx (+38.24% KILT-F1) and zsRE (+21.25% KILT-F1) datasets compared to previously submitted systems. We extend the training strategies of DPR with hard negative mining (Simo-Serra et al., 2015), demonstrating its importance in training the context encoder.
In addition, we explore the idea of adapting KGI to a new domain. The domain adaptation process consists of indexing the new corpus using our pre-trained DPR and substituting it in place of the original Wikipedia index. This enables zero-shot slot filling on the new dataset with respect to a new schema, avoiding the additional effort needed to rebuild NLP pipelines. We provide a few additional examples for each new relation, showing that zero-shot performance quickly improves with a few-shot learning setup. We explore this approach on a variant of the TACRED dataset (Alt et al., 2020) that we specifically introduce to evaluate the zero/few-shot slot filling task for domain adaptation.
The contributions of this work are as follows: 1. We describe an end-to-end solution for slot filling, called KGI, that improves the state-of-the-art in the KILT slot filling benchmarks by a large margin.
2. We demonstrate the effectiveness of hard negative mining for DPR when combined with end-to-end training for slot filling tasks.
3. We evaluate the domain adaptation of KGI using zero/few-shot slot filling, demonstrating its robustness on zero-shot TACRED, a benchmark released with this paper.
4. We publicly release the pre-trained models and source code of the KGI system.
Section 2 presents an overview of the state of the art in slot filling. Section 3 describes our KGI system, providing details on the DPR and RAG models and describing our novel approach to hard negatives. Our system is evaluated in Sections 4 and 5, which include a detailed analysis. Section 6 concludes the paper and highlights some interesting directions for future work.

Related Work
The use of language models as sources of knowledge (Petroni et al., 2019; Roberts et al., 2020; Wang et al., 2020) has opened tasks such as zero-shot slot filling to pre-trained transformers. Furthermore, the introduction of retrieval augmented language models such as RAG (Lewis et al., 2020b) and REALM (Guu et al., 2020) also permits providing textual provenance for the generated slot fillers.
KILT was introduced with a number of baseline approaches. The best performing of these is RAG (Lewis et al., 2020b). The model incorporates DPR (Karpukhin et al., 2020) to first gather evidence passages for the query, then uses a model initialized from BART (Lewis et al., 2020a) to do sequence-to-sequence generation from each evidence passage concatenated with the query, in order to generate the answer. In the baseline RAG approach, only the query encoder and generation component are fine-tuned on the task. The passage encoder, trained on Natural Questions (Kwiatkowski et al., 2019), is held fixed. Interestingly, while it gives the best performance of the baselines tested on the task of producing slot fillers, its performance on the retrieval metrics is worse than BM25. This suggests that fine-tuning the entire retrieval component could be beneficial. Another baseline in KILT is BART LARGE fine-tuned on the slot filling tasks, but without the usage of the retrieval model.
In an effort to improve the retrieval performance, Multi-task DPR (Maillard et al., 2021) used the multi-task training of the KILT suite of benchmarks to train the DPR passage and query encoder. The top-3 passages returned by the resulting passage index were then combined into a single sequence with the query and a BART model was used to produce the answer. This resulted in large gains in retrieval performance.
DensePhrases (Lee et al., 2021) is a different approach to knowledge intensive tasks with a short answer. Rather than indexing passages which are then consumed by a reader or generator component, it indexes the phrases in the corpus that can be potential answers to questions, or fillers for slots. Each phrase is represented by the pair of its start and end token vectors from the final layer of a transformer initialized from SpanBERT (Joshi et al., 2020). GENRE (Cao et al., 2021) addresses the retrieval task in KILT slot filling by using a sequence-to-sequence transformer to generate the title of the Wikipedia page where the answer can be found. This method can produce excellent scores for retrieval, but it does not address the problem of producing the slot filler. It is trained on BLINK and all KILT tasks jointly.
Open Retrieval Question Answering (ORQA) introduced neural information retrieval for the related task of factoid question answering. Like DPR, the retrieval is based on a bi-encoder BERT model. Unlike DPR, ORQA projects the BERT [CLS] vector to a lower dimensional (128) space. It also uses the inverse cloze pre-training task for retrieval, while DPR does not use retrieval-specific pre-training.


KGI Approach

Figure 2 shows KGI, our approach to zero-shot slot filling, combining a DPR model and a RAG model, both trained for slot filling. We initialize our models from the Natural Questions (Kwiatkowski et al., 2019) trained models for DPR and RAG available from Hugging Face (Wolf et al., 2020) (https://github.com/huggingface/transformers). We then employ a two-phase training procedure: first we train the DPR model, i.e. both the query and context encoder, using the KILT provenance ground truth. Then we train the sequence-to-sequence generation and further train the query encoder using only the target tail entity as the objective. It is important to note that the same query encoder component is trained in both phases.

DPR for Slot Filling
Our approach to DPR training for slot filling is an adaptation of the question answering training in the original DPR work (Karpukhin et al., 2020). We first index the passages using a traditional keyword search engine, Anserini. The head entity and the relation are used as a keyword query to find the top-k passages by BM25. Passages with paragraphs overlapping the ground truth are excluded, as are passages that contain a correct answer. The remaining top ranked result is used as a hard negative for DPR training. This is the hard negative mining strategy used by DPR (Karpukhin et al., 2020) and Multi-DPR (Maillard et al., 2021). After locating a hard negative for each query, the DPR training data is a set of triples: query, positive passage (given by the KILT ground truth provenance) and the hard negative passage. Figure 3 shows the training process for DPR. For each batch of training triples, we encode the queries and passages independently. The passage and query encoders are BERT models. Then we find the inner product of all queries with all passages. The negatives for a given query are therefore the hard negative and the batch negatives, i.e. the positive and hard negative passages for other queries in the batch. After applying a softmax to the score vector for each query, the loss is the negative log-likelihood for the positive passages.
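This in-batch objective can be sketched as follows, assuming precomputed encoder outputs. This is a minimal numpy illustration; the function name, shapes and batch layout are ours, not the actual implementation.

```python
import numpy as np

def dpr_loss(query_vecs, pos_vecs, hard_neg_vecs):
    """In-batch negative log-likelihood for DPR-style training.

    query_vecs:    (B, d) encoded queries
    pos_vecs:      (B, d) encoded positive passages (KILT provenance)
    hard_neg_vecs: (B, d) encoded mined hard negatives
    """
    # Candidate pool for every query: all positives plus all hard negatives,
    # so each query sees its hard negative and the batch negatives.
    passages = np.concatenate([pos_vecs, hard_neg_vecs], axis=0)   # (2B, d)
    scores = query_vecs @ passages.T                               # (B, 2B)
    scores -= scores.max(axis=1, keepdims=True)                    # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    idx = np.arange(len(query_vecs))
    # The correct passage for query i is column i (its own positive).
    return -log_probs[idx, idx].mean()
```

Aligning a query with its positive lowers the loss relative to a query whose positive is orthogonal, which is the signal that trains both encoders.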
Using the trained DPR passage encoder we generate vectors for the approximately 32 million passages in our segmentation of the KILT knowledge source. Though this is a computationally expensive step, it is easily parallelized. The passage vectors are then indexed with an ANN (Approximate Nearest Neighbors) data structure, in this case HNSW (Hierarchical Navigable Small World) (Malkov and Yashunin, 2018), using the open source FAISS library (Johnson et al., 2017). We use scalar quantization down to 8 bits to reduce the memory size.
The query encoder is also trained for slot filling alongside the passage encoder. We inject the trained query encoder into the RAG model for Natural Questions. Due to the loose coupling between the query encoder and the sequence-to-sequence generation of RAG, we can update the pre-trained model's query encoder without disrupting the quality of the generation.
Unlike previous work on zero-shot slot filling, we train the DPR model specifically for the slot filling task. In contrast, the RAG baseline used DPR pre-trained on Natural Questions, and Multi-DPR (Maillard et al., 2021) trained on all KILT tasks jointly.


RAG for Slot Filling

Figure 4 illustrates the architecture of RAG (Lewis et al., 2020b). The RAG model is trained to predict the ground truth tail entity from the head and relation query. First the query is encoded to a vector and the top-k (we use k = 5) relevant passages are retrieved from the ANN index. The query is concatenated to each passage and the generator predicts a probability distribution over the possible next tokens for each sequence. These predictions are weighted according to the score between the query and passage, i.e. the inner product of the query vector and passage vector. Marginalization then combines the weighted probability distributions to give a single probability distribution for the next token. This enables RAG to train the query encoder through its impact on generation, learning to give higher weight to passages that contribute to generating the correct tokens. Formally, the inputs to the BART model are sequences s_j = p_j [SEP] q that comprise a query q plus a retrieved passage p_j. The probability for each sequence is determined from the softmax over the retrieval scores z_r for the passages. The probability for each output token t_i given the sequence s_j is a softmax over BART's token prediction logits. The total probability for each token t_i is therefore the sum over all sequences of the token probability weighted by the sequence probability.

Beam search is used at inference time to select the overall most likely tail entity. This is the standard beam search for natural language generation in deep neural networks (Sutskever et al., 2014); the only difference is in the way the next-token probabilities are obtained.
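The marginalization step can be sketched numerically. The example below is illustrative only: it assumes the per-passage next-token distributions have already been computed, whereas the real model marginalizes over BART's logits at every decoding step.

```python
import numpy as np

def marginal_next_token(retrieval_scores, per_passage_token_probs):
    """RAG-style marginalization over retrieved passages.

    retrieval_scores:        (k,)   inner products of query and passage vectors
    per_passage_token_probs: (k, V) next-token distribution for each
                             query-plus-passage sequence
    Returns a single (V,) distribution over the vocabulary.
    """
    z = retrieval_scores - retrieval_scores.max()
    weights = np.exp(z) / np.exp(z).sum()          # softmax over passage scores
    return weights @ per_passage_token_probs       # weighted sum of distributions
```

Because the passage weights come from the query-passage inner products, gradients flow back into the query encoder through this weighting, which is how RAG fine-tunes retrieval with only the tail entity as supervision.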

Dense Negative Sampling
As Figure 2 shows, the DPR question encoder is trained both by DPR and later by RAG. To examine the influence of this additional training from RAG on the retrieval performance, we compare retrieval metrics before and after RAG fine-tuning. Table 1 shows the large gains from training with RAG after DPR. Note that RAG training uses the weak supervision of the passage's impact in producing the correct answer, rather than the ground truth provenance used in DPR training. Since this is likely a disadvantage, we explore the other key difference between DPR and RAG training: RAG uses negatives drawn from the trained index rather than from BM25. To replicate this feature of RAG in DPR, we introduce hard negatives mined from the learned index. Using the KILT trained DPR models, we index the passages. Then we gather hard negatives for DPR training as before, with one difference: rather than locating the hard negative passages by BM25, we find them by ANN search over the learned dense vector index. We train for an additional two epochs using these hard negatives. Table 1 shows the performance of the different approaches to retrieval; DPR_NQ is the DPR model pre-trained on Natural Questions.

After training with DNS (Dense Negative Sampling), the FAISS indexing with scalar quantization becomes prohibitively slow. We therefore remove all quantization and use four shards (the index is split into four, with the results of each query merged) for our experiments with DNS-enabled KGI.


KILT Experiments

Table 2 gives statistics on the two zero-shot slot filling datasets in KILT. While the T-REx dataset is larger by far in the number of instances, the training sets have a similar number of distinct relations. We use only 500k training instances of T-REx in our experiments to increase the speed of experimentation.
Since the transformers for passage encoding and generation can accept a limited sequence length, we segment the documents of the KILT knowledge source (2019/08/01 Wikipedia snapshot) into passages. The ground truth provenance for the slot filling tasks is at the granularity of paragraphs, so we align our passage segmentation on paragraph boundaries when possible. If two or more paragraphs are short enough to be combined, we combine them into a single passage, and if a single paragraph is too long, we truncate it.
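A minimal sketch of this paragraph-aligned segmentation, under the simplifying assumptions of whitespace tokenization and an illustrative length limit (the actual tokenizer and limit differ):

```python
def segment_document(paragraphs, max_tokens=100):
    """Greedy paragraph-aligned segmentation: merge short paragraphs into
    one passage, truncate overlong paragraphs, and keep passage boundaries
    on paragraph boundaries whenever possible."""
    passages, current = [], []
    for para in paragraphs:
        tokens = para.split()
        if len(tokens) > max_tokens:
            # Overlong paragraph: flush what we have, then truncate it.
            if current:
                passages.append(" ".join(current))
                current = []
            passages.append(" ".join(tokens[:max_tokens]))
        elif len(current) + len(tokens) <= max_tokens:
            current.extend(tokens)            # short paragraphs are merged
        else:
            passages.append(" ".join(current))
            current = list(tokens)
    if current:
        passages.append(" ".join(current))
    return passages
```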

KGI Hyperparameters
We have not done hyperparameter tuning, instead using hyperparameters similar to those of the original works. Table 3 shows the hyperparameters used in our experiments. We train our models on T-REx using only the first 500k instances. For KGI 1 we use the same hyperparameters, except that zsRE is trained for two epochs.
In both KGI systems we use the default of five passages retrieved for each query for use in RAG.

Model Details
Number of parameters KGI is based on RAG and has the same number of parameters: 2 × 110M for the BERT BASE query and passage encoders and 400M for the BART LARGE sequence-to-sequence generation component: 620M in total.
Computing infrastructure Using a single NVIDIA V100 GPU, DPR training of two epochs takes approximately 24 hours for T-REx and 2 hours for zsRE. Using a single NVIDIA P100 GPU, RAG training for 500k T-REx instances takes two days, and 147k instances of zsRE takes 15 hours. The FAISS index on the KILT knowledge source requires a machine with large memory; we use 256GB of memory, as 128GB is insufficient for the indexes without scalar quantization.

Slot Filling Evaluation
As an initial experiment we tried RAG with its default index of Wikipedia, distributed through Hugging Face. We refer to this as RAG-KKS, i.e. RAG without the KILT Knowledge Source, as reported in Table 4. Since the passages returned are not aligned to the KILT provenance ground truth, we do not report retrieval metrics for this experiment. Motivated by the low retrieval performance reported for the RAG baseline, we experimented with replacing the DPR retrieval with simple BM25 (RAG+BM25) over the KILT knowledge source. We provide the raw BM25 scores for the passages to the RAG model, to weight their impact in generation. We also experimented with the Natural Questions trained DPR. Finally, we use the approach explained in Section 3 to train both the DPR and RAG models: KGI 0 is a version of our system using DPR with hard negative samples from BM25, while its successor, KGI 1, incorporates DPR training using DNS.
The metrics we report include accuracy and F1 on the slot filler, where F1 is based on the recall and precision of the tokens in the answer, allowing for partial credit on slot fillers. Our systems, except for RAG-KKS, also provide provenance information for the top answer. R-Precision and Recall@5 measure the quality of this provenance against the KILT ground truth provenance. Finally, KILT-Accuracy and KILT-F1 are combined metrics that measure the accuracy and F1 of the slot filler only when the correct provenance is provided. Table 4 reports an evaluation on the development set, while Table 5 reports the test set performance of the top systems on the KILT leaderboard. KGI 0 and KGI 1 are our systems, while DensePhrases, GENRE, Multi-DPR, RAG for KILT and BART LARGE are explained briefly in Section 2. KGI 1 gains dramatically in slot filling accuracy over the previous best systems, with gains of over 14 percentage points in zsRE and even more in T-REx. The combined metrics of KILT-AC and KILT-F1 show even larger gains, suggesting that the KGI 1 approach is effective at providing justifying evidence when generating the correct answer. We achieve gains of 21 to 41 percentage points in KILT-AC.
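The token-level F1 used above can be computed as below. This is a simplified sketch: the official KILT scorer additionally normalizes answers (casing, punctuation, articles), which we omit here.

```python
from collections import Counter

def token_f1(prediction, gold):
    """Token-overlap F1 between a predicted and a gold slot filler,
    allowing partial credit for partially correct answers."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Under the combined KILT-AC/KILT-F1 metrics, this score is credited only when the top provenance passage matches the ground truth, otherwise the instance scores zero.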
Relative to Multi-DPR, we see the benefit of weighting passage importance by retrieval score and marginalizing over multiple generations, compared to the strategy of concatenating the top three passages and running a single sequence-to-sequence generation. GENRE is still best in retrieval for T-REx, suggesting that, at least for a corpus such as Wikipedia, generating the title of the page can be very effective. A possible explanation for this behaviour is that most relations for a Wikipedia entity are mentioned in its corresponding page.

Analysis
To explore the effect of retrieval on downstream performance we consider two variants of our systems: one using random passages from the index, forcing the system to depend on implicit knowledge, and another using passages from the ground truth provenance, to measure the upper bound performance for an ideal retrieval system. The evaluation is reported in Table 6 for three systems. By supplying these systems with the gold standard passages, we can see both the improvement possible through better retrieval and the value of good retrieval during training. The best system, KGI 1, is the most effective at generating slot fillers from relevant explicit knowledge because it was trained on more cases of justifying explicit knowledge. However, given random passages it is the worst. It has sacrificed some implicit knowledge for better capabilities in using explicit knowledge.
As shown in Table 5, BART LARGE, which is the best implicit-knowledge baseline system for KILT slot filling, is approximately 40 points lower in accuracy on T-REx compared to KGI 1. To understand the impact of the explicit knowledge provided by DPR, we examine the improvement of KGI over BART LARGE. We consider two main hypotheses: 1) the value of explicit knowledge depends on the relation, and 2) the value of explicit knowledge depends on the corpus frequency of the entities related.
To evaluate hypothesis 1, we consider the most frequent 20 relations in the T-REx Dev set, each occurring at least 40 times. The relations with the lowest relative performance gain are taxonomy and partonomy relations: TAXON-RANK, SUBCLASS-OF, INSTANCE-OF, PART-OF and PARENT-TAXON, as well as LANGUAGES-SPOKEN,-WRITTEN-OR-SIGNED and SPORT. This suggests that essential properties of entities are well encoded in the language model itself. Inspecting LANGUAGES-SPOKEN,-WRITTEN-OR-SIGNED, we find that surface level information (i.e. French name vs. Russian name) is often sufficient for the correct prediction.

Table 6: T-REx Accuracy with Random and Gold Retrieval
In contrast, the relations that gain the most from explicit knowledge are: PERFORMER, MEMBER-OF-SPORTS-TEAM, AUTHOR, PLACE-OF-BIRTH, COUNTRY-OF-ORIGIN, CAST-MEMBER and DIRECTOR. These relations are not central to the meaning of the head entity, unlike the taxonomy and partonomy relations, and are not typically predictable from surface-level features.
Regarding our second hypothesis, we might expect that more frequent entities have better representations in the parameters of a pre-trained language model, and that therefore the gain in performance due to use of explicit knowledge will show a strong dependence on the corpus frequency of the head or tail entity.
To test it, we group the Dev instances in T-REx according to the decile of the head or tail entity frequency. We compute a macro-accuracy, weighting all relations equally. Figure 5 shows the macro-accuracy of BART LARGE and KGI 1 for each decile of head and tail entity frequency. Although there is a general trend of higher accuracy for more frequent tail entities and lower accuracy for more frequent head entities, there is no pattern relating the gain of explicit knowledge over implicit knowledge to entity frequency. The picture is similar when considering the decile of the minimum of the head or tail entity frequency. This falsifies our second hypothesis and suggests implicit knowledge is distinct in kind from explicit knowledge, rather than merely under-trained for low frequency entities.

Domain Adaptation Experiments
In this section, we evaluate the domain adaptation capability of KGI. For this purpose, we re-organize a dataset specifically designed to evaluate standard supervised relation extraction models, TACRED, with the aim of creating a zero-shot (and few-shot) slot filling benchmark where the documents are written in a different style than Wikipedia, and the relations in the KG are different from those in Wikidata. In order to perform an in-depth comparison and analysis, we also propose a new set of ranking baselines and use metrics which are suitable to better evaluate the slot filling task in a zero-shot setup.

Zero-shot TACRED
The TACRED dataset was originally proposed by Zhang et al. (2017) with the goal of providing a high-quality training set to supervise a relation extraction model, shown to be competitive on TAC-KBP 2015 (Ellis et al., 2015). The target KG schema consists of two infoboxes modeling the person and organization entity types, with 41 relation types in total. For our experiments, we adopt a revisited version of TACRED (Alt et al., 2020), in which a second stage of crowdsourcing is performed to further improve the quality of the annotations and resolve conflicts among relations. In a typical supervised relation extraction setup, a model is trained to predict (i.e. classify) the right relation type given a textual passage and two entity mentions as inputs. In this paper we use the TACRED dataset as a slot filling benchmark, following this procedure: 1) we first create the corpus by merging all the plain textual passages from the instances in the train, dev and test sets; 2) we collect the annotated triples, i.e. subject-relation-object, from the test data to come up with a ground-truth KG to be used for slot filling evaluation; 3) we remove all the triples from the original test set where the subjects are pronouns. The resulting KG consists of 2673 slot filling test instances. Similarly, we acquire a KG from the train/dev sets to further fine-tune the KGI system as described in the next section. To enable zero-shot experiments, we also convert each relation label into a relation phrase by removing the namespaces per: and org:, and replacing the '_' character with a space. Finally, for each pre-annotated entity in the corpus, we pre-compute an inverted index consisting of a list of co-occurring entities in the textual passages. We use this inverted index to compare our model with a set of ranking baselines.
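The relation-label conversion above is simple string manipulation; a sketch (our own helper, not the released code):

```python
def relation_phrase(label):
    """Convert a TACRED relation label into a natural-language slot phrase
    for zero-shot queries: strip the per:/org: namespace and replace
    underscores with spaces, e.g. 'per:employee_of' -> 'employee of'."""
    for namespace in ("per:", "org:"):
        if label.startswith(namespace):
            label = label[len(namespace):]
    return label.replace("_", " ")
```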
An example of the obtained ground truth is illustrated in Table 7: given the query [Dominick Dunne, employee of, ?], a slot filling system is supposed to identify the missing slot with Vanity Fair, i.e. the gold standard object in the KG, by retrieving it from the collection of passages.

Slot Filling Evaluation
Task Given a slot filling query (e, s, ?) and a list of possible slot values [v_1, ..., v_n], where e is the entity as subject, s is the slot/relation and the v_i are the object candidates that co-occur with e in the corpus, we can frame zero-shot slot filling as a ranking problem: argmax_i score_M(e, s, v_i), where score_M is a function that takes a triple as input and provides a score based on the model M. Turning slot filling into a ranking problem has two advantages: 1) we can compare the generative approach with a new set of baselines, and 2) we can limit the generation of the slot values to a pre-defined set of domain specific entities.
Models In order to adapt KGI 1, as pre-trained on T-REx, to the TACRED corpus, we indexed the textual passages using DPR, as described in Section 3. Then we replaced the original Wikipedia index with this new index. During the inference step, we restrict the generation of the slot values to the list of object candidates, i.e. the entities which co-occur with the subject in the inverted index, to facilitate comparability with the ranking baselines. To this aim, we adopt constrained decoding to restrict the vocabulary of tokens during generation.
We use three baselines to compare with our approach for this zero-shot slot filling task. PMI is implemented using the pointwise mutual information between e and v_i based on their co-occurrence in the corpus. In addition, we train a Word2Vec (Mikolov et al., 2013) skip-gram model on the textual corpus, and use it to implement the scoring function as cosine(e + s, v_i) for each candidate filler v_i. This is based on the assumption that a relation s between two (multi)word embeddings e and v can be represented as an offset vector, (v − e) = s, i.e. (e + s) = v (Rossiello et al., 2019; Vylomova et al., 2016). Finally, GPT-2 computes the perplexity of the fragment of text obtained by concatenating the tokens in e, s and each v_i (Radford et al., 2019).
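A minimal sketch of the PMI baseline over passage-level co-occurrence. The add-one smoothing here is our own choice to avoid log 0 and is not necessarily what the actual implementation uses.

```python
import math

def pmi_scores(passages, entity, candidates):
    """Score each candidate filler v by PMI(e, v), estimated from
    passage-level co-occurrence counts with add-one smoothing.

    passages:   iterable of sets of entity mentions, one set per passage
    entity:     the query subject e
    candidates: the co-occurring object candidates v_i
    """
    n = len(passages)
    e_count = sum(1 for p in passages if entity in p)
    scores = {}
    for v in candidates:
        v_count = sum(1 for p in passages if v in p)
        both = sum(1 for p in passages if entity in p and v in p)
        # PMI = log( P(e, v) / (P(e) P(v)) ), with smoothed counts.
        scores[v] = math.log(((both + 1) * n) / ((e_count + 1) * (v_count + 1)))
    return scores
```

Candidates that co-occur with the subject more often than chance receive higher scores, which is exactly the ranking signal this baseline provides.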
Metrics Due to the similarity of slot filling to the knowledge base completion task, we use Mean Reciprocal Rank (MRR) and HIT@k, with k = [1, 5, 10], as evaluation metrics (Bordes et al., 2013). Note that HIT@1 has the same meaning as accuracy for the downstream task on KILT.

Results Table 8 reports the results of our evaluation. KGI 1 achieves substantially better performance than the aforementioned zero-shot baselines on all evaluation metrics. However, HIT@1 is ~28%, which is significantly lower than the numbers reported on the datasets in KILT. This raises the question of how to further improve the transfer learning capabilities of these generative models. Interestingly, HIT@5/10 are high (i.e. ~64%/76%). This indicates our approach would be useful in a human-in-the-loop scenario, providing valuable candidates for the fillers that can be further validated.
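The ranking metrics above can be computed as follows (a standard sketch; a gold answer missing from the candidate list is assigned reciprocal rank 0):

```python
def mrr_and_hits(ranked_lists, gold_answers, ks=(1, 5, 10)):
    """Mean Reciprocal Rank and HIT@k over a set of slot filling queries.

    ranked_lists: one list of candidate fillers per query, best first
    gold_answers: the gold filler for each query
    """
    rr, hits = [], {k: 0 for k in ks}
    for ranked, gold in zip(ranked_lists, gold_answers):
        rank = ranked.index(gold) + 1 if gold in ranked else None
        rr.append(1.0 / rank if rank else 0.0)
        for k in ks:
            hits[k] += int(rank is not None and rank <= k)
    n = len(gold_answers)
    return sum(rr) / n, {k: hits[k] / n for k in ks}
```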
For this purpose, we also conduct few-shot experiments to understand the robustness of KGI 1 by fine-tuning it with very limited amounts of training examples. We randomly pick n example(s) for each relation type from the TACRED training set, with n = [1, 4]. Table 9 gives our hyperparameters for the TACRED few-shot experiments. We show that our system benefits from additional domain specific training data selected from TACRED. Using just one example and four examples per relation, HIT@1 improves by ~5 and ~10 percentage points respectively.

Conclusion
In this paper, we presented KGI, a novel approach to zero-shot slot filling. KGI improves Dense Passage Retrieval using hard negatives from the dense index, and implements a robust training procedure for Retrieval Augmented Generation. We evaluated KGI on both the T-REx and zsRE slot filling datasets, ranking at the top-1 position on the KILT leaderboard with net improvements of +38.24 and +21.25 percentage points in KILT-F1, respectively. Moreover, we proposed and released a new benchmark for zero/few-shot slot filling based on TACRED to evaluate domain adaptation, where our system obtained much better zero-shot results than the baselines. In addition, we observed significant improvements for KGI when rapidly fine-tuned in a few-shot setting. This work opens promising future research directions for slot filling and other related tasks. We plan to apply DPR with dense negative sampling to other tasks in the KILT benchmark, including dialogue and question answering. Likewise, an in-depth investigation of more effective strategies for domain adaptation, such as the combination of zero-shot and few-shot learning involving human-in-the-loop techniques, would be another interesting direction to explore.