Evaluating the Knowledge Base Completion Potential of GPT

Structured knowledge bases (KBs) are an asset for search engines and other applications, but are inevitably incomplete. Language models (LMs) have been proposed for unsupervised knowledge base completion (KBC), yet their ability to do this at scale and with high accuracy remains an open question. Prior experimental studies mostly fall short because they only evaluate on popular subjects, or sample already existing facts from KBs. In this work, we perform a careful evaluation of GPT's potential to complete the largest public KB: Wikidata. We find that, despite their size and capabilities, models like GPT-3, ChatGPT and GPT-4 do not achieve fully convincing results on this task. Nonetheless, they provide solid improvements over earlier approaches with smaller LMs. In particular, we show that, with proper thresholding, GPT-3 makes it possible to extend Wikidata by 27M facts at 90% precision.

Recently, LMs have been purported as a promising source of structured knowledge. Starting from the seminal LAMA paper (Petroni et al., 2019), a trove of works have explored how to better probe, train, or fine-tune these LMs (Liu et al., 2022).
Nonetheless, we observe a certain divide between these late-breaking investigations and practical KB completion. While recent LM-based approaches often focus on simple methodologies that produce fast results, practical KBC so far is a highly precision-oriented, extremely laborious process, involving a very high degree of manual labour, either for manually creating statements (Vrandečić and Krötzsch, 2014), or for building comprehensive scraping, cleaning, validation, and normalization pipelines (Auer et al., 2007; Suchanek et al., 2007). For example, part of Yago's success stems from its validated >95% accuracy, and according to (Weikum et al., 2021), the Google Knowledge Vault was not deployed into production partly because it did not achieve 99% accuracy. Yet, many previous LM analyses balance precision and recall or report precision/hits@k values, implicitly tuning systems towards balanced recall scores, resulting in impractical precision. It is also important to keep in mind the scale of KBs: Wikidata currently contains around 100 million entities and 1.2B statements. The cost of producing such KBs is massive. An estimate from 2018 sets the cost per statement at $2 for a manually curated statement, and 1 ct for an automatically extracted one (Paulheim, 2018). Thus, even small additions in relative terms might correspond to massive gains in absolute numbers. For example, even by the lower estimate of 1 ct/statement, adding one statement to just 1% of Wikidata humans would come at a cost of $100,000.
In this paper, we conduct a systematic analysis of the KB completion potential of GPT, where we focus on high precision. We evaluate by employing (i) a recent KB completion benchmark, WD-KNOWN (Veseli et al., 2023), which randomly samples facts from Wikidata, and (ii) a manual evaluation of subject-relation pairs without object values. Our main results are: 1. For the long-tail entities of WD-KNOWN, GPT models perform considerably worse than what less demanding benchmarks like LAMA (Petroni et al., 2019) have indicated. Nonetheless, we can achieve solid results for language-related, socio-demographic relations (e.g., nativeLanguage).
2. Despite their fame and size, out of the box, the GPT models, including GPT-4, do not produce statements of sufficiently high accuracy for KB completion.
3. With simple thresholding, for the first time, we obtain a method that can extend the Wikidata KB at extremely high quality (>90% precision), at the scale of millions of statements.
Based on our analysis of 41 common relations, we would be able to add a total of 27M high-accuracy statements.
2 Background and Related Work

KB construction KB construction has a considerable history. One prominent approach is human curation, as done, e.g., in the seminal Cyc project (Lenat, 1995), and this is also the backbone of today's most prominent public KB, Wikidata (Vrandečić and Krötzsch, 2014). Another popular paradigm is extraction from semi-structured resources, as pursued in Yago and DBpedia (Suchanek et al., 2007; Auer et al., 2007). Extraction from free text has also been explored (e.g., NELL (Carlson et al., 2010)). A popular paradigm has been embedding-based link prediction, e.g., via tensor factorization like Rescal (Nickel et al., 2011), and KG embeddings like TransE (Bordes et al., 2013). An inherent design decision in KBC is the P/R trade-off: academic projects are often open to trading these freely (e.g., via F-1 scores), yet production environments are often critically concerned with precision, e.g., Wikidata generally discourages statistical inferences, and industrial players likely rely to a considerable degree on human editing and verification (Weikum et al., 2021).
For example, in all of Rescal, TransE, and LAMA, the main results focus on metrics like hits@k, MRR, or AUC, which provide no bounds on precision.
LMs for KB construction Knowledge extraction from LMs provides fresh hope for the synergy of automated approaches and high-precision curated KBs. It provides remarkably straightforward access to very large text corpora: the basic idea of (Petroni et al., 2019) is to just define one template per relation, then query the LM with subject-instantiated versions, and retain its top prediction(s). A range of follow-up works appeared, focusing, e.g., on investigating entities, improving updates, exploring storage limits, incorporating unique entity identifiers, and others (Shin et al., 2020; Poerner et al., 2020; Cao et al., 2021; Roberts et al., 2020; Heinzerling and Inui, 2021; Petroni et al., 2020; Elazar et al., 2021; Razniewski et al., 2021; Cohen et al., 2023; Sun et al., 2023). Nonetheless, we observe the same gaps as above: the high-precision regime, and the completion of already existing resources, are not well investigated.
Several works have analyzed the potential of larger LMs, specifically GPT-3 and GPT-4. They investigate few-shot prompting for extracting factual knowledge for KBC (Alivanistos et al., 2023) or for making the factual knowledge in an LM more explicit (Cohen et al., 2023). These models can aid in building a knowledge base on top of Wikidata or in improving the interpretability of LMs. Despite the variance in the precision of facts extracted from GPT-3, it can peak at over 90% for some relations.
Recently, GPT-4's capabilities for KBC and reasoning were examined (Zhu et al., 2023). This research compared GPT-3, ChatGPT, and GPT-4 on information extraction tasks, KBC, and KG-based question answering. However, these studies focus on popular statements from existing KBs, neglecting the challenge of introducing genuinely new knowledge in the long tail.
In (Veseli et al., 2023), we analyzed to which degree BERT can complete the Wikidata KB, i.e., provide novel statements. Together with the focus on high precision, this is also the main difference of the present work from the works cited above, which evaluate on knowledge already in the KB and do not estimate how much they could add.

Analysis Method
Dataset We consider the 41 relations from the LAMA paper (Petroni et al., 2019). For automated evaluation and threshold finding, we employ the WD-KNOWN dataset (Veseli et al., 2023), which targets long-tail entities by randomly sampling from Wikidata a total of 4 million statements for 3 million subjects in these 41 relations. Besides this dataset for automated evaluation, for the main results, we use manual evaluation on Wikidata entities that do not yet have the relations of interest. For this purpose, for each relation, we manually define a set of relevant subject types (e.g., software for developedBy), which allows us to query for subjects that miss a property.
Evaluation protocol In the automated setting, we first use a retain-all setting, where we evaluate the most prominent GPT models (GPT-3 text-davinci-003, GPT-4, and ChatGPT gpt-3.5-turbo) by precision, recall, and F1. Table 1 shows that none of the GPT models achieves a precision of >90%. In a second step, the precision-thresholding setting, we therefore sort predictions by confidence and evaluate by recall at precision 95% and 90% (R@P95 and R@P90). To do so, we sort the predictions for all subjects in a relation by the model's probability on the first generated token, then compute the precision at each point of this list, and return the maximal fraction of the list covered while maintaining precision greater than the desired value. We threshold only GPT-3, because only GPT-3's token probabilities are directly accessible in the API, and because the chat-aligned models do not outperform it in the retain-all setting.
Approaches to estimate probabilities post-hoc can be found in (Xiong et al., 2023). Since automated evaluations are only possible for statements already in the KB, in a second step, we let human annotators evaluate the correctness of 800 samples of novel (out-of-KB) high-accuracy predictions. We use a relation-specific threshold determined from the automated 75%-95% precision range. MTurk annotators could use Web search to verify the correctness of our predictions on a 5-point Likert scale (correct/likely/unknown/implausible/false). We counted predictions rated as correct or likely as true predictions, and all others as false.
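The precision-thresholding computation described above (sort by first-token probability, find the largest prefix of the ranked list that keeps precision above the target) can be sketched as follows. The function name and sample data are illustrative, not our actual evaluation code.

```python
# Recall-at-precision (R@P) sketch: rank predictions by model confidence,
# then return the recall of the longest confidence-ranked prefix whose
# precision stays at or above the target.

def recall_at_precision(predictions, target_precision):
    """predictions: list of (confidence, is_correct) pairs for one relation."""
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    total_positives = sum(1 for _, ok in ranked if ok)
    best_recall = 0.0
    correct = 0
    for i, (_, ok) in enumerate(ranked, start=1):
        correct += ok
        # precision of the top-i prefix; keep its recall if it qualifies
        if correct / i >= target_precision:
            best_recall = correct / total_positives
    return best_recall

preds = [(0.99, True), (0.97, True), (0.91, True), (0.85, False), (0.80, True)]
print(recall_at_precision(preds, 0.90))  # → 0.75 (top-3 prefix, 3 of 4 positives)
```

With a real model, the confidence would be the probability of the first generated token, as exposed by the GPT-3 API.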
Prompting setup To query the GPT models, we utilize instruction-free prompts listed in the appendix. Specifically for GPT-3, we follow the prompt setup of (Cohen et al., 2023), which is based on an instruction-free prompt consisting entirely of 8 randomly sampled and manually checked examples. In the default setting, all example subjects have at least one object. Since none of the GPT models achieved precision >90%, and we can only threshold GPT-3 for high precision, we focus on the largest GPT-3 model (text-davinci-003) in the following. We experimented with three variations for prompting this model: 1. Examples w/o answer: Following (Cohen et al., 2023), in this variant, we manually selected the few-shot examples such that for 50% of them GPT-3 did not know the correct answer, to teach the model to output "Don't know". This is supposed to make the model more conservative in cases of uncertainty. 2. Textual context: We augment the prompt with retrieved textual context about the subject. 3. Number of examples: We vary the number of few-shot examples.
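The instruction-free prompt can be sketched as below. The "subject # relation" separator mirrors the appendix examples; the concrete example pairs and the default relation shown here are illustrative placeholders, not our actual manually checked prompts.

```python
# Sketch of instruction-free few-shot prompt construction: a list of
# subject/object examples for one relation, followed by the query subject,
# whose object the model is expected to complete.

EXAMPLES = [
    ("As It Is in Heaven", "Swedish"),
    ("The Godfather", "English"),
    # ... the default setting uses 8 such examples
]

def build_prompt(subject, relation="original language", examples=EXAMPLES):
    lines = [f"{s} # {relation}: {o}" for s, o in examples]
    lines.append(f"{subject} # {relation}:")  # model completes the object
    return "\n".join(lines)

print(build_prompt("Das Boot"))
```

The completion after the final colon is then taken as the predicted object, with the first token's probability serving as the confidence score.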

Results and Discussion
Can GPT models complete Wikidata at precision AND scale? In Table 1 we already showed that without thresholding, none of the GPT models achieves sufficient precision. Table 2 shows our main results when using precision-oriented thresholding, on the 16 best-performing relations. The fourth column shows the percentage of subjects for which we obtained high-confidence predictions, the fifth how this translates into absolute statement numbers, and the sixth the percentage that was manually verified as correct (sampled). In the last column, we show how this number relates to the current size of the relation. We find that manual precision surpasses 90% for 5 relations, and 80% for 11. Notably, the best-performing relations are mostly related to socio-demographic properties (languages, citizenship).
In absolute terms, we find a massive number of high-accuracy statements that could be added to the writtenIn relation (18M), followed by spokenLanguage and nativeLanguage (4M each). In relative terms, the additions could increase the existing relations by up to 1200%, though there is a surprising divergence (4 relations over 100%, 11 relations below 20%).
Does GPT provide a quantum leap? Generating millions of novel high-precision facts is a significant achievement, though the manually verified precision is still below what industrial KBs aim for. The wide variance in relative gains also shows that GPT only shines in selected areas. In line with previous results (Veseli et al., 2023), we find that GPT can do well on relations that exhibit high surface correlations (person names often give away their nationality); otherwise the task remains hard.
In Table 3 we report the automated evaluation of precision-oriented thresholding. We find that on many relations, GPT-3 can reproduce existing statements at over 95% precision, and there are significant gains over the smaller BERT-large model. At the same time, it should be noted that (Sun et al., 2023) observed that for large enough models, parameter scaling does not improve performance further, so it is well possible that these scores represent a ceiling w.r.t. model size.
Is this cost-effective? Previous works have estimated the cost of KB statement construction at 1 ct (highly automated infobox scraping) to $2 (manual curation) (Paulheim, 2018). Based on our prompt size (avg. 174 tokens), the cost of one query is about 0.35 ct, with filtering increasing the cost per retained statement to about 0.7 ct. So LM prompting is monetarily competitive with previous infobox scraping works, though with much higher recall potential.
In absolute terms, prompting GPT-3 for all 48M incomplete subject-relation pairs reported in Table 2 would amount to an expense of $168,000, and yield approximately 27M novel statements.
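The cost figures above can be verified with a quick back-of-the-envelope calculation (the per-query price reflects the figures stated above for text-davinci-003, not current API rates):

```python
# Back-of-the-envelope check: 48M queries at ~0.35 ct each, of which
# ~27M high-confidence statements are retained after thresholding.

queries = 48_000_000
cost_per_query_usd = 0.0035      # ~0.35 ct for an avg. 174-token prompt
retained_statements = 27_000_000

total_cost_usd = queries * cost_per_query_usd
ct_per_retained = total_cost_usd / retained_statements * 100

print(f"total: ${total_cost_usd:,.0f}")                      # → total: $168,000
print(f"per retained statement: {ct_per_retained:.2f} ct")   # → 0.62 ct
```

The roughly 0.6-0.7 ct per retained statement sits just below the 1 ct estimate for automated infobox scraping, consistent with the cost-effectiveness claim above.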
Does "Don't know" prompting help? In Table 4 (middle) we show the impact of using examples without an answer. The result is unsystematic, with notable gains in several relations, but losses in others. Further research on calibrating model confidences seems important (Jiang et al., 2021; Singhania et al., 2023).

[Table 3: R@P95 and R@P90 per relation for GPT-3 text-davinci-003, GPT-3 text-curie-001, and BERT-large.]

Does textual context help? Table 4 (right) shows the results for prompting with context. Surprisingly, this consistently made performance worse, with hardly any recall beyond 90% precision. This is contrary to earlier findings like (Petroni et al., 2020) (for BERT) or (Mallen et al., 2023) (for QA), who found that context helps, especially in the long tail. Our analysis indicates that, in the high-precision bracket, misleading contexts cause more damage (leading to high confidence in incorrect answers) than the good that helpful contexts do (boosting correct answers).

How many few-shot examples should one use?
Few-shot learning for KBC works with remarkably few examples. While our default experiments, following (Cohen et al., 2023), used 8 examples, we found no substantial difference with example counts as low as 4.

Conclusion
We provided the first analysis of the real KB completion potential of GPT. Our findings indicate that GPT-3 could add novel knowledge to Wikidata at unprecedented scale and quality (27M statements at 90% precision). Compared with other approaches, the estimated cost of $168,000 is surprisingly low, and well within the reach of industrial players. We also find that, in the high-precision bracket, GPT-3 distills web content to a degree that context augmentation does not easily help.
Open issues remain in particular around identifying high-confidence predictions within an LM's generations (Jiang et al., 2021;Singhania et al., 2023;Xiong et al., 2023), and the choice of examples.

Limitations
Using LMs for automated knowledge generation comes with the standard risk of exacerbating demographic biases. For example, many of the best-performing relations are language-related, where the model presumably often estimates a person's native language entirely from their name.
In terms of reproducibility, it should be noted that our results are tied to a closed-source commercial API. Although GPT-3/4/ChatGPT are widely used in research and industry, and OpenAI has announced plans to keep stable model versions online, long-term reproducibility is not ensured, and the internal workings of GPT are not publicly known.
Although statement generation is at the core of KB completion, for a complete KBC pipeline, we are still missing critical components.This concerns in particular entity disambiguation, which is essential for relations with more than a few hundred possible object values.Similarly, Wikidata and other KBs give critical importance to scrutable referencing of statements.This is not easily possible with LMs.

A Predictions
In the following we show some predictions from our different experimental setups. Q marks the query, P the prediction by GPT-3, and A the ground-truth object. Please note that to prompt GPT-3, relation-specific examples are used at the beginning of each query. The prompts with examples can be seen in Appendix B.

A.1 Standard
This section shows predictions from our standard setup with 8 few-shot examples. The prompts are listed in Appendix B.1.
A.1.1 GPT-3

Q: As It Is in Heaven # original language
P: Swedish
A: Swedish

Q: John Gruber # employer
P: Daring Fireball
A: The Pittsburg Times, Pittsburgh Post, Pittsburgh Gazette, The Gazette Times, Pittsburgh Pirates

Q: Hillary Canyon # continent
P: Antarctica
A: North America

Q: The Incredible Hulk # developer
P: Edge of Reality
A: Edge of Reality, Amaze Entertainment

Q: Architecture and secondary structure of an entire HIV-1 RNA genome # language of work
P: English
A: English

A.1.2 GPT-4

Q: Who are the developers of The Incredible Hulk? Give me a list with no additional text.
P: Edge of Reality
A: Radical Entertainment, Edge of Reality, Amaze Entertainment, Pocket Studios, Universal Interactive

Q: In which languages is Architecture and secondary structure of an entire HIV-1 RNA genome available? Give me a list with no additional text.
P: English
A: I'm sorry, but I couldn't find specific information about the languages in which the mentioned scientific study or information is available.
A.1.3 ChatGPT

Q: Who are the developers of The Incredible Hulk? Give me a list with no additional text.
P: Edge of Reality
A: Edge of Reality, Amaze Entertainment, Beenox, Foundation 9 Entertainment, Griptonite Games, Fizz Factor, Nucore

Q: In which languages is Architecture and secondary structure of an entire HIV-1 RNA genome available? Give me a list with no additional text.
P: English
A: English

A.2 Textual context
In this section we show some predictions from our standard setup with textual contexts. The examples used for prompting in this setup can be seen in Appendix B.2.

Table 4: Effect of variations to the standard prompting setting.
C: The St. Lawrence River is a large river in the middle latitudes of North America. Its headwaters begin flowing from Lake Ontario in a roughly northeasterly... St. Lawrence River, hydrographic system of east-central North America. It starts at the outflow of Lake Ontario and leads into the Atlantic Ocean in the...
A: North America

Q: Cerro El Charabón # continent
C: 65, Estancia El Charabón. 49. 66, Área costero-marina Cerro Verde e Islas de la Coronilla-Área General. 48. 67, Area protegida Laguna de Castillos - Tramo... Casa del Sol Boutique Hotel. A cozy stay awaits you in Machu Picchu. ... Altiplánico San Pedro de Atacama ... Welcome to El Charabon. El Charabon.
A: Americas

Q: Hinterer Seekopf # continent
C: Following the breakup of Pangea during the Mesozoic era, the continents of ... of the best day hikes in Kalkalpen National Park is the Hoher Nock - Seekopf. Dec 5, 2016 ... Hinterer Steinbach. Inhaltsverzeichnis aufklappen ... Inhaltsverzeichnis einklappen ... Charakteristik. Hinweise; Subjektive Bewertung...
A: Europe

Q: Šembera # continent
C: Rephrasing Heidegger: A Companion to Heidegger's Being and Time [Sembera, ... Being and Time (Suny Series in Contemporary Continental Philosophy). Feb 26, 2016 ... Coming from Uganda, UNV PO Flavia Sembera was familiar with diversity. ... shared across the continent while experiencing Zambia's beautiful...
A: Europe

C: largest professional community. Andrzej has 15 jobs listed on their profile. Nov 14, 2016 ... Congratulations to the newest Java Champion Andrzej Grzesik! ... in Poland (sfi.org.pl) and in his work as a Sun Campus Ambassador.