The Web Can Be Your Oyster for Improving Large Language Models

Large language models (LLMs) encode a large amount of world knowledge. However, as such knowledge is frozen at the time of model training, the models become static and limited by the training data available at that time. To further improve the capacity of LLMs for knowledge-intensive tasks, we consider augmenting LLMs with the large-scale web using a search engine. Unlike previous augmentation sources (e.g., a Wikipedia data dump), the web provides broader, more comprehensive, and constantly updated information. In this paper, we present a web-augmented LLM, UNIWEB, which is trained over 16 knowledge-intensive tasks in a unified text-to-text format. Instead of simply using the contents retrieved from the web, our approach makes two major improvements. First, we propose an adaptive search engine assisted learning method that can self-evaluate the confidence level of the LLM's predictions and adaptively determine when to refer to the web for more data, which avoids useless or noisy augmentation from the web. Second, we design a pretraining task, i.e., continual knowledge learning, based on salient span prediction, to reduce the discrepancy between the encoded and retrieved knowledge. Experiments on a wide range of knowledge-intensive tasks show that our model significantly outperforms previous retrieval-augmented methods.


Introduction
With large-scale neural networks, pretrained language models (PLMs) (Brown et al., 2020; Zhao et al., 2023) can encode a large amount of world knowledge, showing phenomenal capability in knowledge-intensive tasks such as fact checking and open-domain question answering (QA). However, this capacity is naturally limited by the information contained in pretraining or finetuning datasets (usually fixed once collected), which are neither up-to-date nor complete (Komeili et al., 2021; Ji et al., 2022). Although model scaling (Brown et al., 2020; Chowdhery et al., 2022; Thoppilan et al., 2022) is a viable way to improve the knowledge capacity of PLMs, it still uses static pretraining datasets, and also leads to significantly larger computational costs as model sizes increase. As a result, the outdated or incomplete knowledge encoded by PLMs may lead to hallucinated or incorrect generations even though the results look plausible (Ji et al., 2022).

Table 1: An example showing that the web covers both more comprehensive (e.g., Korean show) and up-to-date (e.g., recently) information than Wikipedia. Based on the latest news returned by Google Search, T5-LARGE can answer the question correctly.
Recently, drawing on the idea of semi-parametric approaches (Zhao et al., 2022; Guu et al., 2020; Lewis et al., 2020b; Borgeaud et al., 2022), retrieval-augmented approaches have been proposed to equip PLMs with the ability to directly access an external database. As a major knowledge resource, Wikipedia has been widely used in previous work. While highly accurate and well-structured, Wikipedia only covers limited information, both in scope and in time. Besides, even for the topics that Wikipedia covers, grounding PLMs' decisions on a single source of knowledge may create biases (Wagner et al., 2016). Considering these issues, it is time to look beyond Wikipedia (or similar single-source databases) and access broader, more in-depth, and up-to-date knowledge from more sources. Inspired by Komeili et al. (2021) and Piktus et al. (2021), we select the web as the retrieval resource for enlarging the knowledge capacity of PLMs. To motivate our approach, Table 1 presents a sample question that T5 successfully answers with the support of the web (providing the latest news), but not Wikipedia. As we can see, timely and relevant supporting evidence is the key to solving such tasks for PLMs.
In this paper, we aim to capitalize on the web as a source of up-to-date and comprehensive knowledge to solve a wide range of knowledge-intensive tasks. Unlike previous web-augmented studies (Nakano et al., 2021; Menick et al., 2022) that mostly focus on single tasks, we seek to develop a unified framework that integrates the use of the web into PLMs for multi-task solving. Although the idea of leveraging the web to improve PLMs is appealing, it is non-trivial to develop an effective solution. First, PLMs do not always need external evidence for task solving, especially considering that the web contains noisy, biased, or harmful information (Luccioni and Viviano, 2021). Simply retrieving knowledge without considering the example difficulty and PLMs' own capabilities may steer models towards unexpected outputs. Second, PLMs are usually pretrained at an earlier time on a limited corpus, leading to a discrepancy between the encoded knowledge and the retrieved knowledge (i.e., web contents). Therefore, we need more principled approaches to properly integrate the new knowledge into PLMs.
To address the above issues, we present a web-augmented PLM, UNIWEB, to improve the capacity of PLMs on knowledge-intensive tasks. Instead of using a neural network-based retriever, we employ a commercial search engine (i.e., Google Search) to obtain high-quality and comprehensive retrieval results from the web. Based on this idea, we make two major technical contributions. First, we propose a search engine assisted learning method that selectively queries the web only when the PLM is unconfident in its predictions. For this purpose, we design a self-evaluation mechanism to estimate the confidence level of PLMs on task examples. Second, to reduce the discrepancy between the encoded and retrieved knowledge, we design a pretraining task, continual knowledge learning, which integrates the retrieved knowledge into PLMs by predicting salient masked spans in web documents. To train the UNIWEB model, we convert different knowledge-intensive tasks into a unified text-to-text format and conduct supervised multi-task training over 16 tasks across seven categories.
To the best of our knowledge, our model is the first unified web-augmented PLM for a wide range of knowledge-intensive tasks. Extensive experiments show that PLMs can significantly benefit from such an approach, and that a single unified PLM (UNIWEB) is able to achieve (near) state-of-the-art performance on all 16 tasks.

Related Work
Retrieval-Augmented PLMs. Augmenting a pretrained language model with retrieval has been extensively studied in the existing literature (Lewis et al., 2020b; Borgeaud et al., 2022; Izacard et al., 2022; Lee et al., 2019; Guu et al., 2020). For example, REALM (Guu et al., 2020) and RAG (Lewis et al., 2020b) incorporate a differentiable retriever into pretrained models, leading to promising results on question answering. However, these studies usually rely on a sub-optimal retriever to access a static and limited knowledge resource, i.e., Wikipedia. By contrast, our model utilizes a well-developed search engine to gain broader, more in-depth, and up-to-date knowledge from the web. Several studies have also looked at how the Internet can help models, but they only focus on single tasks such as question answering (Nakano et al., 2021; Menick et al., 2022) and dialogue (Komeili et al., 2021). WebGPT (Nakano et al., 2021) uses human feedback to optimize answer quality, hiring many labelers to judge the accuracy of answers. Komeili et al. (2021) retrieve knowledge from the web for every dialogue turn without considering the necessity. Piktus et al. (2021) only present an empirical study that investigates the impact of replacing Wikipedia with a large-scale web-like corpus and adopting different retrieval models. We are also aware of some related studies (Jiang et al., 2023), but we take a different, active approach to knowledge retrieval. In this paper, we develop a unified language model for solving a wide spectrum of knowledge-intensive tasks. Our model can selectively decide whether to access the web, and continuously learn from the retrieved knowledge.
Knowledge-Intensive Learning. Recent work has shown that PLMs' parameters implicitly store linguistic or factual knowledge (Petroni et al., 2019; Roberts et al., 2020). However, the implicitly encoded knowledge is limited by the model's scale and training data, contradicting the dynamic nature of the world. Hence, many researchers propose to fuse relevant external knowledge from texts with the encoded knowledge of PLMs to deal with knowledge-intensive tasks such as open-domain QA (Guu et al., 2020; Lewis et al., 2020b), entity linking (Wu et al., 2019), fact verification (Liu et al., 2019b), and commonsense reasoning (Lin et al., 2020). Wikipedia has been the most widely used knowledge source for these tasks, but it is still limited despite its wide coverage. Instead, we rely on the real-time web. Existing studies usually design task-specific training, architectures, and knowledge fusion methods to exploit knowledge sources. In this work, we aim to develop a single unified framework that can be used for most knowledge-intensive tasks.

Task Formulation
Knowledge-intensive tasks (Yin et al., 2022) aim to leverage external knowledge resources to accomplish a broad range of tasks such as open-domain question answering and fact verification. Following prior work (Lewis et al., 2020b; Guu et al., 2020), we employ a retrieval-augmented generation framework that consists of two components: a retriever R and a generator G. Given an input text X such as a question, the retriever R learns to retrieve a set of top-K passages P = {p_1, ..., p_K} from a knowledge resource. Conditioned on the input text X and the retrieved passages P, the generator G aims to generate the output text Y. The model is trained to maximize the joint likelihood:

p(Y|X) = p(P|X) · p(Y|X, P),    (1)

where the retriever R models p(P|X) and the generator G models p(Y|X, P). To implement this framework, previous studies usually adopt a trainable neural retriever based on a (single) knowledge resource such as Wikipedia or knowledge bases. However, such an approach can only access limited, static knowledge. In this paper, we rely on a general, off-the-shelf search engine as the retriever to access both comprehensive and up-to-date knowledge from the whole web.

Approach
Our proposed web-augmented PLM, UNIWEB, is depicted in Figure 1. We first transform knowledge-intensive tasks into a unified text-to-text paradigm and consider the web as a general form of knowledge source. Based on the retrieved knowledge, we further design two training objectives to build our model. In the following sections, we describe our method in detail.

Knowledge-Intensive Tasks Unification
Previous retrieval-augmented approaches usually adopt diverse architectures and different types of knowledge resources (Yin et al., 2022). Instead, we aim to leverage a general knowledge source (i.e., the web) to develop a unified framework that can fulfill various (or most) knowledge-intensive tasks. Specifically, we unify 16 typical knowledge-intensive tasks across 7 task families, including fact checking, slot filling, dialogue, open-domain question answering, commonsense question answering, commonsense reasoning, and natural language inference. We convert these tasks into a general text-to-text format for training a unified PLM.
These tasks are mainly drawn from prior studies (Petroni et al., 2020; Piktus et al., 2021), in which the original tasks of fact checking, slot filling, dialogue, and open-domain QA are designed specifically around knowledge retrieved from Wikipedia, while the other tasks of commonsense QA, commonsense reasoning, and natural language inference focus on more specific commonsense knowledge, going beyond Wikipedia. We consider these knowledge-intensive tasks as typical NLP tasks to show that the large-scale web can be especially useful for satisfying diverse information needs; a sketch of the unified conversion follows below. More details about each task can be found in Appendix A.
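As a concrete illustration, the following minimal sketch shows how one task instance might be rendered into the unified text-to-text format. The exact prompt strings are task-specific (see Appendix A and Table 7), so the field layout and names here are assumptions for illustration only.

```python
def to_text_to_text(task_instruction, input_text, passages, options=None):
    """Hypothetical rendering of one task instance into the unified
    text-to-text format. The real task-specific prompts are listed in
    Table 7; this layout is illustrative, not the actual template."""
    parts = [task_instruction, input_text]
    parts += [f"Passage: {p}" for p in passages]
    if options:  # only for tasks that provide candidate answers
        parts.append("Option: " + " | ".join(options))
    return " ".join(parts)

# e.g., a fact-checking instance with two retrieved passages:
source = to_text_to_text(
    "Verify the following claim:",
    "The 2022 World Cup final was held in Qatar.",
    ["Passage text 1 ...", "Passage text 2 ..."],
)
```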

Web-based Knowledge Retrieval
Unlike prior work that retrieves documents from offline corpora such as Wikipedia (Guu et al., 2020; Lewis et al., 2020b), we propose to retrieve comprehensive and up-to-date information from the online web through a general-purpose search engine. Although it is intuitive to extend the retrieval-augmented framework with the web as the knowledge resource, it is non-trivial to effectively leverage the knowledge found on the web. Documents on the web have inconsistent quality, and contain noisy, biased, or even harmful content (Luccioni and Viviano, 2021). Low-quality content may steer PLMs towards seemingly plausible but factually incorrect outputs (Ji et al., 2022). Moreover, compared to a local neural retriever, a black-box search engine can only be accessed through queries, which is less controllable and makes it harder to filter noisy content out of the search results. In addition, PLMs do not always need external knowledge for task solving, especially for easy tasks. Therefore, we should request more knowledge only when needed.

PLM Knowledge Evaluation
To address the above challenges, it is essential to evaluate a PLM's own capability on a task and the necessity of referring to external knowledge. In our approach, we consider a non-trivial question before retrieval: does a PLM need to retrieve knowledge for a specific task instance? For this purpose, we investigate whether or not PLMs can correctly answer questions without using external evidence. According to a recent study (Kadavath et al., 2022), PLMs can self-evaluate the confidence level of their generation results (e.g., True or False). Hence, we propose to utilize this self-evaluation mechanism to determine whether it is necessary to access additional web information.
Self-Evaluation. Specifically, we hypothesize that when a model "knows" the true output for a specific input (i.e., is confident about its output), sampling the outputs many times will result in an output distribution with small entropy. Following Kadavath et al. (2022), we sample n (n = 200) different outputs for each input and estimate the entropy of the output distribution as follows:

H(Ŷ|X) = − Σ_Ŷ p(Ŷ|X) log p(Ŷ|X),    (2)

where Ŷ = ⟨w_1, ..., w_i, ..., w_m⟩ is an output text generated by the model G and p(Ŷ|X) is estimated from the n samples. Then, we set an entropy threshold η. If H(Ŷ|X) is higher than η, the model is unconfident about its outputs and needs supporting evidence from the web; otherwise, it does not. We will further demonstrate the predictive power of the entropy (Eq. (2)) in estimating the model confidence for knowledge retrieval.
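To make this concrete, here is a minimal sketch of the self-evaluation step. The `sample_output(x)` function is an assumed helper that draws one decoded output from the model; everything else follows Eq. (2) directly, and the default threshold of 4.0 is the value reported in Appendix B.

```python
import math
from collections import Counter

def needs_web_evidence(sample_output, x, n=200, eta=4.0):
    """Estimate H(Y_hat | X) from n sampled outputs (Eq. 2) and flag
    the input for web retrieval when the entropy exceeds eta.
    `sample_output` is an assumed helper returning one sampled
    output string for input x."""
    counts = Counter(sample_output(x) for _ in range(n))
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy > eta
```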

Web Knowledge Retrieval
In active learning (Ren et al., 2021), a prediction model can interactively query for the labels of examples with low confidence. This learning method not only reduces the cost of data labeling, but also removes noisy and unhelpful data that models cannot benefit from. Inspired by this, we propose a search engine assisted learning approach, in which PLMs choose the hard cases that they cannot solve (as assessed by self-evaluation) and query an off-the-shelf search engine for knowledge retrieval. Different from active learning, our approach does not directly query for the final answer (largely reducing the labeling effort), but instead for the supporting evidence needed to solve the task. After retrieving knowledge from the web, it is critical to filter out noisy content and select the most helpful and relevant knowledge that can enhance PLMs' confidence to generate correct outputs. Therefore, we design a two-stage filtering mechanism for the retrieved knowledge.
Search Engine Assisted Learning. Specifically, for the hard examples, we take the input text X verbatim as a search query and issue a call to Google Search via its API. For each query, we retrieve the top-K HTML pages and parse them to obtain clean texts, resulting in a set of passages P = {p_1, ..., p_K}. To filter out noisy and irrelevant information, in the first stage, we chunk each passage into paragraphs, compute the cosine similarity between the input and paragraph embeddings, and select the five most relevant paragraphs to form the final passage. In the second stage, we adopt the same method as self-evaluation (Eq. 2) to compute the model confidence given the input and each processed passage, and select the passages with high confidence as the final evidence.
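The following sketch illustrates the two-stage filter under stated assumptions: `embed(text)` returns a unit-norm vector, and `confidence(x, passage)` returns a scalar score (e.g., the negative of the entropy in Eq. (2) computed with the passage prepended). The number of passages kept in stage two is not specified in the text, so the `keep` parameter is an assumption.

```python
import numpy as np

def two_stage_filter(x, pages, embed, confidence, keep=3):
    """Stage 1: within each parsed page, keep the five paragraphs most
    similar to the input. Stage 2: keep the passages on which the model
    is most confident. `embed` and `confidence` are assumed helpers."""
    q = embed(x)
    candidates = []
    for page in pages:
        paras = [p for p in page.split("\n") if p.strip()]
        sims = [float(np.dot(q, embed(p))) for p in paras]  # cosine (unit-norm)
        ranked = [p for _, p in sorted(zip(sims, paras), key=lambda t: -t[0])]
        candidates.append(" ".join(ranked[:5]))
    candidates.sort(key=lambda p: confidence(x, p), reverse=True)
    return candidates[:keep]
```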

Knowledge-based Model Pretraining
In most previous work, the retrieval model is either pretrained using self-supervised objective such as MLM (Guu et al., 2020;Borgeaud et al., 2022) or trained for specific tasks (Lewis et al., 2020b).
In this work, we focus on explicitly training webaugmented PLMs in a supervised and massively multi-task fashion (Aribandi et al., 2022) using the mixture of knowledge-intensive tasks (Section 4.1).Besides, to integrate the retrieved knowledge into PLMs, we design a continual knowledge learning task based on the retrieved passages.
Knowledge-Intensive Learning. This pretraining objective uses the retrieved knowledge and labeled data from the unified knowledge-intensive tasks. Formally, given an input text X and retrieved passages P, the objective is to minimize the negative log-likelihood loss over the output text Y:

L_1 = − Σ_{i=1}^{m} log p(w_i | X, P, w_{<i}),    (3)

where w_i denotes the i-th token of the output text Y. We concatenate the input text X and the retrieved passages P using manually-written, task-specific prompts (shown in Appendix A). Pretrained on this unified knowledge-grounded text-to-text format, our model can be easily applied to diverse knowledge-intensive tasks. It has been reported that ensembling many tasks, distributions, and domains during pretraining can improve PLMs' generalization to new tasks (Aribandi et al., 2022).
Continual Knowledge Learning. Due to limited pretraining on a single static corpus, the knowledge encoded in PLMs has a discrepancy with the knowledge retrieved from the web. Thus, to reduce this discrepancy and integrate the newly retrieved knowledge into PLMs, we design a self-supervised pretraining task, i.e., continual knowledge learning. For most knowledge-intensive tasks such as slot filling and fact verification, named entities are of special importance. Thus, this pretraining task aims to predict the salient masked spans (i.e., named entities) in retrieved passages. First, we use a BERT-based (Devlin et al., 2019) tagger trained on CoNLL-2003 data (Sang and De Meulder, 2003) to identify named entities and then mask entities such as "United States". Then, our model is trained to predict these masked spans by minimizing the masked span prediction loss:

L_2 = − Σ_k Σ_j log p(s_j | p̃_k),    (4)

where s_j is the j-th masked span for the passage p_k, and p̃_k denotes the unmasked tokens in p_k.
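A minimal sketch of the salient span masking step is given below. It assumes a hypothetical `ner(text)` helper that returns character-level (start, end) spans of named entities (e.g., from a BERT tagger trained on CoNLL-2003); the T5-style sentinel format is one plausible choice, not necessarily the paper's exact target encoding.

```python
def mask_salient_spans(passage, ner):
    """Replace each named-entity span with a sentinel token and build
    the corresponding prediction target; the loss in Eq. (4) is then
    the NLL of this target. `ner` is an assumed helper returning
    sorted, non-overlapping (start, end) character spans."""
    source_parts, target_parts, cursor = [], [], 0
    for i, (start, end) in enumerate(sorted(ner(passage))):
        sentinel = f"<extra_id_{i}>"
        source_parts += [passage[cursor:start], sentinel]
        target_parts += [sentinel, passage[start:end]]
        cursor = end
    source_parts.append(passage[cursor:])
    return "".join(source_parts), " ".join(target_parts)

# e.g.: mask_salient_spans("Athens hosted the 1896 games.", lambda t: [(0, 6)])
# -> ("<extra_id_0> hosted the 1896 games.", "<extra_id_0> Athens")
```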

Experiments
In this section, we detail the experimental setup and then highlight the main observations from our results.

• Slot Filling: T-REx (ElSahar et al., 2018) and zero-shot RE (Levy et al., 2017).
We convert these tasks into a unified text-to-text format. We take the input text as the query to retrieve the top-10 passages from CCNet. After pre-processing, we mix the training sets of these datasets to pretrain our model. We present the statistics of the datasets and the pre-processing details in Appendix A.
Baselines. We compare UNIWEB to a wide range of models as follows:

• BART (Lewis et al., 2020a) and T5 (Raffel et al., 2020). These are two representative text-to-text PLMs for solving knowledge-intensive tasks. We adopt the large versions for a fair comparison.

• REALM (Guu et al., 2020) and RAG (Lewis et al., 2020b). These are two well-known retrieval-augmented PLMs that combine a parametric model with a non-parametric memory of Wikipedia via a neural retriever.

• Fusion-in-Decoder (FID) (Izacard and Grave, 2020). It is based on T5, where the encoder encodes the input text with each passage and the decoder combines the encoded representations.

• Maillard et al. (2021) and Piktus et al. (2021) equip BART and FID with retrieval models, i.e., BM25 (Robertson et al., 2009), DPR (Karpukhin et al., 2020), DPR_MULTI trained in a multi-task fashion, and DPR_CCNET trained on CCNet.
Note that these baselines are trained on individual tasks and datasets, while our model is pretrained in a multi-task manner. We use BM25 to retrieve passages from CCNet during pretraining. The BM25 and DPR indices are taken from previous work (Piktus et al., 2021). Since retrieval supervision is lacking to train DPR for the tasks in Table 3, we only report the BM25 results for them. The implementation details are given in Appendix B.
Evaluation Metrics. We adopt various tasks and datasets in our experiments, which need to be evaluated differently. Following Petroni et al. (2020), we use Exact Match (EM) for datasets with extractive (i.e., Natural Questions, TriviaQA) or short abstractive output texts (i.e., HotpotQA); for datasets with long abstractive output texts, we use ROUGE-L (Lin, 2004) for ELI5 and F1-score for Wizard of Wikipedia; we use Accuracy for the remaining tasks. To compute EM and F1-score, we post-process the gold and predicted output texts by lowercasing, stripping, and removing punctuation and duplicate whitespace (Rajpurkar et al., 2016).
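For reference, a minimal sketch of this normalization is shown below; it covers only the steps named above (lowercasing, stripping, punctuation and duplicate-whitespace removal), not necessarily every detail of the official SQuAD evaluation script.

```python
import re
import string

def normalize_answer(text: str) -> str:
    """Lowercase, strip, drop punctuation, and collapse duplicate
    whitespace before computing EM or F1."""
    text = text.lower().strip()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    return re.sub(r"\s+", " ", text)

def exact_match(prediction: str, gold: str) -> int:
    return int(normalize_answer(prediction) == normalize_answer(gold))
```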

Main Results
Table 2 and Table 3 show the results of UNIWEB and the baselines on 16 knowledge-intensive tasks. First, on almost all knowledge-intensive tasks, combining PLMs with explicitly retrieved knowledge achieves higher performance. Moving from Wikipedia and CCNet to the web, we observe that broader coverage of knowledge leads to better results. Compared to BART and T5, retrieval-based models benefit from the retrieved knowledge.
Second, the tasks in Table 2 are specifically designed around knowledge from Wikipedia. Thus, there is a strong bias towards Wikipedia as the knowledge resource. We observe that CCNet only achieves comparable results or even suffers a large performance drop. However, for the tasks in Table 3, which require knowledge beyond Wikipedia, CCNet is more competitive.
Finally, our UNIWEB model achieves the best results on most knowledge-intensive tasks. On one hand, our model is trained in a multi-task manner, which benefits from knowledge sharing across tasks. On the other hand, our model can access broad and up-to-date knowledge from the web via the search engine, which can fulfill more diverse information needs. Moreover, the search engine works much better than traditional sub-optimal retrieval methods that rely on end-to-end training or word matching.

Detailed Analysis
We report a detailed analysis of UNIWEB on several datasets; we observe similar findings on the other datasets.
Ablation Study. Our UNIWEB model is the first unified PLM that uses the web as the knowledge source for knowledge-intensive tasks. To examine the importance of the web, we design two counterparts: (1) w/ Wikipedia and (2) w/ CCNet, which replace the web with Wikipedia or CCNet and adopt BM25 to retrieve documents. Besides, to avoid the negative impact of noisy and biased information, we adopt the self-evaluation method to adaptively access knowledge from the web; we remove this method to test its effect (w/o SE). Finally, we remove the pretraining task, i.e., continual knowledge learning, to test its importance (w/o CKL).
The results are shown in Table 4. We can see that replacing the web with Wikipedia or CCNet suffers from a large performance drop. Besides, the self-evaluation method benefits our model a lot in terms of knowledge filtering. The pretraining task also improves the knowledge capacity of our model.
Sensitivity Analysis. In the self-evaluation mechanism, we use entropy to evaluate model confidence. To verify its effectiveness, we present the distribution of H(Ŷ|X) depending on whether or not the model gets the question correct. As shown in Figure 2(a), the average entropy of the questions that our model answers correctly is lower than that of the questions it answers incorrectly. This indicates that the entropy has some predictive power for model confidence. Besides, the quality of the retrieved documents largely affects the prediction of our model. Thus, in Figure 2(b), we test the model accuracy by varying the rank range of the top-K search results over {1-5, 6-10, 11-15, 16-20}. We can see that PLM performance drops as the document rank increases, i.e., as document quality decreases. However, the top 6-10 retrieved passages still achieve results comparable to the top 1-5 ones, which motivates our setting of K = 10.

Case Study
In this section, we perform a qualitative analysis on REALTIME QA (Kasai et al., 2022), a benchmark requiring real-time, up-to-date, and comprehensive knowledge over a broad range of topics (such as politics, business, sports, and entertainment). The evaluation results are shown in Appendix C. Our UNIWEB model with Google Search performs the best. We present an example in Table 5 about the "World Cup final 2022" in the sports topic. Using the question text as the query, we retrieve the top-1 passage from Wikipedia, CCNet, and the web. Since Wikipedia and CCNet are both static and limited knowledge resources, the retrieved passages are not fresh in time ("2014" and "2018") even though they are on the same topic, "World Cup". The typical retrieval methods (BM25 or DPR) rely largely on fuzzy semantic matching, also leading to incorrect retrieval. In contrast, retrieving from the web using a search engine ensures that our model obtains the most up-to-date and relevant information, based on which it generates the correct answer "Croatia and Morocco". We present more examples in Appendix D.

Conclusion
This paper presented UNIWEB, a unified web-augmented framework for a wide range of knowledge-intensive tasks. We convert 16 tasks into a text-to-text generation format for training. We propose a search engine assisted learning method to selectively retrieve documents from the web through Google Search. Furthermore, to reduce the discrepancy between the encoded and retrieved knowledge, we design a pretraining task, i.e., continual knowledge learning, to integrate the retrieved knowledge into PLMs. Experiments on 16 tasks show the effectiveness of our web-augmented model compared to previous retrieval-augmented models. In future work, we will investigate the effect of web content in detail and consider applying our model to more types of downstream tasks.

Limitations
For web-augmented models, including ours, the possible deterioration of search results from the search engine highlights the importance of deriving an effective method to interact with the huge web.
Search engines are often perceived as black-box and non-transparent by end users. Therefore, many works have proposed "learning to search", decomposing complex questions into simpler queries, which may improve the performance of web-based models (Nakano et al., 2021; Komeili et al., 2021).
In our model, we use a commercial search engine as the retriever to work with the whole web as the knowledge source. Since the web is not curated and well-structured like Wikipedia, we may encounter unexpected safety issues, including misinformation and harmful content. While we have relied on the security controls of the search engine, more attention should be paid to better understanding the risks and providing effective ways to mitigate them. We hope our simple approach and strong results will encourage more future work by the community to tackle these questions. To encourage the community to investigate them and to ensure reproducibility, we will release the search URLs used in our experiments after the reviewing process.
As for a further potential concern, since we use the search engine to access real-time information, we do not have tight control over the retrieved results, unlike traditional end-to-end retrieval (Guu et al., 2020; Lewis et al., 2020b). Both changes in the search engine's logic and newly published information might create discrepancies over time. This is also an issue we have to tackle to build a stable web-based solution for PLMs.
• Commonsense reasoning is intended to utilize commonsense knowledge to reason about certain aspects of the given text (Sakaguchi et al., 2020). Therefore, we consider the given text as input and the prediction as output.
• Natural language inference is the task of determining whether the given "hypothesis" logically follows from the "premise" (Storks et al., 2019). It requires deep knowledge about the relationship between the hypothesis and the premise. We consider the premise as input and the hypothesis as output.
For each category, we choose several representative tasks to construct our pretraining corpus. Detailed information on the included tasks is listed in Table 6. To mitigate the huge disparity between dataset sizes, we follow Raffel et al. (2020) and use the temperature-scaled mixing strategy with a rate of T = 2 to set the proportion of data coming from each task; a sketch of this computation follows below. During pretraining, for each task example, we use BM25 to retrieve the top-10 passages from CCNet as our external knowledge. The input texts are concatenated with the retrieved passages using manually-written prompts. The "Option" string is applied only when the input text is provided with several candidate answers. The blanks "[passage_n]" and "[option_n]" are filled with the retrieved passages and candidate answers, and the blank "[Task Instruction]" indicates the task for our model, which is task-specific and detailed in Table 7.
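As a reference for the mixing computation, here is a small sketch of temperature-scaled mixing. The artificial dataset-size limit K = 2^21 is the value used by Raffel et al. (2020) and is an assumption here, as the section only states T = 2.

```python
def mixing_proportions(dataset_sizes, T=2.0, K=2**21):
    """Temperature-scaled mixing: clip each dataset size at K, raise
    the rates to the power 1/T, and renormalize. Larger T flattens
    the proportions toward uniform."""
    scaled = [min(n, K) ** (1.0 / T) for n in dataset_sizes]
    total = sum(scaled)
    return [s / total for s in scaled]

# e.g., mixing_proportions([1_000_000, 40_000, 3_000], T=2.0)
# -> roughly [0.80, 0.16, 0.04]: small tasks are up-weighted.
```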

B Implementation Details
Our UNIWEB model uses a Transformer with 12 layers in both the encoder and decoder (406M parameters), the same model size as BART_LARGE (Lewis et al., 2020a). The hidden size is 1,024 and the inner hidden size of the feed-forward network is 4,096. We employ the byte-pair encoding (BPE) tokenizer, and the vocabulary size is 50,267. We initialize the backbone with the MVP model (Tang et al., 2022), a supervised pretrained PLM, to provide a good starting point for generation, following previous work (Dong et al., 2019; Zhang et al., 2020). We pretrain the model with a batch size of 8,192 on Tesla A100 40GB GPUs. For our model, the maximum length of both input and output sequences is set to 1,024 so that examples can contain more tokens. We optimize the model with a constant learning rate of 2 × 10−5 using the standard sequence-to-sequence cross-entropy loss. We apply the AdamW optimizer (Loshchilov and Hutter, 2019) with β_1 = 0.9, β_2 = 0.98, ϵ = 1 × 10−6 to improve training stability (Liu et al., 2019a). The weight decay coefficient is 0.1. For testing, we select the checkpoint with the highest validation performance. According to the results shown in Figure 2(a), we set the entropy threshold η to 4.0. The overall pipeline of our model is listed in Algorithm 1.
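For concreteness, the stated optimization hyperparameters map onto PyTorch as in the following sketch; the `model` here is a stand-in module, not the actual 406M-parameter UNIWEB backbone.

```python
import torch

model = torch.nn.Transformer()  # stand-in for the 406M encoder-decoder

# AdamW with the hyperparameters stated above: constant LR 2e-5,
# betas (0.9, 0.98), eps 1e-6, weight decay 0.1.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=2e-5,
    betas=(0.9, 0.98),
    eps=1e-6,
    weight_decay=0.1,
)
```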
Since the tasks of fact checking, slot filling, dialogue, and open-domain QA are specially designed based on the knowledge from Wikipedia, we require the search engine to retrieve the top-1 passage from the website https://en.wikipedia.org.

C Supplementary Experiments
RealTime QA. Previous QA systems mostly assume that answers are static regardless of the time of the query (Chen and Yih, 2020). In this section, we use the REALTIME QA benchmark (Kasai et al., 2022) to test models on real-time, instantaneous information. Each week, REALTIME QA retrieves news articles and ~30 human-written, multiple-choice questions from news websites (CNN, THE WEEK, and USA Today), covering diverse topics such as politics, business, sports, and entertainment. We adopt the original setting of this benchmark.

Self-Evaluation Criteria. To evaluate model confidence on task examples, we adopt the entropy as the criterion in Section 4.2.1. In this part, we test additional criteria, following Kadavath et al. (2022). First, we consider a sample-enhanced prompting method, where we generate five samples with beam search and ask the model about the validity of the first sample (the one with the highest score), using prompts of the form "The possible answer is: ..." followed by "Is the possible answer: ...". If the model self-evaluates the possible answer as False, our model will leverage the search engine to access the web; otherwise it will not. We show the probability of predicting True, depending on whether the model gets the question correct, in Figure 3(a). However, according to Kadavath et al. (2022), this self-evaluation method is mainly suitable for question answering tasks with short-form answers and benefits less on tasks with long-form answers. Second, we consider using the loss as the criterion to evaluate model confidence. This approach generates a sample and then looks at the model's loss on this sample, averaged over all tokens, like the knowledge-intensive learning loss (Eq. 3). If the loss for an example is higher than a threshold (e.g., 0.5), we consider the model unconfident about this example and query the web to retrieve knowledge. In Figure 3(b), we show the loss of samples that the model gets correct or incorrect.
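A minimal sketch of the loss criterion, assuming the model's logits and the token ids of the generated sample are available; the 0.5 threshold follows the example above.

```python
import torch
import torch.nn.functional as F

def loss_confidence(logits: torch.Tensor, token_ids: torch.Tensor,
                    threshold: float = 0.5) -> bool:
    """Average per-token cross-entropy of a generated sample; returns
    True when the model counts as unconfident and should query the web.
    logits: (seq_len, vocab_size); token_ids: (seq_len,)."""
    avg_loss = F.cross_entropy(logits, token_ids).item()
    return avg_loss > threshold
```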

D Case Study
In Table 9, we present three examples from TriviaQA (Joshi et al., 2017), CommonsenseQA (Talmor et al., 2019), and NumerSense (Lin et al., 2020). The first example, from TriviaQA, is specially designed based on the knowledge from Wikipedia. Therefore, we can observe that Wikipedia contains the most relevant passage about the topic "US nuclear reactor accident in 1979". In addition, the web can provide another source of knowledge about this topic. Although CCNet covers this content, it does not give a clear answer to the question (i.e., the full name of the US nuclear reactor). The second example, from CommonsenseQA, involves questions related to commonsense knowledge going beyond Wikipedia. Therefore, Wikipedia can only provide a fuzzy description passage about "Guitar". The web and CCNet return diverse knowledge, but the passage returned by the search engine is more helpful. The third example, from NumerSense, requires models to reason about numbers. Here, CCNet provides a passage with incorrect information, while the web and Wikipedia return passages about the rules of "tic-tac-toe", which lead to the correct answer "three".

Figure 1 :
Figure 1: Overview of our proposed web-augmented pretrained language model UNIWEB.

Algorithm 1
The pseudo-code for UNIWEB.
Require: A search engine (i.e., Google Search) connecting with the large-scale web
1: Input: Training data D
2: Output: Model parameters Θ
3: Initialize Θ
4: while not converged do
5:   for iteration = 1 to |D| do
6:     Acquire an input-output pair ⟨X, Y⟩
       ▷ Self-Evaluation
7:     Compute the entropy H(Ŷ|X) of the sampled output distribution (Eq. 2)
       ▷ Web Knowledge Retrieval
8:     if H(Ŷ|X) > η then
9:       Query the search engine with X and filter the retrieved results to obtain passages P
10:    else
11:      P ← ∅
12:    end if
       ▷ Knowledge-Intensive Learning
13:    Generate the output text Ŷ and compute the loss L_1 based on X and P (Eq. 3)
       ▷ Continual Knowledge Learning
14:    Mask salient spans of P for the CKL pretraining and compute the loss L_2 (Eq. 4)
       ▷ Model Optimization
15:    Compute the gradients and update model parameters Θ based on L_1 and L_2
16:  end for
17: end while
18: return Θ

Figure 3 :
Figure 3: (a) Probability of True for prompts in HotpotQA; (b) Loss of samples in HotpotQA.

Table 2 :
Evaluation results on the test set for fact checking, slot filling, dialogue, and open-domain QA. We report Accuracy for FEVER, T-REx, and zsRE; EM for NQ, HotpotQA, and TriviaQA; ROUGE-L for ELI5; and F1-score for WoW. The results come from no-retrieval models (top section), Wikipedia/CCNet-based models (middle section), and web-based models (bottom section). Bold and underlined numbers denote the best and second best methods.

Table 3 :
Accuracy on the dev set for commonsense QA, commonsense reasoning, and natural language inference (NLI). Bold and underlined numbers denote the best and second best performance. Following Piktus et al. (2021), since retrieval supervision is lacking to train DPR, we only report the BM25 results.

Table 4 :
Ablation study on five tasks.

Question: With France and Argentina set to battle it out on Sunday in the World Cup final 2022, which teams will go head to head for the third place?

Table 5 :
A qualitative example showing the top-1 retrieved passages from Wikipedia, CCNet, and the web, and the corresponding model predictions. The words in red denote the keywords related to the question.

Table 6 :
The statistics of our 16 knowledge-intensive tasks.

Table 7 :
Task instructions for each task category.
"US nuclear reactor accident in 1979".In addition, the web can provide another source of knowledge about this topic.Although CCNet covers this content, it does not give a clear answer to this question (i.e., full name of the US nuclear reactor).The second CommonsenseQA dataset involves questions related to commonsense knowledge going beyond Wikipedia.Therefore, Wikipedia can only provide a fuzzy description passage about "Guitar".The web and CCNet return diverse knowledge but the passage returned by search engine is more helpful.The thrid NumerSense dataset requires models to reason about the number.For the third example, CCNet provides a passage with incorrect information.While, the web and Wikipedia return passages about the rule of "tic-tac-toe", which can result in the correct answer "three".The Three Mile Island Unit 2 reactor, near Middletown, Pa., partially melted down on March 28, 1979.This was the most serious accident in U.S. commercial nuclear power plant operating history, although its small radioactive releases had no detectable health effects on plant workers or the public... https://www.nrc.gov/reading-rm/doc-collections/factsheets/3mile-isle.htmlQuestion: What do people typically do while playing guitar?Candidate Answers: A. cry B. hear sounds C. singing D. arthritis E. making music Gold Answer: singing Top-1 Wikipedia Passage Top-1 CCNet Passage Top-1 Web Passage ... The guitar is a fretted musical instrument that typically has six strings.It is usually held flat against the player's body and played by strumming or plucking the strings with the dominant hand, while simultaneously pressing selected strings against frets with the fingers of the opposite hand.A plectrum or individual finger picks may also be used to strike the strings... https://en.wikipedia.org/wiki/Guitar ... I was playing a brand-new game that had no rules and nothing established.I was really shy about it at first, because I hadn't looked out into the world to find other people who, of course, had done things like this.I heard Fred Frith play, and I knew he played his guitar with objects not typically associated with the guitar... https://www.premierguitar.com/articles/24026-janetfeder-prepared-for-all-genres ... Practicing the guitar regularly can enhance your concentration and expand your attention span.It takes an adequate focus to become an expert guitarist.Focusing becomes a habit for your mind and will help you concentrate better on other everyday chores too... https://www.chasingsound.com/posts/10-health-bene fits-of-playing-guitar Question: How do you win at tic-tac-toe get <mask> of your symbols in a row?Xs and Os (Canadian or Irish English) is a paper-and-pencil game for two players who take turns marking the spaces in a three-bythree grid with X or O.The player who succeeds in placing three of their marks in a horizontal, vertical, or diagonal row is the winner... https://en.wikipedia.org/wiki/Tic-tac-toe ...You just make a 4x4 box instead of a 3x3 box.Then the same rules apply, only you need to get 4 in a row to win.When playing, does putting my symbol in the middle guarantee me winning?No.With both players playing optimally, the result is always a draw.How many X's and O's do I need to play tic tac toe on a board game?Since the board itself has nine spaces, I recommend that you have nine for both X's and O's...