SAIL: Search-Augmented Instruction Learning

Large language models (LLMs) have been significantly improved by instruction fine-tuning, but they still lack transparency and the ability to utilize up-to-date knowledge and information. In this work, we propose search-augmented instruction learning (SAIL), which grounds language generation and instruction following on complex search results produced by in-house and external search engines. Starting from an instruction tuning corpus, we collect search results for each training case from different search APIs and domains, and construct a new search-grounded training set containing (instruction, grounding information, response) triplets. We then fine-tune the LLaMA-7B model on the constructed training set. Since the collected results contain unrelated and disputing passages, the model must learn to ground its generation on trustworthy search results, filter out distracting passages, and generate the target response. This search-result denoising process entails explicit selection of trustworthy information and multi-hop reasoning, since the retrieved passages might be informative yet not contain the instruction-following answer. Experiments show that the fine-tuned SAIL-7B model has a strong instruction-following ability, and it performs significantly better on transparency-sensitive tasks, including open-ended question answering and fact checking.


Introduction
Large language models (LLMs) have demonstrated many impressive capabilities, including zero-shot inference and few-shot in-context learning. Recent research has shown that LLMs benefit from instruction tuning (Ouyang et al., 2022), and that such instruction-tuned LLMs significantly outperform plain LLMs on zero-shot language tasks (Peng et al., 2023). Instruction-tuned LLMs have shown the ability to generate both natural and programming languages following natural language guidance and requests. To achieve the same goal, a plain pretrained LLM needs a number of annotated examples as in-context learning prompts.
Despite their impressive behavior, LLMs suffer from a number of issues, including obsolescence and a lack of transparency. Understandably, LLMs are trained on corpora constructed up to a certain point in time. With such a fixed, pretrained or fine-tuned model, information that appears after that point cannot surface in any informed generation by the LLM. One way to update the knowledge in LLMs is to re-train the entire model on an updated training corpus. However, this would be costly and time-consuming.
In terms of transparency, the predictions of LLMs are opaque because the generations are not grounded on trustworthy sources. It is possible for an LLM to generate undesirable language that looks like human-generated text, including misinformation, stereotypes, and toxic language (Hartvigsen et al., 2022). Without providing legitimate sources for LLM-generated text, it is difficult to catch and avoid these undesirable LLM behaviors.
To overcome these difficulties, a straightforward solution is to connect LLMs to information retrieval systems, especially commercial search engines. By doing so, the LLM can ground its predictions on information retrieved from an up-to-date knowledge base, and the sources of the generations are transparent to users. Before LLMs became large enough to memorize a significant amount of world knowledge, retrieval-based grounding had been heavily studied for open-domain question answering (Chen et al., 2017; Kwiatkowski et al., 2019; Guu et al., 2020). Recent LLMs have also shown the potential of using information retrieval tools, e.g., Toolformer (Schick et al., 2023) and the ChatGPT (OpenAI, 2022) retrieval plugin. However, a challenge remains: is there a trustworthy retrieval model and knowledge base that can be utilized by LLMs?
Figure 1: Fact checking grounded on complicated search results with SAIL-7B and strong commercial language models (GPT-4, GPT-3.5-Turbo, Perplexity.AI). The claim to be checked is "However the warming trend is slower than most climate models have forecast." (label: UNFACTUAL). The first and third retrieved passages are distracting since they do not contain information that supports or refutes the claim, while the second passage disagrees with the claim. SAIL-7B successfully makes the correct prediction, while the other commercial LLMs are distracted.
Existing studies on open-domain question answering have chosen Wikipedia as the de facto knowledge base that contains the answer to most questions. However, it has been found that the knowledge contained in Wikipedia is not sufficiently up-to-date nor complete for many tasks that require the latest knowledge, so grounding on Wikipedia might lead to worse answers than fully relying on LLMs. Another option is to leverage an internet search engine such as Google, Bing, or DuckDuckGo.com (a free, privacy-focused, zero-tracking search engine).
Although widely used commercial search engines can index and retrieve a vast range of up-to-date information, their retrieval accuracy is ultimately limited, and third-party users cannot control their performance at the model level. As a result, retrieval results can be noisy, and unrelated information might be shown to users. This behavior suggests a trade-off between deploying in-house retrieval systems and relying on external search engines. Although it is possible to prompt LLMs to directly use the retrieval results, distracting search results can mislead the model and negatively influence its performance. As shown in Figure 1, ChatGPT is confused by a distracting passage and generates an incorrect fact check.
The challenges mentioned above are contradictory, and both have a negative impact on grounded language modeling with current LLMs: static knowledge bases and in-house retrievers are not sufficient or up-to-date for all tasks, while commercial search engines often return distracting results. To address these challenges simultaneously, we propose a search-augmented instruction learning (SAIL) model. Given input instructions and contexts, the model is trained to generate high-quality responses that follow the instruction while grounding on the noisy search results. In other words, the model learns to denoise the retrieval results in order to generate high-quality responses.
In summary, we make the following contributions in this work. By evaluating models on instruction following, question answering, and language-checking tasks, we find that the SAIL-7B model has a strong instruction-following ability and is robust against distracting grounding search results generated by different retrieval models. In addition, the SAIL model achieves performance comparable to state-of-the-art instruction-following LLMs.

Search Result Collection
In this work, we use the 52k self-instruction corpus created by the Alpaca team (Taori et al., 2023) and the corresponding responses generated by GPT-4 (Peng et al., 2023). For each instruction, we construct a search query by concatenating the instruction and the input, if any, and truncating the query to at most 60 words to satisfy the query-length limit of the search engine.
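The query-construction step can be illustrated with a short sketch. The field names below follow the Alpaca record layout ("instruction", "input"), which is an assumption about the data format rather than the authors' released code.

```python
def build_search_query(example: dict, max_words: int = 60) -> str:
    """Concatenate the instruction and the (optional) input, then truncate to 60 words."""
    parts = [example["instruction"].strip()]
    if example.get("input"):                     # many Alpaca cases have an empty input field
        parts.append(example["input"].strip())
    query = " ".join(parts)
    return " ".join(query.split()[:max_words])   # word-level truncation for the search API

# Example usage (hypothetical record):
# build_search_query({"instruction": "Summarize the article.", "input": "The IPCC report says ..."})
```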
The constructed queries are fed into the DuckDuckGo search engine and the BM25 Wikipedia retriever, and the top three search results are retained. Each result consists of three fields: the title, a short piece of preview text, and the URL of the corresponding webpage. For simplicity, we do not further scrape the retrieved webpages, but only use the title and preview text for further processing.
Each training example is paired with its own set of search results. We pool the top-three DuckDuckGo and top-two BM25 search passages, for a total of five search results. From this pool, we randomly sample zero, one, two, or three search results with 20%, 20%, 20%, and 40% probability, respectively. Given this randomness, some training cases may be associated with search results from a single source.
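A minimal sketch of this pooling-and-sampling scheme is given below; the function and variable names are ours, and it assumes the candidate passages arrive already ranked by the respective retriever.

```python
import random

def sample_grounding(ddg_results: list, bm25_results: list) -> list:
    """Pool top-3 DuckDuckGo and top-2 BM25 passages, then keep 0-3 of them at random."""
    pool = ddg_results[:3] + bm25_results[:2]                      # five candidate passages in total
    k = random.choices([0, 1, 2, 3], weights=[0.2, 0.2, 0.2, 0.4])[0]
    return random.sample(pool, k=min(k, len(pool)))                # prompt ordering is decided later
```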

In-context Retrieval Selection
To encourage the LLM to focus on trustworthy and informative search results, we prepend a search filtering sequence to each annotated response, for example, "Search result (1) is informative and search result (2) is distracting, so I will use the information from search result (1)." However, the trustworthiness of each search result is not labeled, and the number of retrieved items is large. To solve this problem, we employ the entailment classification model proposed by Luo and Glass (2023). We feed each retrieved passage and the corresponding response into the entailment model and compare the entailment and contradiction scores. While most predictions are neutral with respect to the response, the relation between the entailment and contradiction scores can roughly indicate whether a retrieved passage provides useful information for generating the target response. As a result, we label "search result (i) is informative" if the entailment score is higher than the contradiction score; otherwise, the search item is labeled distracting. With the constructed label responses, the SAIL-7B model can generate in-context search selection sequences as shown in Figure 1.
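The labeling heuristic can be sketched as follows. The paper uses the entailment classifier of Luo and Glass (2023); the generic MNLI checkpoint below is a stand-in we assume for illustration, not the authors' exact model.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# A generic NLI model stands in for the entailment classifier of Luo and Glass (2023).
tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

def label_search_result(passage: str, response: str) -> str:
    """Label a retrieved passage as informative or distracting w.r.t. the target response."""
    inputs = tokenizer(passage, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(dim=-1)[0]
    # Label order for roberta-large-mnli: 0 = contradiction, 1 = neutral, 2 = entailment.
    # Most pairs are neutral, so we only compare the entailment and contradiction scores.
    return "informative" if probs[2] > probs[0] else "distracting"
```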

Fine-tuning
After collecting the search results and generating in-context retrieval selection sequences, we construct input prompts following Figure 2 (b) with the GPT-4-generated responses (Peng et al., 2023). Note that the most relevant retrieval result is placed closest to the instruction so that the model can better use its information. We fine-tune the LLaMA-7B model with the constructed prompts to generate both the in-context retrieval selection sequences and the annotated responses.
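The prompt assembly can be sketched as below. The Alpaca-style preamble and section markers are an assumption about the exact template wording; what the paper specifies is that the most relevant search result is placed closest to the instruction.

```python
def build_prompt(instruction: str, search_results: list, model_input: str = "") -> str:
    """Assemble a search-grounded prompt; search_results is ordered most relevant first."""
    lines = ["Below is an instruction that describes a task. "
             "Write a response that appropriately completes the request.", ""]
    # Emit least relevant first so the most relevant passage sits right above the instruction.
    n = len(search_results)
    for rank, passage in zip(range(n, 0, -1), reversed(search_results)):
        lines.append(f"Search result ({rank}): {passage}")
    lines += ["", f"### Instruction:\n{instruction}"]
    if model_input:
        lines.append(f"### Input:\n{model_input}")
    lines.append("### Response:")
    return "\n".join(lines)
```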
In practice, the models are fine-tuned on academic hardware. Specifically, we use 4 × NVIDIA RTX A6000 GPUs (48GB × 4) to train the models for 3 epochs. We apply mixed-precision training (fp16) with the standard AdamW optimizer. We set the maximum sequence length to 1,600 and the batch size to 32. Following Vicuna, we apply gradient checkpointing to reduce memory cost. The entire fine-tuning process takes 24 hours (24 × 4 = 96 GPU hours). To enable the fine-tuning, we apply gradient offloading with DeepSpeed and fully sharded data parallel (FSDP) (Paszke et al., 2019).
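For concreteness, a sketch of these hyperparameters expressed with the Hugging Face Trainer API is given below; the per-device batch split, learning rate, and FSDP wiring are assumptions, since only the aggregate settings are reported above.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="sail-7b",
    num_train_epochs=3,                    # 3 epochs
    per_device_train_batch_size=2,         # assumed split: 2 x 4 GPUs x 4 accumulation steps = 32
    gradient_accumulation_steps=4,
    learning_rate=2e-5,                    # assumed; not reported in this section
    fp16=True,                             # mixed-precision training
    optim="adamw_torch",                   # standard AdamW optimizer
    gradient_checkpointing=True,           # reduce activation memory, following Vicuna
    fsdp="full_shard auto_wrap",           # fully sharded data parallel across the 4 GPUs
)
# The 1,600-token maximum sequence length is enforced at tokenization time, not here.
```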

Evaluation
SAIL for instruction following. Following Peng et al. (2023), we evaluate the instruction-following quality of different models by comparing their responses with GPT-4 responses on the same set of instructions and scoring them with GPT-4.
For each case, we construct an evaluation prompt by concatenating the instruction, the GPT-4 response, and the response of the target model. We feed the evaluation prompt to GPT-4 and ask it to score the two responses between 0 and 10. We use the Vicuna-Instructions-80 corpus (Chiang et al., 2023), which contains 80 questions, to evaluate all models, and we calculate the total score a model receives across all questions. We use the evaluation prompt authored by the Vicuna team. The highest possible score is 80 × 10 = 800. It is worth noting that GPT-4 responses can receive slightly different scores against different counterparts. To normalize this difference, we calculate the ratio of the model's score to the GPT-4 score for each test case as the final assessment, as implemented in Peng et al. (2023).
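A minimal sketch of this normalization is given below; averaging per-case ratios (rather than taking a ratio of totals) is our reading of the description above, so treat the aggregation as an assumption.

```python
def relative_score(model_scores: list, gpt4_scores: list) -> float:
    """Average per-question ratio of the candidate model's score to GPT-4's score, in percent."""
    ratios = [m / g for m, g in zip(model_scores, gpt4_scores) if g > 0]
    return 100.0 * sum(ratios) / len(ratios)

# Example: relative_score([8, 9], [9, 9]) -> ~94.4 (% of GPT-4 performance)
```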
SAIL for Question Answering. Besides evaluating the quality of instruction-guided generations, we also assess the models' ability to answer commonsense questions. We test the models in two different settings: instructed zero-shot prediction and search-augmented prediction. We evaluate model performance on the CommonsenseQA (CSQA; Talmor et al., 2019), OpenbookQA (OBQA; Mihaylov et al., 2018), and ARC-Challenge benchmarks. All three tasks require answering open-ended questions by selecting from a given set of candidate answers. Through the question-answering experiments, we show that instruction-tuned language models can be significantly biased by noisy search results.
SAIL for Fact and Fairness Checking. With recent advances in LLMs that generate human-like language without guaranteed alignment, human- and machine-generated misinformation, stereotypes, and toxicity have become timely and significant concerns. Recent studies have shown that, with appropriate instructions and prompts, LLMs can perform unified fact and fairness checking. However, other attempts have relied only on LLMs, without grounding on any external sources, which reduces the trustworthiness and transparency of the checking results.
In this work, we evaluate instructed fact and fairness checking with the UniLC benchmark, including the Climate-Fever, PubHealth, Hate Speech Detection, and Social Bias Frames (SBIC) tasks, under two different settings: zero-shot and search-augmented. While we are not aware of what corpora were used to train GPT-4 and ChatGPT, we assess the language-checking performance of Vicuna-7B-v1.1, Vicuna-13B-v1.1, and SAIL-7B with and without search results.

Instruction Following
Automatic Evaluation with GPT-4. We compare the performance of different models under end-to-end and search-grounded settings against the GPT-4 and ChatGPT models. The scoring results are shown in Figure 3.
Comparing against GPT-4, we find that the search-augmented SAIL-7B model significantly outperforms all other models (90% vs. <85%) while using fewer training instructions and parameters, including strong baselines such as Vicuna-13B and GPT-3.5-Turbo-powered ChatGPT. This indicates that when grounding information is provided, the model does not need as many parameters to memorize knowledge. In addition, the SAIL-7B model achieves high performance even without search results, showing that its performance is stable under different generation settings. Similar conclusions can be drawn by comparing all models against ChatGPT. While GPT-4 is still better, the results show that the search-augmented SAIL-7B model achieves 103% of ChatGPT's performance and the no-augmentation SAIL model achieves 98%, outperforming several strong baselines, including LLaMA tuned on GPT-4 instructions and Vicuna models with the same number of parameters. Besides GPT-4, search-augmented SAIL-7B is the only model that outperforms ChatGPT in both experiments.
In addition, we found that search augmentation makes a significantly larger positive contribution to the SAIL model than to all other models. With ChatGPT, feeding search-augmented prompts with instructions leads to only very slight improvements in both evaluations. However, grounding on search results can hurt the performance of Vicuna and LLaMA-GPT4 models of different sizes. Compared against GPT-4, Vicuna-13B is slightly improved by the search results, but the improvement disappears when compared against ChatGPT. For the Vicuna-7B and LLaMA-7B-GPT4 baselines, augmenting input prompts with search engine outputs has a significant negative impact on both evaluations. On the other hand, applying search augmentation to SAIL-7B significantly improves model performance in both experiments (84% to 90% and 98% to 103%). These results inform our findings:
• The search results contain useful information that can improve the performance of instruction-following language models.
• Without search-augmented fine-tuning, it is difficult for a language model to utilize valuable information among the complicated search results, and distracting retrieval results can mislead the generations.
• Search-augmented instruction learning can help the model better utilize the valuable information among noisy search results and improve instruction-following performance.
Data Statistics. We first examine the word preferences of different models on the 80 unseen instructions. The results are shown in Figure 4. We compare the distributions of the top-10 verbs generated by the GPT-4, GPT-3.5-Turbo (ChatGPT), Vicuna-7B-v1.1, and SAIL-7B models. With search augmentation, SAIL-7B generates significantly more verbs that do not overlap with the GPT models' generations, as shown in Table 1. Only two of the top-10 verbs generated by Vicuna are not covered by GPT-4 and ChatGPT, while six out of the ten verbs generated by SAIL-7B are not high-frequency verbs of the GPT models. This indicates that grounding on search results can shift the generation preferences of the language models. The statistics of the generated responses are shown in Table 2.
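The verb-preference statistics can be reproduced with a short script like the one below; the choice of spaCy as the POS tagger is our assumption, since the tool used for tagging is not named above.

```python
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

def top_verbs(responses: list, k: int = 10) -> list:
    """Return the k most frequent verb lemmas across a model's generated responses."""
    counts = Counter(
        token.lemma_.lower()
        for doc in nlp.pipe(responses)
        for token in doc
        if token.pos_ == "VERB"
    )
    return [verb for verb, _ in counts.most_common(k)]
```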

Question Answering
The experimental results for question answering are shown in Table 3. CSQA, OBQA, and ARC-Challenge are open-ended, selection-based question-answering tasks. We compare the instruction-tuned Vicuna-7B, Vicuna-13B, LLaMA-7B-GPT4, and SAIL-7B models under no-augmentation and search-grounded settings with different retrieval sources. All evaluations are zero-shot and instruction-guided. Traditionally, a knowledgeable LLM can answer questions and select the most coherent and appropriate answer without external information. In each task, we want to evaluate the performance of different models and knowledge bases. We search Wikipedia (Wiki) with the BM25 retriever and the web with DuckDuckGo (DDG), feeding the LLMs the top-3 search results, which may contain unrelated and distracting information.
In general, we find that DuckDuckGo (DDG) leads to better performance for all models on all tasks because it is more flexible, covering a much wider range of information. This suggests the effectiveness of search engines over retrieval from a static knowledge base. We find that both the LLaMA and Vicuna-7B models can be slightly improved when search results are provided on most tasks. However, their overall performance is limited: the average accuracy of search-augmented LLaMA-7B and Vicuna-7B is below 50%.
With Vicuna-13B, a roughly two-times-larger model, we obtain the best average performance (51.0%) on the three tasks without grounding information. However, adding search results hurts its accuracy in most experiments. While augmenting the model with DDG search results slightly improves the performance on CSQA and OBQA, the accuracy on ARC-Challenge decreases by 1.4%. With BM25-based Wikipedia search results, the accuracy can decrease by as much as 1.8%. While the Vicuna-13B model achieves strong non-augmented performance, it is challenging to further improve its accuracy by utilizing helpful information in the search results.
In contrast, the SAIL-7B model improves on all tasks when incorporating search results, and it also achieves strong non-augmented performance. Without retrieval results, SAIL-7B significantly outperforms LLaMA and Vicuna-7B on all tasks by a large margin (49.5% vs. 44.5% and 40.9% average accuracy). It also performs slightly better than Vicuna-13B on the CSQA and OBQA tasks, while Vicuna-13B remains strongest on ARC-Challenge. While search augmentation leads to at most a 0.5% improvement for Vicuna-13B, DDG search results improve SAIL-7B by 2.8% on OBQA and 1.2% on average, showing that the SAIL-7B model can steadily utilize the helpful information among the search results. As a result, the search-augmented SAIL-7B model achieves the best performance on both CSQA and OBQA.

Fact and Fairness Checking
The other task on which we evaluate model performance is unified fact and fairness checking, a combined benchmark with four sub-tasks covering fact checking (Diggelmann et al., 2020; Kotonya and Toni, 2020), hate speech detection (de Gibert et al., 2018), and stereotype recognition (Sap et al., 2020). We evaluate zero-shot performance on all four tasks, and the results are shown in Table 4. The SAIL-7B model achieves the highest accuracy and F1 scores on all tasks, despite no grounding information being provided for the fact-checking tasks. We also found that the Vicuna-7B and 13B models perform similarly on fact and fairness checking. For the fact-checking tasks, we further evaluate performance when grounding on search results generated by DuckDuckGo. Grounding on an external search engine has both advantages and disadvantages. Many fact-checking benchmarks provide task-specific grounding corpora that limit the domain of information retrieval. However, internet misinformation can be very arbitrary and tied to the latest facts. A commercial search engine can capture a wide range of up-to-date information that a retrieval model with a fixed knowledge base cannot. However, search engines are usually less accurate than dense retrievers, and they might retrieve disputed documents that affect the quality of fact checking. Our experiments show that the search results are not helpful for all models. On Climate-Fever, augmenting the model with search results decreases LLaMA's overall accuracy by 3%. On the PubHealth task, both the accuracy and F1 of the Vicuna-13B model are decreased by the search results, by 4% and 1% respectively. This shows that the search results contain distracting information, which prevents the models from utilizing the helpful evidence amid the noise.
However, SAIL is more robust against distracting languages and its fact-checking performance is improved on the same set of search results, as shown in Table 5. With search augmentation, the fact-checking accuracy and F1 scores of SAIL are improved on both tasks, as high as 4.2% on Climate-Fever. The augmented SAIL model also significantly outperforms all baselines, including Vicuna-13B and LLaMA-7B tuned with GPT-4 responses by 9% accuracy and 5% F1, showing the effectiveness of search augmented fine-tuning.
Instruction following. Pretrained LLMs can generate text following certain formats and rules after seeing a few examples in their prompts. To make LLMs more scalable and improve zero-shot performance, Ouyang et al. (2022) proposed training GPT-3 with instruction-response corpora. As a result, InstructGPT, ChatGPT, and GPT-4 can handle a wide range of tasks without seeing any examples. Recent research has also found that both GPT-generated instructions and instruction-following outputs (Peng et al., 2023) can improve the instruction-following ability of LLMs. Wang et al. (2022a) proposed a semi-supervised method to generate diverse instructions from a seed instruction set built on NLP tasks (Mishra et al., 2022). A more recent study shows that GPT-4 (OpenAI, 2023) can generate high-quality instruction-following language. Recent efforts to open-source instruction-following LLMs include Alpaca (Taori et al., 2023) and Vicuna (Chiang et al., 2023).
Retrieval-augmented language models. Prior to our work, several initiatives explored retrieval-augmented language models (RALMs). The pioneering approaches, REALM (Guu et al., 2020) and RAG (Lewis et al., 2020), sought to train language models with retrievers in an end-to-end manner. RETRO (Borgeaud et al., 2022) introduced the idea of training an LM on top of a frozen retriever. Atlas (Izacard et al., 2022) further explored dedicated loss functions for the end-to-end training of the retriever and the LM, achieving superior performance on several few-shot learning tasks. Recently, RePlug (Shi et al., 2023) and In-context RALM (Ram et al., 2023) instead explore the opposite direction: using a frozen black-box LM while fine-tuning the retrieval modules. RePlug shows its advantage in leveraging large LMs like Codex (Chen et al., 2021) and GPT-3 (Brown et al., 2020b), outperforming Atlas on few-shot question-answering tasks.
Despite the success of RALMs, most of these models have limitations, including 1) constraining the search space to a closed corpus like Wikipedia, 2) lacking explicit mechanisms for disregarding distracting search results, and 3) applying a few-shot in-context learning setting without considering instruction fine-tuning during RALM training. Consequently, their applications remain relatively narrow, primarily focusing on tasks such as question answering and language modeling. SAIL addresses these limitations by 1) employing real-world search engines, 2) introducing a search result denoising process capable of filtering out distracting information, and 3) incorporating instruction fine-tuning. Consequently, SAIL demonstrates its superiority in broader applications, including instruction following for chatbots and fact and fairness checking, all of which benefit from access to up-to-date information retrieved from real-world search engines.

Trustworthiness
Self-improving. Recent studies have found that both pretrained and instruction fine-tuned LLMs can improve themselves with appropriate prompting strategies. Compared to directly generating the answers, the step-by-step, chain-of-thought (Wei et al., 2022b) generation strategy significantly improves the reasoning accuracy. Furthermore, self-consistent predictions are usually more trustworthy (Wang et al., 2022a). Huang et al. (2022) showed that self-consistent predictions generated by LLMs can be used as in-context examples that significantly improve task and domain adaptation. After instruction fine-tuning, language models can generate suggestions to improve their own outputs with self-reflection and self-refinement prompting strategies (Shinn et al., 2023;Madaan et al., 2023).
Fact and fairness checking. Aside from the ability to generate correct responses, we believe that LLMs should take responsibility for checking undesirable and harmful language generated by both machines and humans. Manakul et al. (2023) found that the GPT-3 model can identify its own hallucinations, and a unified fact and fairness checking framework has been proposed for both human- and machine-generated language.

Conclusion
In this work, we found that disputed and distracting search results can significantly mislead the predictions of large language models. Several transparency-sensitive tasks, including open-domain question answering and language checking, can be negatively influenced by this phenomenon. To solve this problem, we propose a search-augmented instruction-following large language model with 7B parameters. We construct the first search-augmented instruction-tuning corpus, consisting of human-generated instructions, GPT-4-generated responses, and search results produced by a Wikipedia-based BM25 retriever and a commercial search engine. We then fine-tune the LLaMA-7B language model on the constructed training corpus using academic computational resources. Experiments on instruction following, question answering, and fact/fairness checking show that the search-augmented language model can distill trustworthy and helpful information from the search results and generate high-quality responses, improving both the performance and transparency of instruction-following large language models.

Limitations
While the proposed model achieves high performance with an efficient model configuration, its major limitation is that it does not explain why a given search result is judged trustworthy and informative or distracting. In future work, we will fine-tune larger models and enable them to recognize trustworthy search results with explanations.