RaLLe: A Framework for Developing and Evaluating Retrieval-Augmented Large Language Models

Retrieval-augmented large language models (R-LLMs) combine pre-trained large language models (LLMs) with information retrieval systems to improve the accuracy of factual question-answering. However, current libraries for building R-LLMs provide high-level abstractions without sufficient transparency for evaluating and optimizing prompts within specific inference processes such as retrieval and generation. To address this gap, we present RaLLe, an open-source framework designed to facilitate the development, evaluation, and optimization of R-LLMs for knowledge-intensive tasks. With RaLLe, developers can easily build and evaluate R-LLMs: they can refine hand-crafted prompts, assess individual inference processes, and quantitatively measure overall system performance. By leveraging these features, developers can enhance the performance and accuracy of their R-LLMs in knowledge-intensive generation tasks. We open-source our code at https://github.com/yhoshi3/RaLLe.

In comparison to closed-book settings, where language models generate answers without retrieval, R-LLMs (open-book settings) enable the retrieval of relevant information from external databases or corpora (Mialon et al., 2023), which has led to improved accuracy in open-domain QA (Shi et al., 2023). Additionally, R-LLMs can acquire extended capabilities even without additional training, such as providing explicit references, reducing fact hallucination (Nakano et al., 2021), and easily updating the knowledge source (e.g., Guu et al., 2020).
Retrieval-augmented generation needs further research and development to reach its full potential. For example, even though the retriever-reader system has been trained on the Natural Questions (NQ) dataset (Kwiatkowski et al., 2019), its F1 score on the short answer task is 68.3, still lagging behind the oracle F1 score of 75.7 (Asai and Choi, 2021). This implies that further improvements can be made to the retrieval-augmented generation approach. Additionally, users are probably aware that the outputs generated by R-LLMs may contain factual errors, particularly when applied to knowledge-intensive tasks. However, there is currently a lack of accessible evaluation frameworks to assess their output quality, which makes it difficult to identify areas for improvement.
Furthermore, having effective tools for developing R-LLMs is crucial. These tools should enable the design of inference steps such as retrieve-then-generate, the selection of combinations of retrievers and LLMs, the evaluation of the entire system, and the testing of the prompts used in each inference step. Currently available tools, such as the ChatGPT Retrieval Plugin, Guidance, and LangChain (Chase, 2023), offer a high degree of abstraction, making it challenging to verify the functionality of individual inference steps or optimize prompts within each step. This lack of transparency might hinder the optimization of R-LLMs.
In this paper, we propose RaLLe, an accessible framework for Retrieval-augmented Large Language model development and Evaluation. We also present evaluation results for several R-LLMs that we constructed using open-source retrievers and LLMs. To the best of our knowledge, RaLLe is the first framework that empowers R-LLM developers and open-domain QA researchers to efficiently develop, evaluate, and improve R-LLMs using objective metrics. RaLLe offers several key benefits:
1. Easy development and testing: users can easily select, combine, and test various retrievers and LLMs, especially open-source models, within a graphical interface.
2. Objective evaluation of R-LLMs: RaLLe provides reproducible experiments with standard benchmarks and metrics, enabling objective assessment of R-LLM performance.
3. Transparent prompt engineering: all inputs (prompts) and outputs of each action are visible to developers, allowing for easy exploration and optimization of the prompts.

RaLLe Usage
Figure 1 presents an overview of the key features of the proposed framework. The primary development process involves three stages: (1) embedding and indexing the knowledge source documents, (2) designing an inference chain consisting of an R-LLM with customized prompt templates for each action, and (3) benchmarking the developed R-LLM.

Document Embedding and Indexing
To begin, the knowledge source documents can be encoded using an arbitrary encoder model, such as a sparse or dense retriever. For efficient indexing of dense embeddings, several methods are available by default, including Faiss (Johnson et al., 2019), HNSW (Malkov and Yashunin, 2020), and DiskANN (Jayaram Subramanya et al., 2019). By default, an HNSW index is constructed with ef_construction = 128 (the size of the dynamic list for the nearest neighbors) and m = 32 (the number of links created for every new element during graph construction).

Chain Construction
Once document embedding and indexing are completed, the retrievers (and the corresponding indices) and LLMs can be loaded via the Gradio-based GUI (Abid et al., 2019) to establish an inference chain that comprises an R-LLM. This chain of actions enables users to design a pipeline for multi-step inference, such as [retrieve]-[generate], or more intricate workflows such as the [rewrite query]-[retrieve]-[generate] chain proposed in Ma et al. (2023). The versatility of this feature is especially beneficial for creating chains tailored to specific use cases.
A single-action chain can function as either a simple retriever that returns the retrieved documents, or a closed-book QA system that leverages the parametric knowledge of an LLM to provide answers without retrieval. In contrast, a chain with multiple actions that include retrieval enables retrieval-augmented generation, or open-book QA, allowing an LLM to access external documents relevant to a question. Our default setup for R-LLMs consists of two actions: retrieve and generate.
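The default two-action chain can be sketched in plain Python. The retriever and LLM below are toy stand-ins for illustration, not RaLLe's actual components or API:

```python
# Sketch of a two-action [retrieve]-[generate] chain.

def _tokens(text):
    # crude tokenization: lowercase, strip basic punctuation
    return set(text.lower().replace(".", " ").replace("?", " ").split())

def retrieve(query, corpus, k=2):
    """Toy retriever: rank passages by word overlap with the query."""
    q = _tokens(query)
    return sorted(corpus, key=lambda p: -len(q & _tokens(p)))[:k]

def generate(prompt):
    """Stand-in for an LLM call; a real chain would query the model here."""
    return f"[LLM response to prompt of {len(prompt)} chars]"

def run_chain(question, corpus):
    docs = retrieve(question, corpus)          # action 1: retrieve
    context = "\n".join(docs)
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return generate(prompt)                    # action 2: generate

corpus = ["Paris is the capital of France.",
          "The Eiffel Tower is in Paris.",
          "Berlin is the capital of Germany."]
print(run_chain("What is the capital of France?", corpus))
```

A single-action chain corresponds to calling `retrieve` or `generate` alone; longer chains insert further steps (e.g., query rewriting) before retrieval.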

Prompt Engineering
The RaLLe framework allows developers to interactively craft customized prompt templates for LLMs, and even for search queries, on a per-chain basis. Each action can be executed independently, enabling precise control over LLM responses, such as specifying the desired output format or suppressing undesirable hallucinations. To enhance the versatility of prompt development, RaLLe supports f-strings and the eval() function in Python.
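For instance, a template stored as a plain string can be interpreted by applying eval() to an f-string. The sketch below illustrates the mechanism only; the variable names (`docs`, `question`) are our assumptions, not RaLLe's actual template schema:

```python
# Sketch of f-string-based prompt templating via eval().
# Variable names (docs, question) are illustrative placeholders.

template = ("f'''Answer the question using the documents.\n"
            "Documents: {\" / \".join(docs)}\n"
            "Question: {question}\nAnswer:'''")

docs = ["Paris is the capital of France.", "The Eiffel Tower is in Paris."]
question = "What is the capital of France?"

# eval() interprets the stored string as an f-string against local variables
prompt = eval(template)
print(prompt)
```

Because the template is an ordinary string until evaluation, it can be edited, stored in a configuration file, and re-interpreted against new inputs without code changes.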

Experiment Tracking
We utilize MLflow (LF Projects, 2023) to track the experiments, along with their associated configuration files and prompt templates. This allows us to compare the performance of different experiment runs objectively, which in turn enables us to develop better R-LLMs.

Chat AI
RaLLe also provides support for building a simple chat interface. This enables users to test best practices from the development and evaluation stages in a practical setting.

Experimental Settings
In this section, we evaluate the performance of R-LLMs constructed with several combinations of open-source retrievers and LLMs on knowledge-intensive tasks.

Tasks and Datasets
We employ the KILT (Knowledge Intensive Language Tasks) benchmark (Petroni et al., 2021), an extensive benchmark that encompasses 11 datasets across five knowledge-intensive natural language processing tasks: fact checking, entity linking, slot filling, open-domain question answering, and dialogue (for further details of KILT, see Petroni et al. (2021)). We use the training sets for developing prompts and the development sets for evaluation.
As the knowledge source, we utilize the pre-processed Wikipedia passages provided by KILT. The passages are derived from English Wikipedia articles based on the 2019/08/01 Wikipedia dump, consisting of a total of 5.9 million articles and 22.2 million 100-word passages. For both dense and sparse retrievers, we use the set of 100-word passages after additional pre-processing that prepends the title of the article to each passage.
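The pre-processing described above can be sketched as follows. The `"Title: "` separator is an assumption for illustration; KILT's exact concatenation format may differ:

```python
# Sketch: split an article into 100-word passages and prepend the article
# title to each passage, mirroring the pre-processing described in the text.

def split_passages(title, text, words_per_passage=100):
    words = text.split()
    return [f"{title}: " + " ".join(words[i:i + words_per_passage])
            for i in range(0, len(words), words_per_passage)]
```

Prepending the title gives the retriever and the downstream LLM the page context of every passage, which matters when a 100-word chunk alone is ambiguous.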

Models
This subsection details the retrievers and LLMs employed to build R-LLMs in our experiments. RaLLe allows practitioners and researchers to easily experiment with the most recent models available in open-source repositories. With the exception of BM25, all models are available from Hugging Face (Wolf et al., 2020) (see Appendix A.9 for a summary).

LLMs
The LLM used within an R-LLM must comprehend instructions provided in a prompt and generate appropriate responses based on the given information. To achieve this, we use instruction-tuned LLMs with the temperature parameter set to zero for optimal performance and reproducibility.

Retrievers
We experiment with both sparse and dense retrievers for document retrieval. Specifically, we select dense retrievers that have achieved high accuracy on the retrieval task of the Massive Text Embedding Benchmark (MTEB) (Muennighoff et al., 2023) leaderboard as of July 2023. A list of the retrievers used in our study can be found in Table 1. In the open-book experiments, the top-5 most relevant documents are retrieved. As metrics of retrieval performance, we follow Petroni et al. (2021) and use page-level R-precision (Craswell, 2016) and recall@5. Page-level R-precision is the percentage of the R gold pages in each provenance set that appear among the top-R retrieved pages. R-precision is equivalent to Precision@1 except on FEVER and HotpotQA (multi-hop datasets).
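These metrics can be computed as follows. This is a sketch of one common formulation, assuming gold provenance is given as a flat list of page identifiers:

```python
def r_precision(retrieved, gold):
    """Page-level R-precision: with R gold pages, the fraction of the
    top-R retrieved pages that are gold."""
    r = len(gold)
    return len(set(retrieved[:r]) & set(gold)) / r

def recall_at_k(retrieved, gold, k=5):
    """Fraction of gold pages found within the top-k retrieved pages."""
    return len(set(retrieved[:k]) & set(gold)) / len(gold)

# With a single gold page (R = 1), R-precision reduces to Precision@1:
print(r_precision(["p1", "p2"], ["p1"]))  # 1.0
```

On multi-hop datasets such as FEVER and HotpotQA, a query can have several gold pages (R > 1), which is where the two metrics diverge from Precision@1.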
BM25 (Robertson and Zaragoza, 2009) is a bag-of-words retrieval function based on term matching. We use the Pyserini (Lin et al., 2021) implementation of unigram BM25 with the default parameters of k1 = 0.9 (term frequency scaling) and b = 0.4 (document length normalization). The documents for BM25 retrieval are the same 100-word passages as those used for the dense retrievers.
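For reference, the Okapi BM25 scoring function with these defaults can be sketched in pure Python. This is a didactic implementation of the formula, not the Pyserini code:

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=0.9, b=0.4):
    """Okapi BM25 score of one tokenized document for a query,
    using the Pyserini-style defaults k1 = 0.9 and b = 0.4."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N       # average document length
    tf = Counter(doc_terms)
    score = 0.0
    for t in query_terms:
        df = sum(1 for d in corpus if t in d)     # document frequency of t
        if df == 0:
            continue
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        f = tf[t]
        # term frequency saturation (k1) and length normalization (b)
        score += idf * f * (k1 + 1) / (
            f + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score
```

The small b = 0.4 makes sense for fixed-size 100-word passages, where document length carries little signal.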

Prompts
We utilize custom-designed prompt templates that are specifically crafted for each dataset in KILT. RaLLe accepts templates in non-natural language formats, such as f-strings and eval() functions in Python. This allows developers to carefully craft their prompt templates for optimal performance. The prompt templates used in our experiments are shown in Appendix A.10.
For the entity linking tasks of KILT (AY2, WnWi, and WnCw), we employ a REWRITE-EL template by default for search queries. This template extracts the specific entity mention being questioned as a query, as employing the entire span of a question is unlikely to find relevant documents (discussed in Section 4.3). After retrieving the relevant documents, the top-1 Wikipedia title is output as the answer. As a result, the downstream accuracy in entity linking tasks is not affected by the number of retrieved documents (as long as at least one is retrieved).

KILT Benchmark Results
This section provides the downstream and retrieval performance of the R-LLMs developed and evaluated using RaLLe.

Baseline
We compare our results with those of the BART-large model (Lewis et al., 2020a) for the closed-book setting and the RAG model (Lewis et al., 2020b) for the open-book setting, as presented in Petroni et al. (2021). Notably, these baseline models were specifically fine-tuned on the KILT benchmark, whereas our chosen LLMs and constructed R-LLMs were not. See also Appendix A.5 for additional information about the baselines.

Downstream Performance
We summarize the downstream performance in Table 2. RaLLe also reports the has_answer percentage for short answers, a proxy metric that measures the proportion of questions for which the final output generated by an R-LLM contains a gold answer (see Appendix A.2 for more details).
Our constructed R-LLM (e5 + Llama2-70B) surpasses the performance of the RAG model on both HoPo and TQA, despite not being fine-tuned on KILT like RAG. Moreover, our constructed R-LLMs demonstrate acceptable accuracy levels on other datasets as well, without any significant drawbacks. The results indicate that the LLMs used in this study exhibit a certain ability to comprehend the retrieved documents.
Furthermore, our analysis reveals several factors that could contribute to improving downstream performance, including retrieval augmentation (except on ELI5), increased model scale (except on FEV and T-REx), and referring to more documents during generation (except on NQ, HoPo, TQA, and WoW). However, some datasets exhibit exceptions to these tendencies or show lower performance compared to their corresponding has_answer percentages (such as FEV, T-REx, NQ, and TQA). To address this issue, developers can improve an R-LLM with RaLLe by refining the inference chain and the prompt templates. In Section A.4, we describe our initial attempts at developing inference chains with three actions on several datasets.
Overall, the downstream evaluation results provide valuable insights into how well the constructed R-LLMs perform on knowledge-intensive tasks, enabling developers to identify areas for improvement.

Retrieval Performance
Table 3 shows the retrieval performance of the chosen retrievers on the KILT development set (see also Table 8 in the Appendix for the results of recall@5). According to Table 3, e5 (with a Faiss Flat index) achieves the highest retrieval performance on average, even though m-e5 performs better on the MTEB Retrieval task (Table 1). Despite the superior retrieval accuracy of e5 compared to RAG on KILT, the downstream performance of the R-LLM that employs e5 falls short of that of RAG (Table 2). This indicates that there is room for improvement through further optimized prompts to enhance performance on a target dataset.
As described in Section 3.3, REWRITE-EL serves as the default template for search queries in the entity linking tasks (AY2, WnWi, and WnCw). As shown in Table 3, employing the REWRITE-EL template leads to higher retrieval accuracy compared to using the full question text as a search query (the − REWRITE-EL setting). This indicates that omitting unnecessary information from the search queries is especially helpful for entity linking tasks.

Speed Analysis
RaLLe allows users to optimize the trade-off between latency (in seconds per question) and accuracy by comparing various configurations. As demonstrated in Table 4, employing approximate nearest neighbor search (ANNS) algorithms such as HNSW and DiskANN can significantly reduce retrieval latency at the cost of decreased accuracy. Note that the optimal balance between speed and accuracy depends on the specific requirements of the application, and RaLLe enables users to easily experiment with diverse ANNS settings to determine their impact on both factors. Notably, DiskANN achieves an accuracy that is only slightly lower than the Faiss flat index while significantly improving search speed, despite requiring a smaller memory footprint than both the flat and HNSW indices. Though the reduction in R-LLM execution time achieved through ANNS may appear relatively minor, the significantly lower DRAM requirements of DiskANN could make it a more practical solution for scenarios where DRAM capacity is limited and the flat index exceeds it. For further details regarding latency, refer to Table 9 in Appendix A.8.

Conclusion
This paper introduces RaLLe, an accessible framework for developing and evaluating R-LLMs. We also report evaluation results for several R-LLMs built using open-source retrievers and LLMs on knowledge-intensive tasks. Overall, RaLLe offers a significant advancement in retrieval-augmented generation research, enabling efficient development, evaluation, and improvement of R-LLMs. We hope that RaLLe will contribute to the development of best practices for R-LLMs.

Limitations
All KILT evaluations presented in this paper were conducted using the development sets to maintain fairness and consistency across evaluations, as the answers of the test sets remain confidential.
While R-LLMs exhibit high validity, they fall behind the smaller yet specialized RAG model on the KILT downstream tasks (refer to Table 2). This disparity can be attributed to various factors, including prompt maturity and the ability of LLMs to generate responses. Although the employed prompts were carefully developed, it is likely that more optimal prompts exist (discussed in Section 4.3). Moreover, fine-tuning LLMs on retrieval-augmented generation tasks might enhance their performance on downstream tasks. Therefore, the evaluation accuracy reported herein represents a conservative estimate.
Prompt engineering is a crucial aspect of the retrieval-augmented generation process, as the generated outputs can differ significantly between models, even when provided with the same prompt. RaLLe offers an advantage in this regard, allowing users to effortlessly experiment with diverse prompts for varying behaviors, datasets, and intricate chains of actions.
In the realm of prompt development, techniques like Automatic Prompt Engineer (APE) (Zhou et al., 2023) automate the creation of prompts from input-output pairs and use sampling to identify the most effective prompts. However, the input-output pairs in retrieval-augmented generation are distinctly different from those of simple instruction induction tasks. Because the input text for retrieval-augmented generation can often be lengthy and complex, it is difficult to automatically induce effective prompts from the input-output pairs.
Our tool enables developers to construct an inference chain with predefined actions, while recent advances have also introduced methods that allow LLMs to determine the actions themselves (Yao et al., 2023). One such approach entails retrieving documents using a query rewritten by an LLM and then summarizing them until the desired information is obtained. However, in our initial experiments (not described in this paper), we observed instances where relatively small LLMs (typically less than 100 billion parameters) became trapped in cycles of repeated retrieval and summarization, hindering their ability to reach the final answer generation. Our tool addresses this issue by intentionally building explicit inference chains to avoid unintended operations.

A.1 Computational Resources
The evaluation experiments are conducted on an Ubuntu 20.04.6 server equipped with Intel(R) Xeon(R) Gold 6326 CPUs at 2.90 GHz, one node with 4×NVIDIA A100 Tensor Core GPUs with 40 GB memory each, and a RAID-5 array with a Dell(R) PERC H745 Front controller and KIOXIA(R) PM6-R SAS SSDs for storage. The CUDA version is 12.2, the Python version is 3.9.16, the PyTorch version is 2.0.1, and the Transformers version is 4.29.2.

A.2 Additional Metric: has_answer
RaLLe also includes the has_answer percentage (e.g., Karpukhin et al., 2020) for short answers, a proxy metric that measures the proportion of questions for which the final output generated by an R-LLM contains a gold answer. By tracking this metric, developers can identify situations where the model generates responses that include gold answers but are overlooked due to evaluation biases such as exact matching. This information can help refine prompts to improve overall performance.
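A minimal sketch of such a check is shown below; the normalization details (lowercasing, punctuation stripping) are our assumption, and the actual evaluation scripts may differ:

```python
import re
import string

def normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def has_answer(output, gold_answers):
    """True if any gold answer appears as a substring of the model output."""
    out = normalize(output)
    return any(normalize(g) in out for g in gold_answers)

print(has_answer("The capital of France is Paris.", ["Paris", "City of Paris"]))
```

Unlike exact match, this substring check credits verbose but correct answers, which is exactly the gap the has_answer metric is meant to surface.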

A.3 Development Screen of RALLE
Figure 2 shows the chain development screen. Developers can create an inference chain for an R-LLM on this Develop chain tab. One can choose a dataset and specify the desired chain length, which represents the total number of actions. By default, there are two actions: retrieving with a retriever and generating with an LLM.
Prompt templates for each action can be defined using f-strings or eval() functions in Python. The results of applying a template can be confirmed without executing retrieval and generation. The execution result can be viewed by clicking the Interpret prompt and execute this action button.
The available action operators are LLM, Retriever, and Identity. LLM generates text based on the given prompt. Retriever retrieves the top-k most relevant documents related to the input query. Identity simply outputs the original prompt without employing a retriever or an LLM.
To execute the entire chain, click the Execute entire chain button. At the bottom of this tab, the selected question and its corresponding answer can be reviewed. RaLLe can also highlight the gold answers within the retrieved documents or the output of the LLM, as well as the Wikipedia IDs of successfully retrieved provenance.

A.4 Attempts to Build 3-action Chain
According to Section 4.2, retrieval augmentation has a significant impact on performance in fact checking, open-domain QA for short answers, and slot filling tasks when comparing the closed-book and open-book settings of Llama2-70B. In the entity linking tasks (AY2, WnWi, and WnCw), however, our approach described in Section 3.3 (retrieve, then output the top-1 retrieved Wikipedia title) may not be effective.
To improve the performance, we construct a 3-action chain for the AY2 dataset: (1) retrieve the top-5 relevant documents, (2) explain the entity mention being questioned, and (3) predict the Wikipedia title based on the explanation and the top-5 retrieved titles. Additionally, we explore 3-action chains for the T-REx and NQ datasets, which involve (1) retrieval, (2) question rewriting, and (3) answer generation. Table 12 shows the prompts used in the 3-action chains.
Table 5 shows the downstream performance of the 3-action chains on the AY2, NQ, and T-REx datasets. While the 3-action chain outperforms the 2-action (retrieve-then-generate) chain on the NQ dataset, it underperforms the 2-action accuracies on the AY2 and T-REx datasets. This suggests that the 3-action chains constructed specifically for these two datasets require further optimization. However, the has_answer value for AY2 (70.0%) is higher than that of the 2-action chain (47.8%), indicating that incorporating post-processing steps into the 3-action chain (making it a 4-action chain) could potentially boost accuracy, particularly for AY2.
One of the benefits of our tool is that it allows for easy definition of such additional inference actions.This means that developers can customize the chain to perform specific tasks beyond the default setting, giving them greater flexibility and control over their development.

A.5 Details of Baseline Model in Open-Book Setting
As a baseline in the open-book setting, we present the results of the Retrieval-Augmented Generation (RAG) model (Lewis et al., 2020b) shown in Petroni et al. (2021), which achieved strong performance on the KILT benchmark. The RAG model comprises a bi-encoder retriever and a sequence-to-sequence generator (a BART model (Lewis et al., 2020a)), both of which are trained end-to-end. The total number of trainable parameters in the RAG model is approximately 626 million. It is important to note that the RAG model was trained specifically for the KILT benchmark, whereas our chosen LLMs and constructed R-LLMs were not.

A.6 KILT Downstream Performances in Closed-Book Setting
Table 6 summarizes the KILT downstream results in a closed-book setting. The baseline (BART-large) model has been fine-tuned on the KILT datasets, while our chosen LLMs have not. Despite this, the LLMs demonstrate superior performance compared to the baseline on several datasets. Specifically, the Llama2-70B model outperforms the BART baseline on the zsRE and TQA datasets, and the Llama2-13B model outperforms the baseline on the ELI5 dataset. This suggests that the parametric knowledge embedded in the LLMs and their capacity for text generation can be leveraged effectively for knowledge-intensive tasks, even in a zero-shot setting. Nevertheless, as described in Section 4.2, retrieval augmentation can enhance the performance on downstream tasks, except on the ELI5 dataset. We also present the closed-book performance of several LLMs on the development set of the NQ dataset in Table 7.

A.7 Additional Results for Retrieval Performance
Table 8 presents the recall@5 of the retrievers used in our experiments. Note that even though m-e5 outperforms e5 on the MTEB Retrieval task (shown in Table 1), e5 still demonstrates superior performance compared to m-e5 in terms of both R-precision (shown in Table 3) and recall@5.

A.8 Details of Speed Analysis
Table 9 presents the details of the speed analysis on the KILT development set. The search speed of BM25 (without REWRITE-EL) decreases as the total number of words in a query increases. In contrast, for dense vector search, the search speed remains relatively constant regardless of the size of the query due to the fixed dimensionality of the embedding vectors. According to Table 9, the execution time required for generation with an LLM is longer than the time required for retrieval, particularly when generating lengthy responses, as in ELI5 and WoW. Therefore, it may seem counterintuitive that the advantages of ANNS used in vector search are not fully realized in terms of the execution time of R-LLMs. However, as previously discussed in Section 4.4, DiskANN requires less memory than other vector search algorithms, which means that using such an algorithm can help conserve computational resources for an R-LLM. We observe that Llama2-13B requires more time to process each question compared to Llama2-70B. Upon further analysis, we discovered that the Llama2-13B model occasionally produced nonsensical responses, such as multiple newline characters ("\n"), partially due to the limitations of our prompts.

A.10 The Prompts used in the Evaluation
Table 11 summarizes the prompts used in our experiments. Open-book indicates the retrieve-then-generate setting. The queries used for retrieval are the raw questions without any rewriting, except for the REWRITE-EL settings of AY2, WnWi, and WnCw.
Closed-book indicates that an LLM answers the given question without retrieval. Although these prompts have been our established best practices, we recognize that there may be opportunities for improvement (see also Section 5).

Table 12:
Prompt templates used in 3-action chains.

Figure 1:
Overview of RaLLe, our proposed development and evaluation framework for R-LLMs. Any number of actions can be defined for an R-LLM. Each action can be executed individually to test the corresponding prompts. Experimental setup and evaluation results can be tracked using MLflow. Additionally, a simple chat interface can be built to test best practices from the development and evaluation stages in a practical setting.

Figure 2:
A screenshot of the Develop chain tab of RaLLe. Developers can create tailored chains comprising multiple inference actions. For each action, developers can individually specify a prompt template, confirm the results of applying the template, and execute the action using the newly defined prompt. Moreover, RaLLe can highlight the gold answers within the retrieved documents or the output of the LLM, as well as the Wikipedia IDs of successfully retrieved provenance.

Table 2:
Downstream performance on the KILT dev set. Following Petroni et al. (2021), we report the results of typical metrics for each dataset, with bold indicating the best result and underline indicating the second best. The figures in parentheses represent the has_answer percentage, which corresponds to the proportion of questions with gold answers included in the final output. The figures shown in gray are copied from the column above because they do not change based on the given setting. ♢: Results from Petroni et al. (2021).

Table 3:
Retrieval performance (page-level R-precision) on the KILT dev set. Avg. refers to the macro-average of the retrieval scores across datasets. Bold indicates the best result. ♢: Results from Petroni et al. (2021). *: BM25 (without REWRITE-EL) failed with long queries (45 out of 5,599 questions) in WnCw.

Table 4:
Execution latency in seconds per question (sec/Q). Memory in Retrieval indicates the maximum (DRAM) memory footprint.

Table 5:
Downstream performance of the 3-action chains on the KILT dev set, along with baselines. The figures in parentheses represent the has_answer percentage, which corresponds to the proportion of questions with gold answers included in the final output of the LLM. ♢: Results from Petroni et al. (2021).

Table 6:
Downstream performance on the KILT development set in a closed-book setting (generation without retrieval). Following Petroni et al. (2021), we report the results of typical metrics for each dataset, with bold indicating the best result. The figures in parentheses represent the has_answer percentage, which corresponds to the proportion of questions with gold answers included in the final output of the LLM.

A.9 Models Used in Our Experiments

As summarized in Table 10, we utilize several open-source models from Hugging Face, specifically their officially released versions. We load the distributed models in 8-bit precision by default, except for the Llama2-70B model, which is loaded in 4-bit precision using the Hugging Face Accelerate library.

Table 7:
Accuracies on the NQ dev set in a closed-book setting. For gpt-3.5-turbo, the accuracy was calculated excluding five of the 2,837 questions in the NQ development set that were deemed inappropriate prompts by OpenAI and were not processed.

Table 8:
Retrieval performance (recall@5) on the KILT dev set. Avg. refers to the macro-average of the scores across datasets. Bold indicates the best result. The figures shown in gray are copied from the column above because they do not change based on the given setting. ♢: Results from Petroni et al. (2021). *: BM25 (without REWRITE-EL) failed with long queries (45 out of 5,599 questions) in WnCw.

Table 9:
Execution time (in seconds per question) in RaLLe. Avg. refers to the macro-average of the times across tasks. The mean query length and its standard deviation (shown as ± after the value) are also displayed; these were calculated using the e5 tokenizer.

Table 11:
Prompt templates used in our experiments. The hook-left arrow (←) indicates a new line. Note that RaLLe supports f-strings and the eval() function in Python.
