Retrieving Multimodal Information for Augmented Generation: A Survey

As Large Language Models (LLMs) become popular, an important trend has emerged of using multimodality to augment their generation ability, enabling LLMs to interact better with the world. However, there is no unified view of at which stage, and in what way, different modalities should be incorporated. In this survey, we review methods that assist and augment generative models by retrieving multimodal knowledge, whose formats range from images, code, and tables to graphs and audio. Such methods offer a promising solution to important concerns such as factuality, reasoning, interpretability, and robustness. By providing an in-depth review, this survey is expected to give scholars a deeper understanding of these methods' applications and encourage them to adapt existing techniques to the fast-growing field of LLMs.


Introduction
Generative Artificial Intelligence (GAI) has demonstrated impressive performance in tasks such as text generation (Ouyang et al., 2022; Chowdhery et al., 2022; Brown et al., 2020) and text-to-image generation (Ramesh et al., 2021a). Powered by their abilities in modality-specific tasks, the recent incorporation of multimodality (Driess et al., 2023; OpenAI, 2023; Huang et al., 2023b) has opened up possibilities for generative models to serve as general-purpose learners across different formats of information.
However, generative models suffer from inevitable limitations, such as hallucinations (Ye and Durrett, 2022), arithmetic difficulties (Patel et al., 2021), and a lack of interpretability. Thus, a promising solution is for generative models to learn to interact with the external world and retrieve knowledge in different formats, thereby augmenting their generation abilities (Mialon et al., 2023).
Recently, there have been emerging studies focusing on retrieval-based approaches, which aim to provide generative models with more information.
Among them, most (Nakano et al., 2021; Guu et al., 2020) use textual information retrieved from the web or textual corpora. Although the textual format aligns with the data used during pre-training and offers a natural medium for interaction, much world knowledge is contained in other formats, such as images, videos, graphs, and audio. These types of information are often inaccessible, unavailable, or not describable in traditional textual corpora.
Recent advancements in Multimodal Large Language Models (MLLMs) (Huang et al., 2023b; OpenAI, 2023; Driess et al., 2023) have improved generative models' capability to handle multi-format information, demonstrating significant potential for augmenting generation with multimodal knowledge. This has resulted in an emerging line of work that uses retrieval-based and multimodal techniques to address limitations such as hallucination and lack of interpretability.
In this survey, we review recent advancements in multimodal retrieval-augmented generation. Because retrieval and synthesis procedures, goals, and targeted tasks often differ across modalities, we group relevant methods by modality, including image, code, structured knowledge, audio, and video.
For each modality, we review previous work, the current state, and future challenges. For example, in the image domain, retrieval-augmented methods have been used to better ground visual question answering (VQA) tasks (Chen et al., 2022a; Tiong et al., 2022) and to generate more factual captions (Yang et al., 2023b; Yasunaga et al., 2022). In the code domain, retrieval-based works decouple logic and textual information, which results in more faithful and factual outputs (Lyu et al., 2023; Chen et al., 2022c). To enhance factuality, some methods (Thoppilan et al., 2022; Cheng et al., 2022) also retrieve grounding contexts from structured knowledge, such as tables and knowledge graphs. Moreover, there is emerging work combining audio and video retrieval with generative models (He et al., 2022b; Bogolin et al., 2022).
We believe that the emergence of multimodal retrieval-augmented generation holds solutions to many current challenges. To encourage more future research in this domain, we analyze several promising future directions, including retrieval-augmented multimodal reasoning, building a multimodal knowledge index, and combining retrieval with pre-training.
As the direction of multimodal retrieval-augmented generation is still emerging, we will continue to add new works and expand the scope of this survey.

Multimodal Learning
Multimodal learning focuses on learning unified representations for data from different modalities, e.g., text, images, audio, and video. It aims to extract complementary information to facilitate compositional tasks (Baltrušaitis et al., 2018; Gao et al., 2020). With the fruitful progress made in computer vision (Dosovitskiy et al., 2021; Liu et al., 2021d), natural language processing (Lan et al., 2020; Lewis et al., 2020), and speech recognition (Baevski et al., 2020; Hsu et al., 2021), multimodal models capable of processing and integrating data from different modalities have improved greatly.
Multimodal learning has numerous applications. For instance, in computer vision it can improve image recognition accuracy by analyzing images and videos in conjunction with textual descriptions (Ju et al., 2022; Alayrac et al., 2022a; Jia et al., 2021; Radford et al., 2021b). Multimodal models can also incorporate visual information from images or videos to enhance language understanding and generation (Zhou et al., 2020; Lei et al., 2021). Multimodal learning further has the potential to significantly enhance the performance of machine learning systems across domains by allowing them to learn from and integrate multiple sources of information (Tsai et al., 2019; Acosta et al., 2022; Nagrani et al., 2021).
With the increasing availability of large-scale multimodal datasets (Elliott et al., 2016; Sheng et al., 2016; Duarte et al., 2021), multimodal pre-trained models have been developed and have shown promising results in various applications (Gan et al., 2022; Uppal et al., 2022). Built on the successful Transformer architecture, large multimodal pre-trained models such as VL-Bert (Su et al., 2020), SimVLM (Wang et al., 2021d), ALBEF (Li et al., 2021), and CLIP (Radford et al., 2021b) are highly effective at learning complex patterns and relationships in multimodal data. These large models can then be transferred to different downstream tasks, including VQA, image captioning, and object detection.
Additionally, there has been growing interest in developing models that can generate output incorporating multiple modalities of data. For example, DALL-E (Ramesh et al., 2021b) is trained on pairs of textual descriptions and corresponding images to learn joint representations, and it can generate highly creative and diverse images from even very complex textual descriptions. Similarly, VQGAN-CLIP (Crowson et al., 2022) can generate new images based on textual prompts, where the textual description guides the generation of the image; it combines the CLIP model for image-text understanding with the VQGAN model for image generation. There is also potential to improve the performance of natural language processing models by incorporating visual information in language generation tasks (Lin and Byrne, 2022; Chen et al., 2022a).
Multimodal generative models have a wide range of applications, such as text-to-image generation, creative writing, and multilingual translation. They can also be used to produce new product designs or textual content, including website content and documents. However, challenges remain for multimodal generative models, such as access to large amounts of multimodal data, network designs that produce semantically meaningful outputs, the interpretability of the models, and related ethical issues. It is critical to address these challenges to realize the full potential of multimodal generative models and ensure their proper use.

Retrieval-Augmented Generation
Retrieval-augmented generation is now popular in natural language processing (NLP), a field that has long posed challenges for artificial intelligence (AI). In the past, the primary research focus was on developing specialized frameworks for specific tasks (Chiu and Nichols, 2016; Liu et al., 2016; Ding et al., 2020; Qin and Joty, 2022a). In recent years, there has been a significant shift towards powerful, general-purpose language models that can be fine-tuned or prompt-tuned for a wide range of applications (Devlin et al., 2019; Yang et al., 2019; Raffel et al., 2019; Lewis et al., 2019; Brown et al., 2020; Liu et al., 2021b; Qin and Joty, 2022b; Ding et al., 2022b; Qin et al., 2023a). Through pre-training on large-scale unlabeled corpora, pre-trained language models have shown significant improvement on a wide range of NLP tasks (He et al., 2021b; Liu et al., 2021a; Ding et al., 2022a; Qin et al., 2023b; Zhou et al., 2023). While this approach shows great potential, it is mainly applied to simple tasks such as sentiment analysis, which humans can easily accomplish without requiring additional knowledge or expertise (Lewis et al., 2020).
In order to address the difficulties associated with knowledge-intensive NLP tasks, there are primarily two approaches. The first involves pre-training on a knowledge base and storing the acquired knowledge within a PLM (Zhang et al., 2019; Liu et al., 2020; Wang et al., 2021a; Liu et al., 2022; Zhou et al., 2022b; Jiang et al., 2022). The benefit of this approach is that it leverages a single model. However, it has two significant disadvantages: first, it is difficult to control what knowledge the models have learned; second, parameter updates are required when new knowledge arrives. The second approach is to develop retrieval-augmented generation methods (Gu et al., 2018; Weston et al., 2018; Cai et al., 2019b; Lewis et al., 2020) that combine a retrieval component with a generative component (e.g., a PLM or LLM). Specifically, denote the generative model by f and the input text by x. Traditional generative models predict the output y as y = f(x). Denote the retriever module by g; it retrieves segments of information c_r based on (parts of) the input, x_r ∈ x, i.e., c_r = g(x_r). Retrieval-augmented generation can then be formulated as y = f(x, c), where c = {x_r, c_r} is a set of relevant instances retrieved from either the original training set or external datasets to improve response generation. The primary idea behind this approach is that c_r can aid in generating a better response if it is similar or relevant to the input x_r. The retrieval memory can be obtained from three sources: the training corpus, external datasets, and a large-scale unsupervised corpus (Li et al., 2022a).
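To make the formulation concrete, the following is a minimal sketch of the retrieve-then-generate pattern described above, assuming a toy encoder and a placeholder generator; it is an illustration of the formulation, not the implementation of any cited system.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy deterministic encoder; a real system would use a trained dense retriever (e.g., DPR)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(128)

def retrieve(x_r: str, corpus: list[str], k: int = 2) -> list[str]:
    """c_r = g(x_r): return the top-k corpus entries by cosine similarity to the query."""
    q = embed(x_r)
    scored = []
    for doc in corpus:
        d = embed(doc)
        scored.append((float(q @ d) / (np.linalg.norm(q) * np.linalg.norm(d) + 1e-8), doc))
    return [doc for _, doc in sorted(scored, key=lambda t: t[0], reverse=True)[:k]]

def generate(x: str, c: list[str]) -> str:
    """y = f(x, c): stand-in for the generator; a real system would pass this prompt to a PLM/LLM."""
    return "\n".join(c) + "\n\nQuestion: " + x + "\nAnswer:"

corpus = ["Passage about topic A.", "Passage about topic B.", "Passage about topic C."]
x = "A question about topic B"
c_r = retrieve(x, corpus)   # retrieval step: c_r = g(x_r)
y = generate(x, c_r)        # augmented generation step: y = f(x, c)
```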
Retrieval-augmented generation has been applied to a wide range of downstream NLP tasks, including machine translation (Gu et al., 2018; Zhang et al., 2018; Xu et al., 2020; He et al., 2021a), dialogue generation (Weston et al., 2018; Wu et al., 2019; Cai et al., 2019a), abstractive summarization (Peng et al., 2019), and knowledge-intensive generation (Lewis et al., 2020; Izacard and Grave, 2021). For text retrieval, there are two types of retrievers that can be used to augment an LM: dense and sparse (Mialon et al., 2023). Sparse retrievers (Robertson et al., 2009) use sparse bag-of-words representations, while dense neural retrievers (Asai et al., 2022) use dense query and document vectors. Both types assess document relevance to a query, with sparse retrievers excelling at precise term overlap and dense retrievers at computing semantic similarity (Luan et al., 2021). Various works have proposed methods to jointly train a retrieval system with an encoder or sequence-to-sequence LM, achieving performance comparable to larger LMs that use significantly more parameters. These models include REALM (Guu et al., 2020), RAG (Lewis et al., 2020), and RETRO (Borgeaud et al., 2022), which integrate retrieval into existing pre-trained LMs, and Atlas (Izacard et al., 2022), which obtains strong few-shot learning capability despite being much smaller than other large LMs. Recent works propose combining a retriever with chain-of-thought (CoT) prompting to augment language models' reasoning (He et al., 2022a; Trivedi et al., 2022). For example, Anonymous (2023) verifies the validity of CoT reasoning steps and retrieves relevant contexts to augment the generation of the uncertain ones. He et al. (2022a) generate reasoning paths using CoT prompts and retrieve knowledge to support the explanations and predictions. Trivedi et al. (2022) propose an information-retrieval CoT approach for multi-step question answering, where retrieval guides CoT reasoning and vice versa.
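The two retriever families can be contrasted with a small sketch; the scoring functions below are simplified illustrations (raw term overlap and cosine similarity), not full BM25 or DPR implementations.

```python
from collections import Counter
import numpy as np

def sparse_score(query: str, doc: str) -> int:
    """Sparse retrieval: score by exact term overlap (a simplified bag-of-words stand-in for BM25-style scoring)."""
    q_terms, d_terms = Counter(query.lower().split()), Counter(doc.lower().split())
    return sum(min(q_terms[t], d_terms[t]) for t in q_terms)

def dense_score(q_vec: np.ndarray, d_vec: np.ndarray) -> float:
    """Dense retrieval: cosine similarity between learned query and document embeddings."""
    return float(np.dot(q_vec, d_vec) / (np.linalg.norm(q_vec) * np.linalg.norm(d_vec) + 1e-8))
```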

Multimodal Retrieval-Augmented Generation
As the retrieval and synthesis procedures, targeted tasks, and challenges differ for each modality, we discuss relevant methods grouped by modality, including image, code, structured knowledge, audio, and video.

Image
Incorporating image data with text information has long been a crucial research topic, as a considerable amount of world knowledge is stored exclusively in images.
Recent advances in pre-trained models shed light on general image-text multimodal models: Flamingo (Alayrac et al., 2022b) can generate comprehensive captions from input images. FIBER (Dou et al., 2022) proposes a two-stage vision-language (VL) pre-training strategy benefiting different levels of VL tasks. DALL-E (Ramesh et al., 2021a) and Parti (Yu et al., 2022) can generate images based on given text instructions. CM3 (Aghajanyan et al., 2022) models both text and images for its input and output. Blip-2 (Li et al., 2023) bootstraps language-image pre-training from off-the-shelf frozen vision and language models.
However, these models require huge computational resources for pre-training and large numbers of model parameters, as they need to memorize vast world knowledge, such as what chinchillas look like and where they commonly live. More critically, such models cannot efficiently deal with new or out-of-domain knowledge. To this end, multiple retrieval-augmented works have been proposed to better incorporate external knowledge from images and text documents.
For open-domain visual question answering (VQA), RA-VQA (Lin and Byrne, 2022) jointly trains the document retriever and answer generation module by approximately marginalizing predictions over retrieved documents. It first uses existing tools for object detection, image captioning, and optical character recognition (OCR) to convert target images into textual data. Then, it performs dense passage retrieval (DPR) (Karpukhin et al., 2020a) to fetch text documents relevant to the target image from the database. Finally, each retrieved document is concatenated with the initial question to generate the final prediction, similar to RAG (Lewis et al., 2020). Besides external documents, PICa (Yang et al., 2022) and KAT (Gui et al., 2022) also treat LLMs as implicit knowledge bases and extract relevant implicit information from GPT-3. Plug-and-Play (Tiong et al., 2022) retrieves relevant image patches by using GradCAM (Selvaraju et al., 2017) to localize the parts of the image relevant to the initial question. It then performs image captioning on the retrieved patches to acquire augmented context. Beyond text-only augmented context, MuRAG (Chen et al., 2022a) retrieves both text and image data and incorporates images as visual tokens. RAMM (Yuan et al., 2023) retrieves similar biomedical images and captions, then encodes the two modalities with different networks.
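As an illustration of this retrieve-then-generate pipeline for knowledge-based VQA, the sketch below assembles a prompt from textual conversions of the image and retrieved passages; the interfaces are assumptions made for illustration, not the released RA-VQA code. RA-VQA marginalizes over retrieved documents, which is approximated here by a simple majority vote over per-document answers.

```python
from collections import Counter

def ravqa_style_answer(question: str, visual_texts: list[str], retrieved_docs: list[str], generate) -> str:
    """Sketch of a RA-VQA-style answering step (assumed interfaces).

    `visual_texts` are textual conversions of the image (captions, detected objects, OCR),
    `retrieved_docs` are passages fetched by a dense retriever such as DPR, and `generate`
    is any callable wrapping a generative LM.
    """
    visual_context = " ".join(visual_texts)
    answers = [
        generate(f"Context: {doc}\nImage: {visual_context}\nQuestion: {question}\nAnswer:")
        for doc in retrieved_docs
    ]
    return Counter(answers).most_common(1)[0][0]
```

Here `generate` could wrap any sequence-to-sequence model; the prompt format is only one of many reasonable choices.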
Apart from VQA, RA-transformer (Sarto et al., 2022) and Re-ViLM (Yang et al., 2023b) generate more factual captions by retrieving relevant captions. Beyond retrieving images and text documents before generating text, Re-Imagen (Chen et al., 2022b) leverages a multimodal knowledge base to retrieve image-text pairs that facilitate image generation. RA-CM3 (Yasunaga et al., 2022) can generate mixtures of images and text; it shows that retrieval-augmented image generation performs much better on knowledge-intensive generation tasks and opens up new capabilities such as multimodal in-context learning.

Code
Software developers search large amounts of available resources for relevant information to improve their productivity, such as explanations of unknown terminology, reusable code patches, and solutions to common programming bugs (Xia et al., 2017). Inspired by the progress of deep learning in NLP, a general retrieval-augmented generation paradigm has benefited a wide range of code intelligence tasks, including code completion (Lu et al., 2022b), code generation (Zhou et al., 2022a), and automatic program repair (APR) (Nashid et al.). However, these approaches often treat programming languages and natural languages as equivalent sequences of tokens and ignore the rich semantics inherent to source code. To address these limitations, recent work has focused on improving code generalization performance via multimodal learning, which incorporates additional modalities such as code comments, identifier tags, and abstract syntax trees (ASTs) into code pre-trained models (Wang et al., 2021c; Guo et al., 2022; Li et al., 2022c). To this end, the multimodal retrieval-augmented generation approach has demonstrated its feasibility in a variety of code-specific tasks, including the following.

Text-to-Code Generation Numerous studies have investigated the use of relevant code and associated documents to benefit code generation models. A prominent example is REDCODER (Parvez et al., 2021), which retrieves the top-ranked code snippets or summaries from an existing codebase and aggregates them with source code sequences to enhance generation or summarization. As another such approach, DocPrompting (Zhou et al., 2022a) retrieves a set of relevant documentation and uses it as in-context prompts to generate the corresponding code; a minimal sketch of this pattern follows below. In addition to these lexical modalities, RECODE (Hayati et al., 2018) proposes a syntax-based code generation approach that references existing subtrees from the AST as templates to direct code generation explicitly.
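The following is a minimal sketch of documentation-augmented code generation in the spirit of DocPrompting; the retriever and generator interfaces are assumptions for illustration rather than the authors' released API.

```python
def doc_prompted_generation(nl_intent: str, doc_pool: list[str], retrieve, generate, k: int = 3) -> str:
    """Retrieve relevant documentation, then prompt a code LM with it (assumed interfaces).

    `retrieve(query, pool, k)` returns the k most relevant documentation snippets, and
    `generate(prompt)` wraps any code generation model (e.g., a Codex-style LM).
    """
    docs = retrieve(nl_intent, doc_pool, k)
    prompt = "\n\n".join(docs) + f"\n\n# Task: {nl_intent}\n# Code:\n"
    return generate(prompt)
```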
Code-to-Text Generation Retrieval-based code summarization methods have been studied extensively. For example, RACE (Shi et al., 2022) leverages relevant code differences and their associated commit messages to enhance commit message generation. In addition, RACE calculates the semantic similarity between the source code difference and the retrieved ones to weigh the importance of different input modalities. Another retrieval-based neural approach is Rencos (Zhang et al., 2020), which retrieves two similar code snippets based on the syntactic-level and semantic-level similarity of a given query code. These similar contexts are then incorporated into the summarization model during the decoding phase. This idea is further explored by Liu et al. (2021c), where retrieved code-summary pairs are used to augment the original code property graph (Yamaguchi et al., 2014) of the source code via a local attention mechanism. To capture the global semantics for better structural learning of code, a global structure-aware self-attention mechanism (Zhu et al., 2019) is further employed.
Code Completion Recent advances in retrieval-based code completion (McConnell, 2004) have gained increasing attention. Notably, Hashimoto et al. (2018) adapt the retrieve-and-edit framework to improve model performance on code auto-completion tasks. To address practical code completion scenarios, ReACC (Lu et al., 2022b) takes both lexical and semantic information of the unfinished code snippet into account, using a hybrid technique that combines a lexical sparse retriever and a semantic dense retriever. First, the hybrid retriever searches the codebase for code relevant to the given incomplete code. Then, the unfinished code is concatenated with the retrieved code, and an auto-regressive code completion generator produces the completed code from this input; a sketch of this hybrid scheme is given below. To model project-level relations, CoCoMIC (Ding et al., 2022c) decomposes a code file into four components: files, global variables, classes, and functions. It constructs an in-file context graph based on the hierarchical relations among all associated code components, and forms a project-level context graph by considering both in-file and cross-file dependencies. Given an incomplete program, CoCoMIC retrieves the most relevant cross-file entities from its project-level context graph and jointly learns from the incomplete program and the retrieved cross-file context for code completion.
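A minimal sketch of such a hybrid sparse-plus-dense retrieval step for completion is given here, with simplified scoring and an assumed encoder and generator interface; it is not ReACC's actual implementation.

```python
import numpy as np

def hybrid_retrieve(unfinished_code: str, codebase: list[str], encode, alpha: float = 0.5, k: int = 1) -> list[str]:
    """Combine lexical (token-overlap) and semantic (embedding cosine) scores; `encode` is an assumed dense encoder."""
    q_tokens = set(unfinished_code.split())
    q_vec = encode(unfinished_code)
    scored = []
    for snippet in codebase:
        lexical = len(q_tokens & set(snippet.split())) / (len(q_tokens) + 1e-8)
        d_vec = encode(snippet)
        semantic = float(np.dot(q_vec, d_vec) / (np.linalg.norm(q_vec) * np.linalg.norm(d_vec) + 1e-8))
        scored.append((alpha * lexical + (1 - alpha) * semantic, snippet))
    return [s for _, s in sorted(scored, key=lambda t: t[0], reverse=True)[:k]]

def complete(unfinished_code: str, codebase: list[str], encode, generate) -> str:
    """Concatenate the retrieved snippet with the unfinished code and let the generator finish it."""
    retrieved = hybrid_retrieve(unfinished_code, codebase, encode)[0]
    return generate(retrieved + "\n" + unfinished_code)
```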
Automatic Program Repair (APR) Inspired by the observation that a remarkable portion of commits is composed of existing code (Martinez et al., 2014), APR is typically treated as a search problem that traverses a search space of repair ingredients to identify a correct fix (Qi et al., 2014), based on the redundancy assumption (White et al., 2019) that the target fix can often be reconstructed from the search space. Recent studies have shown that mining relevant bug-fix patterns from the existing search space (Jiang et al., 2018) and external repair templates from StackOverflow (Liu and Zhong, 2018) can significantly benefit APR models. Joshi et al. (2022) rank a collection of bug-fix pairs by the similarity of their error messages to build few-shot prompts, incorporating compiler error messages into the large programming language model Codex (Chen et al., 2021) for multilingual APR. CEDAR (Nashid et al.) further extends this idea to retrieval-based prompt design using relevant code demonstrations, incorporating more modalities such as unit tests, error types, and error information. Additionally, Jin et al. (2023) leverage the static analyzer Infer to extract the error type, error location, and syntax hierarchies (Clement et al., 2021) to prioritize the focal context. They then retrieve semantically similar fixes from an existing bug-fix codebase and concatenate the retrieved fixes with the focal context to form instruction prompts for program repair.
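As an illustration of this retrieval-based prompt construction for APR, the sketch below ranks stored bug-fix pairs by a simple error-message similarity and assembles a few-shot prompt; the similarity measure and prompt format are simplifying assumptions, not those of any cited system.

```python
from difflib import SequenceMatcher

def build_repair_prompt(buggy_code: str, error_msg: str, bugfix_pool: list[dict], shots: int = 2) -> str:
    """Select bug-fix pairs whose recorded error messages best match the new error,
    then format them as few-shot demonstrations followed by the new bug to repair.

    Each entry in `bugfix_pool` is assumed to look like {"error": str, "buggy": str, "fixed": str}.
    """
    ranked = sorted(
        bugfix_pool,
        key=lambda ex: SequenceMatcher(None, ex["error"], error_msg).ratio(),
        reverse=True,
    )
    demos = "\n\n".join(
        f"### Error: {ex['error']}\n### Buggy:\n{ex['buggy']}\n### Fixed:\n{ex['fixed']}"
        for ex in ranked[:shots]
    )
    return f"{demos}\n\n### Error: {error_msg}\n### Buggy:\n{buggy_code}\n### Fixed:\n"
```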

Reasoning over Codes as Intermediate Steps
While large language models (LLMs) have recently demonstrated impressive capability on reasoning tasks, they are still prone to logical and arithmetic errors (Gao et al., 2022; Chen et al., 2022c; Madaan et al., 2022). To mitigate this issue, emerging research has focused on using LLMs of code (e.g., Codex (Chen et al., 2021)) to generate code commands for solving logical and arithmetic tasks, and calling external interpreters to execute the commands and obtain the results. Notably, Gao et al. (2022) propose generating Python programs as intermediate reasoning steps and offloading the solution step to a Python interpreter. Additionally, Chen et al. (2022c) explore generating chain-of-thought (CoT) (Wei et al., 2022) reasoning steps not only as text but also as programming language statements; during inference, answers are obtained via an external interpreter. Similarly, Lyu et al. (2023) propose Faithful CoT, which first translates the natural language query into a symbolic reasoning chain and then solves the reasoning chain by calling external executors to derive the answer. Another example is Ye et al. (2023), which uses LLMs to decompose table-based reasoning tasks into subtasks, decouples logic and numerical computation in each step through SQL queries generated by Codex, and calls SQL interpreters to solve them (a process called "parsing-execution-filling").
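To make this program-aided pattern concrete, here is a minimal sketch: the LLM is assumed to emit a Python program that stores its result in a variable named `answer`, and the host then executes it. Both the prompt format and the variable convention are illustrative assumptions rather than the exact protocol of the cited works.

```python
def program_aided_answer(question: str, generate) -> object:
    """Ask a code LM for a program that solves the question, then execute it locally.

    `generate(prompt)` is an assumed wrapper around an LLM of code (e.g., Codex-style).
    The generated program is expected to store its result in a variable named `answer`.
    """
    prompt = (
        "# Write Python code that computes the answer and stores it in `answer`.\n"
        f"# Question: {question}\n"
    )
    program = generate(prompt)
    namespace: dict = {}
    exec(program, namespace)   # offload the actual computation to the Python interpreter
    return namespace.get("answer")

# Example: if the model returned "answer = (23 * 7) + 11", this function would return 172.
```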
LLMs of code are also known to be good structured commonsense reasoners, and even better structured reasoners than natural-language LLMs (Madaan et al., 2022). As a result, prior studies have also investigated transforming structured commonsense generation tasks into code generation problems and employing LLMs of code as the solvers. One such work is CoCoGen (Madaan et al., 2022), which converts each training sample (consisting of the textual input and the output structure) into a Python Tree class. The LLM of code then performs few-shot reasoning over the textual input to generate Python code, which is converted back to the original structure for evaluation. Besides, the success of LLMs of code such as Codex in synthesizing computer code also makes them suitable for generating formal code. Motivated by this, Wu et al. (2022) propose to prove mathematical theorems by adopting Codex to generate formalized theorems from natural language mathematics for the interactive theorem prover Isabelle (Wenzel et al., 2008).
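The conversion of an output structure into code can be illustrated with a small sketch; the class name and fields here are illustrative and do not reproduce CoCoGen's exact schema.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A hypothetical tree node used to serialize a structured commonsense plan as Python code."""
    text: str
    children: list["Node"] = field(default_factory=list)

# A plan for the goal "make coffee", expressed as code a model could be prompted to generate:
plan = Node("make coffee", [
    Node("boil water"),
    Node("grind beans", [Node("measure beans")]),
    Node("brew and pour"),
])
```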

Structured Knowledge
To increase factual grounding and reduce hallucination, a promising direction is to incorporate more structured knowledge, such as knowledge graphs, tables, and databases. An open challenge for generative models is hallucination, where the model is likely to output seemingly plausible sentences that do not conform to ground-truth facts. Researchers have noted that language models relying only on internal knowledge (pre-trained weights) fail to recall accurate details when functioning as a knowledge base in question-answering tasks (Ye and Durrett, 2022; Creswell et al., 2022). Thus, a potential solution is to ground generation with retrieved structured knowledge. Structured knowledge often represents how knowledge from different domains is integrated and can function as a reliable source of truth to enhance factuality.
As the format of structured knowledge departs from the natural text seen by LLMs during pre-training, how to effectively retrieve and synthesize it for generation has been an open challenge. Xie et al. (2022) represent an early attempt, where all formats of knowledge, including tables, triplets, and ontologies, are linearized into text and fed into the LLM without retrieval. Such methods, however, are limited by the acceptable context length of the PLM and are often computationally expensive. Some works design task-specific queries to retrieve structured knowledge through fine-tuning. Large language models such as LaMDA (Thoppilan et al., 2022) have adopted such techniques: during fine-tuning, the model learns to consult external knowledge sources before responding to the user, including an information retrieval system that can return knowledge triplets and web URLs. Li et al. (2022b) propose a unified dialog model that learns to query pre-defined databases with belief states, which take the form of lists of triplets.
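A small sketch of the linearization step mentioned above, turning a table and a knowledge triplet into flat text a language model can consume; the exact separators vary across works and are chosen arbitrarily here.

```python
def linearize_table(header: list[str], rows: list[list[str]]) -> str:
    """Flatten a table into text by pairing each cell with its column name."""
    lines = []
    for row in rows:
        lines.append(" | ".join(f"{col}: {cell}" for col, cell in zip(header, row)))
    return "\n".join(lines)

def linearize_triplet(head: str, relation: str, tail: str) -> str:
    """Flatten a knowledge-graph triplet into a short textual statement."""
    return f"({head}, {relation}, {tail})"

context = linearize_table(["country", "capital"], [["France", "Paris"]])
context += "\n" + linearize_triplet("Paris", "located_in", "France")
# `context` can now be prepended to the question before it is passed to the LLM.
```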
Graph embeddings are used in works such as Pramanik et al. (2021), where a context graph is built on the fly to retrieve question-relevant evidence from RDF datasets, including knowledge graphs, using fine-tuned BERT models. Similarly, Heterformer (Jin et al., 2022) retrieves relevant nodes from text-rich networks, such as academic graphs, product graphs, and social media. By combining GNNs and PLMs, it handles tasks such as link prediction and query-based retrieval. Some works treat the generative model (often a large language model) as a black box and retrieve structured information without fine-tuning. For example, BINDER (Cheng et al., 2022) uses in-context learning to output designed API calls that retrieve question-relevant columns from tables. He et al. (2022a) retrieve from knowledge graphs, such as Wikidata and Conceptnet, based on reasoning steps obtained from chain-of-thought (CoT) prompting (Wei et al., 2022).
By retrieving from relevant sources, the model not only improves its factual grounding but also provides the grounding contexts while generating, thus addressing interpretability and robustness concerns.
With LLMs' recently expanded potential to handle all types of information (OpenAI, 2023), we believe that much work remains to be done in this modality, which offers efficient solutions to factuality concerns. Many future challenges remain. For example, better retrieval systems are needed that enable efficient interaction with diverse knowledge bases. Synthesizing this information correctly into the models is also an open challenge, since it is hard to decide which parts of the textual output need augmenting.

Audio
There currently exist several works that use audio information to augment generation.
When audio information is the input to the generation task, retrieval augmentation has been explored to learn audio-lyrics alignment through contrastive learning (He et al., 2022b), which yields higher-quality music captions. Moreover, retrieval of key/value pairs from an external knowledge catalog has been used for automatic speech recognition (Chan et al., 2023).
In cases where audio is the output, retrieval has been applied in a music generation system with deep neural hashing that encodes music segments (Royal et al., 2020). Audio-text retrieval has also been applied to produce candidates for pseudo-prompt enhancement in text-to-audio generation (Huang et al., 2023a). Although there is a limited amount of research on retrieval-augmented generation involving audio, it could be a promising future direction (Li et al., 2022a).
It is worth noting that the audio modality is closely intertwined with other modalities. Therefore, recent advancements in audio-text retrieval techniques (Hu et al., 2022; Lou et al., 2022; Koepke et al., 2022) and the use of audio features for text-video retrieval (Falcon et al., 2022; Mithun et al., 2018) can benefit retrieval-augmented generation tasks involving other modalities.

Video
Currently, very few works have explored video retrieval for generative tasks, e.g., video captioning. However, recent studies on dense video representation learning can be useful when developing video-knowledge-enhanced generative approaches in the future. Bogolin et al. (2022) propose a query-bank normalization method for cross-modal text-video retrieval. Cap4Video (Wu et al., 2023) and CLIP-ViP (Xue et al., 2022) are data augmentation frameworks that utilize web-scale pre-trained knowledge to enhance text-video retrieval pre-training. Some works also introduce fine-grained interaction between different modalities (Yang et al., 2023a; Wang et al., 2021b). However, these methods still fall short of serving as the foundation of retrieval-augmented generation models, due to the cost of building a video index for knowledge search.
Future Directions

Retrieval Augmented Multimodal Reasoning
The words of the language, as they are written or spoken, do not seem to play any role in my mechanism of thought. The psychical entities which seem to serve as elements in thought are certain signs and more or less clear images which can be "voluntarily" reproduced and combined. - Albert Einstein

One potential application of multimodal information retrieval is multimodal reasoning. Lu et al. (2022a) first introduce ScienceQA, a large-scale multimodal science question dataset annotated with lectures and explanations. Based on this benchmark, Zhang et al. (2023) propose Multimodal Chain-of-Thought (Multimodal-CoT), which incorporates language and vision modalities into a two-stage framework (rationale generation and answer inference), surpassing GPT-3.5 by a large margin with a much smaller fine-tuned model. Similar to Zhang et al. (2023), Kosmos-1 (Huang et al., 2023b) breaks down multimodal reasoning into two steps: it first generates intermediate content as the rationale based on visual information, and then uses the generated rationale to induce the result. However, both methods may have difficulties in understanding certain types of images (e.g., maps), which could be mitigated by retrieving relevant, informative image-text pairs. We hope that future work can pay more attention to how to effectively and efficiently combine multimodal reasoning with multimodal retrieval.

Building a Multimodal Knowledge Index
In order to facilitate retrieval-augmented generation, one of the most fundamental requirements is building a multimodal knowledge index. The goal of building a knowledge index is twofold. First, the dense representation should support low storage cost, dynamic updating of the knowledge base, and accurate search. Second, it should enable faster search with the help of locality-sensitive hashing (Leskovec et al., 2014), which combats scaling and robustness concerns when the knowledge base grows extremely large.
Currently, dense representations for text snippets have been widely studied for documents (Karpukhin et al., 2020b; Gao and Callan, 2021; Gao et al., 2021), entities (Sciavolino et al., 2021; Lee et al., 2021), and images (Radford et al., 2021a). There are also many studies that optimize dense representations in an end-to-end manner (Lewis et al., 2020). Nevertheless, few works (Chen et al., 2022a) have explored building a multimodal index for downstream generation, and those are limited to text and images. How to map a multimodal knowledge index into a unified space remains a long-term challenge.
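As a sketch of the indexing idea discussed above, the following builds a tiny random-hyperplane locality-sensitive hash over embedding vectors; any real multimodal index would replace the toy bucket store with production components and rerank candidates with exact scoring.

```python
import numpy as np
from collections import defaultdict

class LSHIndex:
    """Random-hyperplane LSH: nearby vectors (by cosine similarity) tend to share a hash bucket."""

    def __init__(self, dim: int, n_planes: int = 16, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((n_planes, dim))
        self.buckets = defaultdict(list)

    def _hash(self, vec: np.ndarray) -> str:
        return "".join("1" if s > 0 else "0" for s in self.planes @ vec)

    def add(self, key: str, vec: np.ndarray) -> None:
        self.buckets[self._hash(vec)].append((key, vec))

    def query(self, vec: np.ndarray) -> list[str]:
        """Return candidate keys from the query's bucket (to be refined by exact scoring in practice)."""
        return [k for k, _ in self.buckets[self._hash(vec)]]
```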

Pre-training Combined with Multimodal Retrieval
With the goal of better aligning the abilities to handle different modalities in a pre-trained model, future work could employ retrieval-based approaches during pre-training. Currently, many methods fine-tune a pre-trained generative model for retrieval. For example, LaMDA (Thoppilan et al., 2022) is fine-tuned to call an external toolset, including an information retrieval system, a calendar, and a calculator. Similarly, during fine-tuning, Toolformer (Schick et al., 2023) augments models with API calls to tools including a question-answering system and a Wikipedia search engine. If similar retrieval abilities were leveraged during pre-training, the generative model would be able to interact with retrieval tools better. It could thus output more grounded information, provide relevant contexts to users, and update its information accordingly. When new information arrives, the generative model would be able to retrieve effectively from an up-to-date external base instead of relying solely on pre-trained weights. This advantage also extends to robustness on out-of-domain questions.
To incorporate retrieval into pre-training, the challenge remains of developing appropriate datasets labeled with retrieval-based API calls. To tackle this challenge, LaMDA (Thoppilan et al., 2022) uses labels produced by human annotators, which can be expensive to collect. Toolformer (Schick et al., 2023) uses a sampling-and-filtering approach for automatic labeling, which is inexpensive but could induce noise and bias. A potential solution is to use a neuro-symbolic approach such as Davoudi and Komeili (2021), which uses prototype learning and deep k-NN to find nearest neighbors during training.

Conclusions
This survey reviews works that augment generative models by retrieving multimodal information from external sources. Specifically, we categorize the current domain by the modality used for enhancement, including image, code, structured knowledge, audio, and video. As many pre-trained models call for an external module to handle different formats, they often require further tuning or a tuned external retriever to interact with. With the emergence of large multimodal models, we believe that this survey can serve as a comprehensive overview of an emerging and promising field. Moreover, we hope it will encourage future research in the domain, including retrieval-augmented multimodal reasoning, building a multimodal knowledge index, and combining retrieval with pre-training.