In-context Learning for Few-shot Multimodal Named Entity Recognition



Introduction
Multimodal Named Entity Recognition (MNER) aims to identify named entities of different categories from text with extra image assistance. Consider the example in Figure 1(a): we need to recognize three named entities from the text, "Suarez" (PER) and "Barcelona" and "La Liga" (ORG), to finish the MNER task. Most existing methods employ pre-trained models followed by fine-tuning to accomplish the MNER task and achieve superior performance (Lu et al., 2018; Yu et al., 2020; Zhang et al., 2021; Chen et al., 2022b). However, this superior performance generally relies on sufficient annotated data, which is time-consuming and labor-intensive to obtain. In addition, in practice, entity categories continue to emerge rather than remain fixed, so it is impractical to define all entity categories in advance.
To address these issues, motivated by the few-shot Named Entity Recognition (FewNER) task, which learns unseen entity categories from a small number of labeled examples (Fritzler et al., 2019), we extend the MNER task to the few-shot setting, named the few-shot multimodal named entity recognition (FewMNER) task, which aims to locate and identify named entities for a text-image pair using only a small number of labeled examples. As illustrated in Figure 1(b), the 2-shot MNER task must be accomplished based on only two labeled text-image pair examples. Further, to address the FewMNER task, we propose leveraging the powerful in-context learning (ICL) capability of large language models (LLMs). Specifically, we argue that this paradigm offers a promising direction for solving the FewMNER task by learning from a few examples in context without training. However, three problems arise when solving FewMNER with the in-context learning paradigm: (i) In the FewMNER task, each sample is represented by textual and visual modalities, while the input of an LLM is limited to natural language. Thus, we first need a way to convert the visual modality into natural language form. (ii) The key to performing ICL is to select a few examples to form a demonstration context. Although some example selection studies (Chen et al., 2022a; Min et al., 2022) target text classification tasks, selecting useful examples for ICL in multimodal scenarios has not been explored. (iii) In addition, designing demonstrations precisely is essential for satisfactory performance. Unlike simple classification tasks, the task instruction and output format of MNER need to be constructed according to its extractive nature.
To apply ICL to the FewMNER task, we propose corresponding solutions to the above problems. First, we employ an image caption model (Wang et al., 2022a) to generate textual descriptions from images, which not only converts images into natural language form but also aligns image features into the text space. Second, for selecting examples, we design an efficient sorting algorithm based on image and text similarity ranks, which mitigates the similarity bias caused by different modality models. We then use this algorithm to select the top-k examples with the highest similarity to the current test sample. Third, the demonstration design consists of two parts: instruction construction and demonstration construction (Dong et al., 2023). The former informs the LLM about the current task; to provide more detailed information, we add descriptions of the entity category meanings to the instruction. The latter defines the demonstration template and orders the selected examples within it. The demonstration template consists of three components: image description, sentence, and output, where the output is the label information.
We then pack the selected top-k examples into the demonstration template in ascending order of similarity, such that the most similar example is nearest to the current test sample. Finally, we concatenate the instruction, the demonstration, and the test sample as the input and feed it into the LLM to obtain the prediction output.
The contributions of this paper are as follows:
• We are the first to extend the MNER task to the few-shot field and explore the potential of the in-context learning paradigm for this task.
• To adapt the in-context learning paradigm to the FewMNER task, we address three related problems and propose a framework to accomplish this task.
• Through comparison with previous competitive methods, our framework exhibits a significant advantage on this task. We also conduct extensive analysis experiments to reveal the impact of various factors on its performance and provide novel insights for future research.
Related Work

Multimodal Named Entity Recognition
Multimodal Named Entity Recognition (MNER) aims to discover named entities in unstructured text and classify them into pre-defined types with the help of an auxiliary image. Existing studies can be divided into two categories: cross-modal interaction-based methods and image conversion-based methods. The former carries out cross-modal interaction using an attention mechanism and combines textual representation with image representation for MNER. For example, some studies (Lu et al., 2018; Moon et al., 2018; Zhang et al., 2018) first applied LSTM and CNN to extract text and image features, respectively; attention is then adopted to fuse the two modal features into a textual representation for entity labeling. In addition to modeling the interaction between text and images, a few studies (Chen et al., 2022b; Zhang et al., 2021) leveraged the semantic correlation between tokens and object regions to derive the final token representations for MNER. The latter category (Chen et al., 2021; Wang et al., 2022b) first converts images into textualized information such as captions in order to align image features to the text space; this textualized information derived from the image is then concatenated with the input text to yield the final token representation for entity recognition. Despite their promising results, these methods generally depend on a large amount of annotated data and are inadequate at generalizing the ability to locate and identify entities to unseen entity categories.

In-Context Learning
With the scaling of pre-trained models from 110M parameters (Devlin et al., 2019) to over 500B parameters (Smith et al., 2022), model capabilities have greatly improved, especially understanding ability and the fluency and quality of generation. Many studies have demonstrated that large language models (LLMs) exhibit an in-context learning ability (Brown et al., 2020), i.e., learning from a few context examples without training.
Although various LLMs (e.g., GPT-3, ChatGPT) have been trained, they are closed-source and only accessible internally or via paid API services. How to effectively utilize the in-context learning ability of LLMs is thus an important question. Recently, some studies (Sun et al., 2022; Hu et al., 2022; Zhang et al., 2022) have treated LLMs as a service and utilized in-context learning to finish few-shot and even zero-shot tasks.

Task Definition
Given a text t and its correlated image v as input, the FewMNER task uses only a small number of labeled examples to detect a series of entities in t and classify them into pre-defined categories. Following most existing in-context learning work (Dong et al., 2023), we formulate this task as a generation task. A large language model M takes the generated sequence with the maximum score as the prediction output, conditioned on the context C. For the k-shot MNER task, C contains an instruction I and k examples, where C = {I, s(v_1, t_1, y_1), ..., s(v_k, t_k, y_k)}, s is the demonstration template, and {y_1, ..., y_k} is a set of free-text phrases serving as labels. Therefore, for a given test sample x = {v, t}, the prediction output ŷ can be expressed as:

ŷ = argmax_y P_M(y | C, x). (1)

Retrieve Example Module
Previous works (Rubin et al., 2022; Liu et al., 2022) have demonstrated that selecting examples similar to the current test sample can enhance the performance of an LLM. However, these methods only consider textual similarity scores, which are insufficient for the FewMNER task due to its multimodal nature. Besides, different modality models introduce bias (i.e., different similarity score distributions for text and image (Peng et al., 2018)). To this end, we propose an efficient selection method based on text and image similarity ranks, which mitigates this bias.

Image Similarity Rank
Given a test image v_test and a candidate set D containing N text-image pairs, D = {(v_1, t_1), (v_2, t_2), ..., (v_N, t_N)}, we first adopt the pre-trained vision model ViT (Dosovitskiy et al., 2021) to obtain a representation of each whole image, mapping the test image v_test and the candidate image set D_v = {v_1, v_2, ..., v_N} into embeddings H_v and V, respectively. Then, we calculate the cosine similarity between the test image representation H_v and the representations V of the whole candidate set, and record the similarity rank R_v of each candidate image.

Text Similarity Rank
Given a test text t_test, we first utilize a pre-trained language model such as MiniLM (Wang et al., 2020) as the text extractor to map the text t_test and the candidate text set D_t = {t_1, t_2, ..., t_N} into d_w-dimensional embeddings, where H_t ∈ R^{d_w} is the sentence representation of t_test and T ∈ R^{N×d_w} is the embedding matrix of D_t. Then, we calculate the cosine similarity between the test text representation H_t and the representations T of the whole candidate set, and record the similarity rank R_t of each candidate text.

Sorting Based on Both Similarity Ranks
According to the similarity rank results R_v and R_t of the image and text modalities, we sum the two rankings and sort the result to obtain the final ranking, taking σ = {σ_1, ..., σ_k}, the indices of the top-k positions in the final similarity ranking, as the selected examples.
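As a concrete illustration, the rank-based selection described above can be sketched as follows (a minimal sketch: the encoders that produce the embeddings are assumed to have been run beforehand, and the function name is ours, not the authors'):

```python
import numpy as np

def rank_based_top_k(h_v, V, h_t, T, k):
    """Select top-k examples by summed image/text similarity ranks.

    h_v: (d_v,) test-image embedding; V: (N, d_v) candidate image embeddings.
    h_t: (d_w,) test-text embedding;  T: (N, d_w) candidate text embeddings.
    Summing per-modality ranks instead of raw cosine scores mitigates the
    bias that different modality encoders yield different score distributions.
    """
    def cosine(h, M):
        # Cosine similarity between one vector and each row of a matrix.
        return (M @ h) / (np.linalg.norm(M, axis=1) * np.linalg.norm(h) + 1e-9)

    # argsort(argsort(-scores)) gives rank 0 to the most similar candidate.
    r_v = np.argsort(np.argsort(-cosine(h_v, V)))
    r_t = np.argsort(np.argsort(-cosine(h_t, T)))
    # Smallest summed rank = most similar overall; return indices sigma_1..sigma_k.
    return np.argsort(r_v + r_t)[:k]
```

Because only the ordering within each modality matters, a systematically higher score scale in one modality (as the paper observes for image similarities) cannot dominate the combined ranking.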

Demonstration Designing Module
Following the in-context learning paradigm (Dong et al., 2023), this module consists of two parts: instruction construction and demonstration construction.

Instruction Construction
We use the definition of the MNER task as the instruction, which helps the LLM understand the current task, shown as follows: You are a smart and intelligent Multimodal Named Entity Recognition (MNER) system. I will provide you the definition of the entities you need to extract, the sentence from which you extract the entities, the image description associated with the sentence, and the output format with examples.
To provide more detailed information for the LLM, we describe the meaning of each entity category as follows: 1. PERSON: short name or full name of a person from any geographic region; 2. ORGANIZATION: an organized group of people with a particular purpose, such as a business or a government department; 3. LOCATION: name of any geographic location, such as cities, countries, continents, districts, etc.; 4. MISCELLANEOUS: named entities that do not belong to the previous three groups PERSON, ORGANIZATION, and LOCATION.
Finally, we concatenate the task definition and the entity category definitions as the instruction I:

I = {task definition, category definition}. (12)

Demonstration Construction
As shown in the demonstration designing module in Figure 2, the demonstration template contains an image description, a sentence, and an output. To obtain the image description, we employ the OFA model (Wang et al., 2022a) to convert images into text captions. The sentence is the original text input. The output was initially constructed by concatenating the entity and the category; taking the test sample in Figure 2 as an example, the initial output is "World Cup is miscellaneous.". However, this leads to disordered outputs, such as predicting categories that are not among the four predefined ones, despite the instruction specifying them. To address this issue, we adopt a dictionary-based format that explicitly defines the output structure as {"PER": [], "ORG": [], "LOC": [], "MISC": []}. We find that this approach effectively standardizes the output format.
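The dictionary-based template can be sketched as follows (an illustrative sketch: `format_demonstration` and the exact field wording are our own, not the authors' verbatim prompt):

```python
import json

# The fixed output structure from the paper: every demonstration shows
# all four categories, even when a category has no entities.
EMPTY_OUTPUT = {"PER": [], "ORG": [], "LOC": [], "MISC": []}

def format_demonstration(caption, sentence, entities):
    """Render one demonstration in the dictionary-based output format.

    `entities` maps categories to lists of entity surface forms; any
    missing category is filled with an empty list so the full structure
    is always exhibited to the LLM.
    """
    output = {cat: entities.get(cat, []) for cat in EMPTY_OUTPUT}
    return (f"Image description: {caption}\n"
            f"Sentence: {sentence}\n"
            f"Output: {json.dumps(output)}")
```

Always emitting the complete four-key dictionary is what discourages the model from inventing categories outside the predefined set.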
Finally, the top-k selected examples are fed into the demonstration template in ascending order of similarity, such that the most similar example is nearest to the current test sample.
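Packing the selected examples so that the most similar one sits nearest to the test sample could look like this (a hypothetical helper: `demos_with_rank` pairs each rendered demonstration string with its summed similarity rank, lower meaning more similar):

```python
def pack_demonstrations(demos_with_rank, test_block):
    """Order demonstrations from least to most similar, then append the test sample.

    demos_with_rank: list of (rank_sum, demo_string) pairs, where a lower
    rank_sum means the example is more similar to the test sample.
    Sorting by descending rank_sum places the most similar demonstration
    last, i.e. immediately before the test block.
    """
    ordered = [demo for _, demo in sorted(demos_with_rank, key=lambda x: -x[0])]
    return "\n\n".join(ordered + [test_block])
```

The test block itself would be the same template with an empty "Output:" field, left for the LLM to complete.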

Predict Module
Given the instruction and demonstration, we concatenate them into the whole context C. Then, we feed the context C and the test sample {v_test, t_test} into the LLM and select the most probable generated sequence as the predicted output.
Finally, we decode the prediction output ŷ into a list according to the dictionary format and complete the k-shot MNER task.
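Decoding the dictionary-formatted output into an entity list might be implemented as follows (a sketch; the fallback behavior for unparseable generations is our assumption, not specified in the paper):

```python
import json

def decode_prediction(raw_output):
    """Parse the LLM's dictionary-formatted output into (entity, category) pairs.

    Only the four predefined categories are kept, which also guards
    against occasional off-format generations.
    """
    allowed = ("PER", "ORG", "LOC", "MISC")
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return []  # assumed fallback: an unparseable output yields no entities
    return [(ent, cat) for cat in allowed for ent in parsed.get(cat, [])]
```

The resulting list of (entity, category) pairs can then be scored against the gold annotations to complete the k-shot evaluation.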

Dataset
We conduct experiments on two public multimodal named entity recognition (MNER) benchmark datasets, Twitter2015 and Twitter2017, both constructed by Yu et al. (2020). Each example consists of a text and an associated image. The statistics of the two datasets are shown in Table 1.

Model Settings
For the feature extraction models, we employ clip-vit-base-patch32 and all-MiniLM-L6-v2 to embed each image and text as a 512-dimensional and a 768-dimensional embedding, respectively. For the image caption process, we employ the ofa-image-caption-coco-large-en model to generate image descriptions. For the LLM, we choose gpt-3.5-turbo (i.e., ChatGPT) as the backbone of our framework.

Comparison Models
For fine-tuning methods, we adopt the following baselines: (1) UMT (Yu et al., 2020); (2) UMGF (Zhang et al., 2021); (3) HVPNeT (Chen et al., 2022b); and (4) DebiasCL (Zhang et al., 2023), which proposes a de-bias contrastive learning-based approach for MNER and studies modality alignment enhanced by cross-modal contrastive learning.

For the few-shot methods, we apply the following baselines: (1) ProtoBERT (Ding et al., 2021), which employs a prototypical network with a BERT encoder backbone; (2) StructShot (Yang and Katiyar, 2020), which uses token-level nearest neighbor classification and structured inference.

For the D_all set, few-shot methods are similar to fine-tuning methods; therefore, we use "-" to indicate that this is not a valid few-shot setting. To ensure the reliability of the few-shot baselines, the results of ProtoBERT and StructShot are averaged over 100 runs.

Main Results
We report the main experimental results in Table 2 and draw the following conclusions.
(1) Our framework significantly outperforms the fine-tuning methods on the D_10, D_50, and D_100 sets (except for HVPNeT on the D_100 set of Twitter2015). For example, in terms of F1, our framework outperforms UMT by 55.78% and 65.30%, UMGF by 55.23% and 67.46%, HVPNeT by 36.40% and 35.53%, and DebiasCL by 54.83% and 65.75% on the D_10 sets of the two MNER datasets. These results show that our framework effectively exploits the in-context learning potential of large language models.
(2) Compared with few-shot baselines such as ProtoBERT and StructShot, our framework achieves the best results in all few-shot settings (i.e., 2-shot, 4-shot, and 8-shot). This indicates that methods based on in-context learning are preferable for the FewMNER task, and exploring further in-context learning methods can lead to improved performance.
(3) We observe that the performance of our framework improves as the size of D increases. This is because a larger retrieval set provides more opportunities for the test sample to find similar examples.
(4) Our framework still lags behind the fine-tuning methods under the D_all set.

Ablation Study
To analyze the impact of instruction and demonstration on the performance of our framework, we conduct ablation experiments and report detailed results in Table 3. The results reveal that our framework achieves the best performance when combining instruction and demonstration, suggesting that both components are beneficial. Compared with removing the instruction, removing the demonstration leads to a larger performance degradation, which shows that the demonstration is crucial for our framework. Furthermore, we also conduct ablation experiments on the way of selecting examples. Sorting based on the sum of image and text similarity ranks outperforms sorting based on the sum of image and text similarity scores (i.e., w/ score), because the latter does not account for the bias introduced by different modality models.

Analysis
Image Modality Analysis. To explore the effect of the image description on the FewMNER task, we conduct experiments with single modality (i.e., text) and multi-modality (i.e., text and image) and show the results in Table 4. For a fair comparison, both settings use the same instruction and demonstration, but the single-modality setting discards the image description. We observe that the multi-modality setting outperforms the single-modality setting, especially on the PER category. The reason is that the image caption model tends to generate sentences related to people, which provide useful cues for identifying PER entities.

Impact of Example Selection. We compare three methods of selecting examples (i.e., similar, random, and dissimilar) and report the results in Table 5. We observe that the similar method achieves the best performance, followed by the random method, while the dissimilar method performs the worst. This indicates that selecting examples similar to the test sample to form the demonstration is beneficial for the FewMNER task.
Impact of Example Sort. To investigate the impact of example order on performance, we use the same examples to compare three sorting methods (i.e., ascending, descending, and random) in the 4-shot setting on the D_50 set. The results are shown in Table 6. The ascending sort method, which places the most similar example nearest to the current test sample, outperforms the other methods. This suggests that sorting examples in ascending order of similarity improves performance, as the most similar example provides the most relevant information for the current prediction.

Impact of the Number of Examples. To explore the impact of the number of examples on performance, we conduct experiments with different numbers of examples on the D_50 set and report the results in Table 9. Recently, some works have attempted to explain the ICL capability of LLMs. Dai et al. (2023) interpret language models as meta-optimizers and view ICL as a form of implicit fine-tuning. This is consistent with our finding that more examples can enhance performance, as more examples correspond to more optimization steps. On the Twitter2015 dataset, we observe a similar trend as on the Twitter2017 dataset, but our framework achieves the best score with 16 examples; increasing the number of examples further may introduce more dissimilar examples. These fluctuations indicate that more examples have a positive effect on performance only if they are sufficiently similar.
Error Analysis. In this section, we analyze the factors that affect the performance of our framework. As shown in Table 7, we report the performance on the four categories in the 2-shot, 4-shot, and 8-shot settings on the D_50 set. We find that our framework performs poorly on the MISC category, significantly below the PER, LOC, and ORG categories. The reason is that MISC is a miscellaneous category, defined as named entities that do not belong to the previous three categories; the annotation of MISC entities depends on the preference of annotators. Relying only on the in-context learning ability of the LLM and a few examples is not sufficient to learn this preference.
Moreover, we perform a detailed analysis of wrong predictions, which we classify into two types: boundary errors and category errors. We count the number of errors of each type and report the results in Table 8. We observe that increasing the number of examples significantly reduces boundary errors. Specifically, compared with the 2-shot setting, the 8-shot setting reduces the proportion of boundary errors by 3.80% and 5.09% on the two datasets, respectively. However, increasing the number of examples does not reduce category errors. This is an interesting finding and demonstrates that more examples mainly improve the boundary ability of ICL rather than its category ability.
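One plausible reading of this error taxonomy, where a predicted span absent from the gold spans is a boundary error and a matching span with the wrong label is a category error, can be sketched as (the function and data layout are our assumptions, not the authors' evaluation code):

```python
def classify_errors(gold, predictions):
    """Split wrong predictions into boundary errors and category errors.

    `gold` and `predictions` are lists of (entity_span, category) pairs.
    A predicted span that matches no gold span is a boundary error; a
    matching span with the wrong category is a category error; exact
    matches are correct and ignored.
    """
    gold_by_span = {span: cat for span, cat in gold}
    boundary, category = [], []
    for span, cat in predictions:
        if span not in gold_by_span:
            boundary.append((span, cat))       # wrong entity boundary
        elif cat != gold_by_span[span]:
            category.append((span, cat))       # right span, wrong category
    return boundary, category
```

Counting the two lists per shot setting reproduces the kind of breakdown reported in Table 8.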

Conclusion
In this paper, we formulate multimodal named entity recognition as a few-shot learning problem, named few-shot multimodal named entity recognition (FewMNER), to extend entity detection to unseen entity categories. To tackle FewMNER, we propose a framework based on in-context learning that addresses three related problems. Experimental results show that our framework outperforms baselines in several few-shot settings. Moreover, our analysis experiments find that selecting similar examples, sorting them in ascending order of similarity, and using more examples all improve the performance of in-context learning. We also perform an error analysis and observe that increasing the number of examples reduces boundary errors but not category errors. These results provide novel insights for future work on the FewMNER task.

Limitations
Although the proposed framework significantly outperforms several strong baselines on the FewMNER task, it suffers from the following limitations:
• When using the full amount of data, our framework still lags behind fine-tuning methods. We could use some data to domain-fine-tune the LLM before applying our framework, which may further improve the performance of in-context learning on the FewMNER task. This is a direction for future efforts.
• Unfortunately, due to API limitations, we are unable to obtain results from the more powerful GPT-4 model, which has multimodal capability. Further experiments and analysis are required.
We believe that addressing the above limitations can further improve the performance of our framework.
Figure 1: (a) The MNER task. (b) The 2-shot MNER task, which aims to accomplish the MNER task based on two labeled text-image pair examples.

Figure 2: The architecture of our framework on the 2-shot MNER task.

Figure 2 illustrates the overall architecture of our framework for the 2-shot MNER task, which contains three main components: (1) the retrieve example module, which utilizes k-nearest neighbors of text and image to select examples; (2) the demonstration designing module, which includes instruction construction and demonstration construction; and (3) the predict module, which applies a large language model to generate prediction results without training.
Compared with sorting based on the sum of image and text similarity scores, sorting based on both similarity ranks accounts for the bias introduced by different modality pre-trained models. Analyzing the distributions of the image similarity scores S_v and the text similarity scores S_t, we observe that the image similarity scores are generally higher than the text similarity scores; sorting based on both similarity ranks effectively addresses this issue. Finally, we take the top-k examples with the highest similarity ranking as the selected examples.

Table 1: Statistics of the two MNER datasets.
In this work, we select k examples from the entire set to perform the k-shot MNER task without training.

Table 2: Main experimental results comparing fine-tuning and few-shot baselines with our framework in 2-shot, 4-shot, and 8-shot settings on the D_10, D_50, D_100, and D_all sets of the Twitter2015 and Twitter2017 datasets. FT denotes the fine-tuning method. For the D_10 set, results are marked with "-" because the 4-way 2/4/8-shot setting requires more examples than the D_10 set contains.

Table 3: Ablation study in 2-shot and 4-shot settings on the D_50 set. I and D denote instruction and demonstration, respectively. w/ score indicates that examples are selected based on the sum of image and text similarity scores.

Table 4: Different modalities in the 4-shot setting on the D_50 set.

Table 5: Different example selection methods in the 4-shot setting on the D_50 set.

Table 6: Different example sorting methods in the 4-shot setting on the D_50 set.

Table 8: Analysis of wrong predictions in 2-shot, 4-shot, and 8-shot settings on the D_50 set. N_p, B., N_cb, and C. denote the number of predictions, the proportion of boundary errors, the correct number of boundaries, and the proportion of category errors, respectively.

Table 9: Different shot settings on the D_50 set.