Retrieval-Augmented Few-shot Text Classification

Retrieval-augmented methods are successful in the standard scenario where the retrieval space is sufficient, whereas in the few-shot scenario with a limited retrieval space, this paper shows it is non-trivial to put them into practice. First, it is impossible to retrieve semantically similar examples by using an off-the-shelf metric, and it is crucial to learn a task-specific retrieval metric. Second, our preliminary experiments demonstrate that it is difficult to optimize a plausible metric by minimizing the standard cross-entropy loss. In-depth analyses quantitatively show that minimizing the cross-entropy loss suffers from weak supervision signals and a severe gradient vanishing issue during optimization. To address these issues, we introduce two novel training objectives, namely EM-L and R-L, which provide more task-specific guidance to the retrieval metric through the EM algorithm and a ranking-based loss, respectively. Extensive experiments on 10 datasets prove the superiority of the proposed retrieval-augmented methods in terms of performance.


Introduction
Few-shot text classification, which entails learning a new task from limited training data, has been advanced by pre-trained language models (PLMs) (Brown et al., 2020; Liu et al., 2023) and prompt engineering (Gao et al., 2021; Chen et al., 2022a). However, since training the numerous parameters of PLMs on scarce data is prone to over-fitting (Liu et al., 2021) and unstable generalization, relying solely on the trained parameters for inference usually leads to unsatisfactory performance on unseen test data.
On the other hand, retrieval-based methods have witnessed success on various natural language processing tasks, thanks to their capability of incorporating retrieved memory, alongside parameters, for better generalization. These methods retrieve relevant examples as memories from a large-scale corpus through either a static retrieval metric (Lewis et al., 2020; Wang et al., 2022) or a joint learning-based metric (Cai et al., 2021; Siriwardhana et al., 2023), and the retrieved examples are then used to make a prediction. In this way, their generalization ability derives not only from the model parameters but also from the retrieved memory.
Despite the theoretical potential of promoting generalization with retrieved memory, previous retrieval-augmented methods empirically struggle to show compelling ability in few-shot learning scenarios, where the retrieval space (i.e., the few-shot training data) is limited. Specifically, static retrieval may lack neighbors with high metric values when the retrieval space is limited. Even when such neighbors exist, static retrieval cannot be relied upon to fetch samples that are truly helpful for the target prediction, because its metric is not task-specific. As for joint learning-based retrieval, which minimizes the standard cross-entropy loss, although the retrieval metric is updated towards the downstream task, it suffers from the gradient vanishing problem during optimization, as quantitatively measured in Fig. 2 (see §5.2 later). As a result, in a few-shot scenario, the retrieval metric might not be optimized well due to insufficient training data.
To overcome the aforementioned challenges, we propose two novel training objectives, namely Expectation Maximization-based Loss (EM-L) and Ranking-based Loss (R-L), for learning to retrieve examples from a limited space more effectively. Both objectives are committed to obviating the gradient vanishing problem and prioritizing examples that are more beneficial for the specific downstream task. In the EM-L approach, the retrieved examples are treated as latent variables, and an iterative process of Expectation-step and Maximization-step is employed until convergence (Dempster et al., 1977). The posterior distribution of the latent variable is estimated to measure the importance of candidate examples in the E-step, while the M-step maximizes the expected log-likelihood. By approximating the retrieval metric according to the posterior probability, more productive examples can be recalled for downstream tasks with limited training data.
Following a similar idea, R-L optimizes an additional ranking loss to provide more direct supervision to the example retriever, drawing inspiration from pairwise ranking algorithms (Freund and Schapire, 1997; Burges et al., 2005; Rudin and Schapire, 2009). Such a tailored loss measures the consistency between the retrieval metric and the auxiliary effect of each example on the classification. Minimizing this loss effectively strengthens the supervision signals for the example retriever.
Our experimental evaluation on ten text classification datasets demonstrates the superiority of EM-L and R-L over existing retrieval methods within a limited retrieval space. The comparative analyses further confirm that EM-L and R-L alleviate the weak supervision signals and the gradient vanishing issue suffered by joint learning-based retrieval. Our contributions can be summarized as follows: • We discuss the weak supervision signals and gradient vanishing problem encountered by existing retrieval methods that minimize the standard cross-entropy loss, as quantitatively measured in §5.2.
• We introduce two novel training objectives, namely EM-L and R-L, which optimize the retriever more effectively, thus recalling more productive examples from a limited space.
• Extensive experiments and analyses demonstrate that the proposed methods achieve better performance on few-shot text classification and alleviate the supervision insufficiency and gradient vanishing issues.
2 Revisiting Retrieval-augmented Methods in Few-shot Learning

Retrieval-augmented Methods
In this paper, we revisit retrieval-augmented methods in few-shot text classification and formulate the task in a general framework. Our primary objective is to retrieve examples from limited training data to improve few-shot text classification.
Model Formulation All retrieval methods comprise an example retriever and a text classifier. We provide a formal formulation inspired by Singh et al. (2021) and Izacard et al. (2022):

$$P(y \mid x) = \sum_{j=1}^{m} P_\theta(y \mid x, z_j)\, P_\phi(z_j \mid x)$$
$$P_\theta(y \mid x, z_j) = \mathrm{softmax}\big(f_{clf}(x \oplus z_j; \theta)\big)$$
$$P_\phi(z_j \mid x) = \mathrm{softmax}\big(f_{retr}(x, z_j; \phi)\big) \quad (1)$$

where x and z_j denote the representations of the original input and of a retrieved example from the training set, and y corresponds to the class associated with input x. f_clf and f_retr serve as the text classifier and the example retriever, which selects examples according to a retrieval metric. θ and ϕ denote the trainable parameters of the text classifier and the example retriever. m is a hyperparameter that denotes the number of fetched examples. The operation ⊕ signifies concatenation, and softmax refers to the normalized exponential function. Specifically, z corresponds to a set of retrieved examples, which can be either {⟨x_s, y_s⟩} pairs or {x_s}; the latter form is adopted in this paper for simplicity of experiments.
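To make the formulation concrete, the sketch below instantiates Eq. (1) in PyTorch. It is a minimal illustration under our own assumptions (module names, a bilinear retrieval metric, and single-instance tensor shapes), not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RetrievalAugmentedClassifier(nn.Module):
    """Sketch of Eq. (1): P(y|x) = sum_j P_theta(y|x, z_j) * P_phi(z_j|x)."""

    def __init__(self, hidden: int, num_classes: int):
        super().__init__()
        # f_clf: scores the classes from the concatenation x (+) z_j
        self.f_clf = nn.Linear(2 * hidden, num_classes)
        # f_retr: a trainable bilinear retrieval metric (an assumption here)
        self.f_retr = nn.Bilinear(hidden, hidden, 1)

    def forward(self, x: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # x: (hidden,) embedding of the input; z: (m, hidden) retrieved examples
        m = z.size(0)
        # P_phi(z_j|x): softmax over the retrieval scores of the m candidates
        retr_scores = self.f_retr(x.expand(m, -1), z).squeeze(-1)  # (m,)
        p_z = F.softmax(retr_scores, dim=-1)                       # (m,)
        # P_theta(y|x, z_j): classify each concatenated pair x (+) z_j
        pairs = torch.cat([x.expand(m, -1), z], dim=-1)            # (m, 2*hidden)
        p_y_given_z = F.softmax(self.f_clf(pairs), dim=-1)         # (m, C)
        # marginalize over the retrieved examples, the first line of Eq. (1)
        return (p_z.unsqueeze(-1) * p_y_given_z).sum(dim=0)        # (C,)
```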
The standard cross-entropy loss is employed to optimize the classifier and the example retriever:

$$\mathcal{L} = -\frac{1}{n} \sum_{i=1}^{n} \log \sum_{j=1}^{m} P_\theta(y_i \mid x_i, z_j)\, P_\phi(z_j \mid x_i) \quad (2)$$

where n is the total number of training instances and y_i is the gold label of the i-th instance. During inference, for all retrieval methods, we select the top m examples according to P_ϕ(z_j|x) and obtain the final classification results using the first line of Eq. (1).
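Continuing the sketch above, the per-instance loss of Eq. (2) is simply the negative log of the marginal probability (the epsilon term is our addition for numerical safety):

```python
import torch

def marginal_nll(p_y_given_z: torch.Tensor, p_z: torch.Tensor, gold: int) -> torch.Tensor:
    """Eq. (2) for one instance: -log sum_j P_theta(y_gold|x, z_j) P_phi(z_j|x).

    p_y_given_z: (m, C) classifier distributions; p_z: (m,) retrieval distribution.
    """
    marginal = (p_z * p_y_given_z[:, gold]).sum()  # scalar P(y_gold | x)
    return -torch.log(marginal + 1e-12)
```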
Static Retrieval Given an input sentence x and a retrieval corpus, static retrieval aims to search for a set of relevant examples Z according to a fixed retrieval metric (Borgeaud et al., 2022; Wang et al., 2022; Li et al., 2022). Following Eq. (1), its retrieval metric is defined as follows:

$$P_\phi(z_j \mid x) = \mathrm{softmax}\big(\mathrm{sim}(x, z_j)\big) \quad (3)$$

Here, sim(x, z_j) represents a fixed metric without any trainable parameters, such as TF-IDF (Sparck Jones, 1972), BM25 (Robertson et al., 2009), or semantic similarity encoded by PLMs. Such fixed metrics cannot adapt to the downstream task or prioritize the most helpful examples, and this limitation is amplified in few-shot learning with scarce training data.
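As an illustration, a static retriever over frozen sentence embeddings could look like the following sketch; the cosine metric and the embedding source are our assumptions, and any fixed metric such as TF-IDF or BM25 could take its place.

```python
import torch
import torch.nn.functional as F

def static_retrieve(x_emb: torch.Tensor, corpus_emb: torch.Tensor, m: int) -> torch.Tensor:
    """Return indices of the top-m examples under a fixed cosine metric.

    x_emb: (hidden,) frozen embedding of the input
    corpus_emb: (N, hidden) frozen embeddings of the retrieval corpus
    """
    sim = F.cosine_similarity(x_emb.unsqueeze(0), corpus_emb, dim=-1)  # (N,)
    return sim.topk(min(m, corpus_emb.size(0))).indices
```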
Joint Learning-based Retrieval Static retrieval assumes that a higher similarity between z_j and x implies a greater auxiliary effect of z_j on x. However, this assumption fails to hold in tasks where inputs with high similarity have distinct labels, such as sentiment classification. To address this limitation, joint learning-based retrieval (Cai et al., 2021; Gao et al., 2022; Siriwardhana et al., 2023) unifies the retriever and the downstream model and trains them jointly for the specific task. Following Eq. (1), f_retr(x, z_j) is a trainable dot-product attention. Notably, the absence of ground truth for P_ϕ(z_j|x) makes it challenging to determine which z_j is the most beneficial, so the retriever relies implicitly on distant supervision from text classification. Both static retrieval and joint learning-based retrieval were proposed to retrieve examples from a large-scale corpus. In this paper, we mainly focus on few-shot text classification and retrieve the most helpful examples from the limited training set.
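A minimal sketch of such a trainable dot-product metric is given below; the projection matrices and their dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DotProductRetriever(nn.Module):
    """Trainable dot-product attention metric: f_retr(x, z_j) = (W_q x) . (W_k z_j)."""

    def __init__(self, hidden: int):
        super().__init__()
        self.w_q = nn.Linear(hidden, hidden, bias=False)  # query projection
        self.w_k = nn.Linear(hidden, hidden, bias=False)  # key projection

    def forward(self, x: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # x: (hidden,), z: (m, hidden) -> unnormalized retrieval scores (m,)
        return (self.w_k(z) * self.w_q(x)).sum(dim=-1)
```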

Challenges in Few-shot Learning
While the above retrieval-augmented methods have shown advancements on various natural language processing tasks, their performance in few-shot learning remains unconvincing. In other words, retrieving examples from a narrow space to improve few-shot learning is still challenging due to limited training data. Previous studies (Li et al., 2022; Siriwardhana et al., 2023) have revealed that static retrieval may not fetch the most helpful examples in tasks where similar inputs correspond to different labels, primarily due to the unreasonable assumption that higher similarity implies better suitability for the downstream task. Moreover, we find that static retrieval even underperforms methods without retrieval on some few-shot tasks (see Table 1). Such failure can also be attributed to the data limitation in few-shot scenarios, where examples with high static similarity are scarce or non-existent.
In addition, joint learning-based retrieval methods (Ren et al., 2021; Cai et al., 2021; Siriwardhana et al., 2023) are good solutions to enhance the adaptability of the retrieval to downstream tasks. However, our study demonstrates that learnable metrics struggle to be trained as anticipated and are inferior to static metrics on several few-shot tasks (see Table 1). The main underlying factors are the scarcity of data and the weak supervision signals provided to the learnable retrieval metric. In more detail, the retrieval metrics in joint learning-based methods are adjusted solely based on distant supervision from the downstream task, which is further weakened significantly by the limited data. This is supported by quantifying the gradient of the retrieval parameters: on some datasets, the gradient norm of the retrieval metric's parameters exceeds 1e-6 for only about 40% of updates, as shown in Figure 2 (see §5.2 later).
In this paper, our objective is to meet the challenges of weak supervision signals for the retriever and insufficient data, aiming to retrieve the most helpful examples to promote model generalization.

Overview
Given the limitations posed by limited data and weak supervision signals, existing retrieval methods are inadequate in few-shot scenarios. To address these limitations, we propose two novel training objectives, realized as two loss functions: Expectation Maximization-based Loss (EM-L) and Ranking-based Loss (R-L). Both methods aim to enhance retrieval quality by giving the retriever more supervision signals and prioritizing examples that are more beneficial for the specific task with limited training data. In essence, we seek to maximize the consistency between the metric distribution P(z_j|x) and the classification distribution P(y|x, z_j)[y_i]. In this way, more suitable examples are retrieved and the performance of text classification can be improved even in the few-shot scenario. Additionally, we integrate EM-L, R-L, and two existing retrieval methods with two popular text classification backbones to compare their respective performance.

Backbone
Fine-tune Pre-trained Language Models For each sentence, we use PLMs to tokenize the input sentence into {[CLS], x_1, ..., x_l, [SEP]} with (l + 2) tokens and extract the representation x of [CLS] as the sentence embedding. In the same way, the j-th retrieved example is represented as z_j. These tensors are subsequently fed into the example retriever and the classifier, producing the final probability estimate for label y.
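A sketch of this encoding step with the HuggingFace transformers API follows; the checkpoint name is an illustrative assumption, and the encoder is frozen here only for brevity (it is fine-tuned in the actual setup).

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint
encoder = AutoModel.from_pretrained("bert-base-uncased")

def encode(sentence: str) -> torch.Tensor:
    """Return the [CLS] vector as the sentence embedding."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():  # frozen for brevity; fine-tuned in practice
        hidden = encoder(**inputs).last_hidden_state  # (1, l + 2, hidden)
    return hidden[0, 0]  # representation of the [CLS] token
```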
Prompt Learning Another backbone transforms text classification into a cloze question problem (Schick and Schütze, 2021). Let M be a masked language model with vocabulary V, and let Y denote the label set of a specific downstream task A. Prompt learning employs a function P to convert an input sentence into a phrase containing a prompt with a [MASK] token. Then an injective function v : Y → V is utilized to map each label to a word in M's vocabulary V. We first obtain the representation of [MASK] and then determine the most suitable label word from V for filling the [MASK]. For instance, the application of prompt learning to sentiment classification can be outlined as follows:

$$P(y \mid x_1, \ldots, x_l) = g\Big(P_{\mathcal{M}}\big([\mathrm{MASK}] = v(y) \mid \mathcal{P}(x_1, \ldots, x_l)\big)\Big) \quad (5)$$

where x is the representation of [MASK], g converts the probabilities of the label words to classes, and l is the sentence length. The representation z_j of a retrieved example is obtained from its [MASK] token in the same way.
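A sketch of the cloze-style prediction is shown below; the template, the verbalizer, and the checkpoint are illustrative assumptions rather than the templates used in the experiments (see Appendix C).

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
# v: injective verbalizer mapping each label to a word in the vocabulary V
verbalizer = {"positive": "great", "negative": "terrible"}

def prompt_classify(sentence: str) -> str:
    """Fill the [MASK] in the prompt and map the best label word back to a class."""
    prompt = f"{sentence} It was {tokenizer.mask_token}."  # assumed template P(x)
    inputs = tokenizer(prompt, return_tensors="pt")
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    logits = mlm(**inputs).logits[0, mask_pos]  # scores over the vocabulary V
    # g: compare the probabilities of the label words and return the winning class
    scores = {y: logits[tokenizer.convert_tokens_to_ids(w)].item()
              for y, w in verbalizer.items()}
    return max(scores, key=scores.get)
```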

Expectation Maximization-based Loss (EM-L)

Considering the absence of ground truth for P_ϕ(z_j|x) in Eq. (1), we regard z as a latent variable and propose an EM-based training objective to estimate P_ϕ(z_j|x). The method alternates between an Expectation-step and a Maximization-step until convergence. In the E-step, the current parameters are used to estimate the posterior distribution of the latent variable given the observed data. Specifically, we retrieve m examples from the training set and compute the conditional probabilities of the latent variable using:

$$P_{\theta,\phi}(z_j \mid x, y) = \frac{P_\theta(y \mid x, z_j)\, P_\phi(z_j \mid x)}{\sum_{k=1}^{m} P_\theta(y \mid x, z_k)\, P_\phi(z_k \mid x)} \quad (6)$$

where P_θ(y|x, z_j) and P_ϕ(z_j|x) are obtained from the classifier f_clf and the example retriever f_retr in Eq. (1), respectively, and m denotes the number of retrieved examples.
In the M-step, the parameters are updated by maximizing the expected log-likelihood, taken with respect to the posterior P_{θ,ϕ}(z_j|x, y) estimated in the E-step:

$$\max_{\theta, \phi} \sum_{i=1}^{n} \sum_{j=1}^{m} P_{\theta,\phi}(z_j \mid x_i, y_i) \log \big(P_\theta(y_i \mid x_i, z_j)\, P_\phi(z_j \mid x_i)\big) \quad (7)$$

Since we sample m examples from the training set by P_ϕ(z_j|x) and estimate P_{θ,ϕ}(z_j|x, y) based on these m examples in the E-step, more supervision is provided to the retriever during the optimization in the M-step. Please refer to Appendix A for a proof of the rationality of Eq. (6) and of why EM-L minimizes the likelihood-based loss defined in Eq. (2).
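The two steps can be fused into a single training loss, as in the per-instance sketch below. Detaching the posterior so that it stays fixed during the M-step update is our reading of Eqs. (6)-(7), not a detail confirmed by the paper.

```python
import torch

def em_l_loss(p_y_given_z: torch.Tensor, p_z: torch.Tensor, gold: int) -> torch.Tensor:
    """p_y_given_z: (m, C) classifier distributions P_theta(y|x, z_j);
    p_z: (m,) retrieval distribution P_phi(z_j|x); gold: index of the gold label."""
    joint = p_y_given_z[:, gold] * p_z  # (m,) joint P(y_gold, z_j | x)
    # E-step: posterior over the latent retrieved example, Eq. (6)
    posterior = (joint / (joint.sum() + 1e-12)).detach()
    # M-step: maximize the expected log-likelihood, Eq. (7) (negated for descent)
    return -(posterior * torch.log(joint + 1e-12)).sum()
```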

Ranking-based Loss (R-L)
Following the main idea stated in §3.1, Ranking-based Loss (R-L) treats the process of retrieving z_j as a ranking task. Unlike EM-L, R-L employs a ranking loss to enhance the consistency between P_θ(y|x, z_j)[y_i] and P_ϕ(z_j|x) and to provide more direct signals to the retriever. The optimization objective of R-L is to ensure that a z_j with higher P_θ(y|x, z_j)[y_i] also has higher P_ϕ(z_j|x), by minimizing the following L_R:

$$\mathcal{L}_R = \sum_{i=1}^{n} \sum_{\substack{j,k:\; P_\theta(y|x_i,z_j)[y_i] > P_\theta(y|x_i,z_k)[y_i]}} \max\big(0,\; \delta - P_\phi(z_j|x_i) + P_\phi(z_k|x_i)\big) \quad (8)$$

Here, P_θ(y|x, z_j) and P_ϕ(z_j|x) are obtained from f_clf and f_retr in Eq. (1); m and n denote the number of retrieved examples and of training instances. δ is a margin parameter imposing that the distance between the two distributions be larger than δ.
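A vectorized sketch of this pairwise margin loss for a single instance is given below; the exact pair construction and the default margin are our assumptions.

```python
import torch
import torch.nn.functional as F

def ranking_loss(p_y_gold: torch.Tensor, p_z: torch.Tensor, delta: float = 0.1) -> torch.Tensor:
    """p_y_gold: (m,) values P_theta(y|x, z_j)[y_i]; p_z: (m,) values P_phi(z_j|x)."""
    # entry (j, k) compares retrieved example j against example k
    helpful = p_y_gold.unsqueeze(1) - p_y_gold.unsqueeze(0)  # (m, m)
    metric = p_z.unsqueeze(1) - p_z.unsqueeze(0)             # (m, m)
    # where z_j helps the classifier more than z_k, require p_z[j] - p_z[k] >= delta
    violation = F.relu(delta - metric) * (helpful > 0).float()
    return violation.sum()
```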
The ranking loss L_R is added to the overall loss L in Eq. (2) with a weight λ every t steps:

$$\mathcal{L}' = \mathcal{L} + \lambda\, \mathcal{L}_R \cdot \mathbb{1}\big[\, step \bmod t = 0 \,\big] \quad (9)$$

where λ > 0 is a hyperparameter that trades off the two loss terms and step denotes the current training step.

Baselines To prove the effectiveness of retrieving examples from the training set, we develop a baseline method without retrieval for comparison. It comprises an input encoder, described in §3.2, and a feed-forward neural network for classification. To compare different retrieval methods, we evaluate our EM-L and R-L against static retrieval and joint learning-based retrieval, combining each of them with two widely used backbones for text classification: pre-trained language model fine-tuning and prompt learning. Please refer to Appendix C for further implementation details, such as hyper-parameters and the templates used in prompt learning.
Evaluation We evaluate all the retrieval methods using two metrics: Accuracy and Kendall's τ. Accuracy is the proportion of correctly classified instances out of the total number of instances. Kendall's τ is employed to measure the consistency and correlation between the retrieval metric P_ϕ(z|x_i) and its auxiliary P_θ(y|x_i, z)[y_i] for classification, and is defined as follows:

$$\tau_i = \frac{1}{m(m-1)} \sum_{j \neq k} \mathrm{sign}\big(P_\phi(z_j|x_i) - P_\phi(z_k|x_i)\big)\, \mathrm{sign}\big(P_\theta(y|x_i,z_j)[y_i] - P_\theta(y|x_i,z_k)[y_i]\big) \quad (10)$$

where sign(·) ∈ {−1, 0, 1} is the sign function. A ranking pair ⟨j, k⟩ is concordant if its ranks have the same order under P_ϕ(z|x_i) and P_θ(y|x_i, z)[y_i]. Consequently, a positive τ_i indicates a positive correlation between the two distributions, and vice versa.
For the n instances x_i in the training set, we calculate the proportion of instances with τ_i > 0 as follows:

$$\tau' = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\big[\, \tau_i > 0 \,\big] \quad (11)$$

The Kendall's τ reported in the following experiments is actually τ′, i.e., the proportion of instances with τ_i > 0.
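Both quantities are straightforward to compute; the sketch below follows Eqs. (10)-(11) under the same single-instance tensor conventions as above.

```python
import torch

def kendall_tau(p_z: torch.Tensor, p_y_gold: torch.Tensor) -> float:
    """tau_i for one instance: normalized sum of concordant minus discordant pairs."""
    m = p_z.size(0)
    s_metric = torch.sign(p_z.unsqueeze(1) - p_z.unsqueeze(0))         # (m, m)
    s_aux = torch.sign(p_y_gold.unsqueeze(1) - p_y_gold.unsqueeze(0))  # (m, m)
    # the diagonal contributes zero; there are m * (m - 1) ordered pairs
    return (s_metric * s_aux).sum().item() / (m * (m - 1))

def tau_prime(taus: list) -> float:
    """Eq. (11): proportion of instances with tau_i > 0."""
    return sum(t > 0 for t in taus) / len(taus)
```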

Main Results
The experimental results for the 16-shot setting on 10 datasets are reported in Table 1. The joint learning-based retrieval, EM-L, and R-L perform better than static retrieval, which is even less effective than the vanilla model. We hold that this is because static retrieval fetches some examples that have high semantic similarity but are detrimental to the downstream task. In contrast, the learnable retrieval methods, i.e., joint learning-based retrieval, EM-L, and R-L, are more likely to align with the goals of specific tasks. The EM-L and R-L approaches train the retriever more effectively than static retrieval and joint learning-based retrieval. First, our proposed EM-L and R-L achieve significantly higher accuracy across different backbones, proving their effectiveness in fetching helpful examples and adapting to specific downstream tasks. Furthermore, on average, R-L outperforms EM-L, potentially due to its utilization of a more direct ranking loss that provides stronger signals and more flexible guidance to the example retriever. Finally, it is worth noting that EM-L and R-L show smaller standard deviations on most datasets than other methods; we conjecture that the proposed training objectives enhance the stability of generalization by incorporating retrieval memory alongside parameters.
The advantages of EM-L and R-L are more pronounced on challenging tasks, such as sentence pair classification and aspect-based sentiment analysis. In this regard, EM-L and R-L achieve improvements of more than 0.3 on most datasets for sentence pair classification and ABSA, whereas the improvement on single-sentence classification ranges from 0.1 to 0.2, a gap that further highlights the effectiveness of EM-L and R-L.

Consistency Experiments
The Kendall's τ′ defined in Eq. (11) is reported for selected datasets in Table 2; it measures the consistency between the retrieval metric of the fetched examples and their auxiliary effect on the downstream task.

Auxiliary Experiment
We further conduct additional experiments in both the 8-shot and full-supervision settings to investigate the advantages of EM-L and R-L at different data scales. The results are presented in Table 3 and Table 4, respectively. EM-L and R-L consistently exhibit excellent performance in both settings.
In particular, we note a more significant improvement of our methods in the 8-shot setting, which indicates that the proposed objectives train the retriever more effectively, especially when training data is scarce. Moreover, another interesting phenomenon emerges: although EM-L and R-L achieve higher Kendall's τ′ in the full-supervision setting, their improvements on text classification are smaller than those in few-shot scenarios. We attribute this to the fact that the classifier in the full-supervision setting is already well trained, so the potential improvement from a better retrieval memory is relatively limited.

Effects of the Number of Retrieved Examples
To examine the effect of the number m on various retriever training methods, we present line charts in Fig. 1 that depict the relationship between Accuracy and m. First, all the charts demonstrate that retrieving examples enhances the performance of few-shot text classification, except for the slightly lower accuracy of static retrieval and joint learning-based retrieval at specific values of m, which could be attributed to the instability of their training process. Second, most methods achieve their peak performance at m = 5 or m = 10; as m continues to increase, the performance may start to deteriorate, and we conjecture that retrieving too many examples increases the training difficulty. Third, we observe that EM-L and R-L maintain sustained advantages and stability as m varies, which verifies their stronger supervision signals. Another observation is that the joint learning-based method falls behind the static method on LAP, suggesting that on certain tasks a poorly trained learnable metric can perform even worse than a static one.

Gradient Updates
To assess the supervision signals exerted on the retrievers by different methods, we quantify the average gradients of all the retriever's parameters. This measurement allows us to evaluate the guidance provided by each method to the retriever during training. Fig. 2 illustrates the percentage of training steps in which the average gradient of the retriever's parameters exceeds the threshold of 1e-6.
For clarity, we exclude static retrieval from this figure since its retriever has no trainable parameters.¹ Our analysis reveals that on certain datasets, the gradient norm of the joint learning-based retriever exceeds the threshold of 1e-6 for only about 40% of the steps, whereas EM-L and R-L surpass this threshold in over 60% of the steps. This observation suggests that both static and joint learning-based retrieval provide weaker supervision signals to the retrievers and suffer from a severe gradient vanishing issue in few-shot text classification, while EM-L and R-L alleviate these limitations.

¹ This corresponds to a constant proportion of zero for steps with a gradient norm exceeding 1e-6.
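The measurement itself is easy to reproduce during training; the sketch below is our assumed instrumentation (the 1e-6 threshold is from the paper, the loop details are not).

```python
import torch

def grad_exceeds(retriever: torch.nn.Module, threshold: float = 1e-6) -> bool:
    """Call after loss.backward(): whether the average gradient norm of the
    retriever's parameters exceeds the threshold at the current step."""
    norms = [p.grad.norm().item() for p in retriever.parameters()
             if p.grad is not None]
    return bool(norms) and sum(norms) / len(norms) > threshold

# Over a full run, the fraction sum(flags) / len(flags) of steps where this
# returns True is the quantity plotted in Fig. 2.
```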

Case Study
Finally, we present an illustrative example from the LAP dataset, along with the examples retrieved by different methods, in Fig. 3. In the input sentence, the aspect term "startup times" is negative. Although static retrieval fetches a semantically similar example, that example contains information that could mislead the sentiment prediction, such as the term "spectacular". Joint learning-based retrieval fetches an example that seems unrelated to the input sentence, possibly indicating that weak supervision signals for the retriever lead to worse retrieval results. In contrast, our EM-L and R-L methods are capable of retrieving examples that may not possess high semantic similarity but are more beneficial for sentiment prediction.
6 Related Work

Retrieval-augmented Methods
Retrieval-augmented methods enhance the ability of pre-trained language models on various natural language tasks by fetching relevant examples from the training set or an external knowledge base and prepending them to the original input. These methods have improved the performance of many tasks, such as neural machine translation (Zhang et al., 2018; Cai et al., 2021; Li et al., 2022; Wang et al., 2022), question answering (Li et al., 2020; Karpukhin et al., 2020; Singh et al., 2021; Wang et al., 2022; Siriwardhana et al., 2023; Li et al., 2023; Hofstätter et al., 2023), dialog generation (Fan et al., 2021; Thulke et al., 2021; King and Flanigan, 2023), text classification (Izacard et al., 2022; Lewis et al., 2020), keyphrase generation (Gao et al., 2022), etc. According to their retrieval metrics, these methods can be categorized into static retrieval methods and joint learning-based methods, which use a fixed retrieval metric and a jointly learnable metric, respectively. Different from the above methods, which fetch relevant examples from a large-scale corpus, we propose two novel training objectives to retrieve examples from a restricted retrieval space and analyze their advantages. Following Singh et al. (2021) and Izacard et al. (2022), we formulate the retrieval-augmented methods into a retriever and a classifier in Eq. (1) for a fair comparison.

Figure 2: The proportion of steps in which the average gradient of all the retriever's parameters exceeds 1e-6.

Conclusion
This paper studies retrieval-augmented methods for few-shot text classification and demonstrates the challenges that hinder their success: it is impossible to retrieve semantically similar examples by using an off-the-shelf metric, and it is difficult to optimize a plausible metric by minimizing the standard cross-entropy loss. Accordingly, it proposes two novel training objectives, EM-L and R-L, which provide stronger supervision signals to train the retrieval metric effectively in few-shot scenarios. It is worth mentioning that the idea of searching within limited examples bears similarity to the concept of demonstration selection in recent large language models (LLMs). Exploring the application of our methods to LLMs holds promise for future research.
Figure 3 (content):

Input: Startup times are incredibly long: over two minutes. The sentiment polarity of startup times was <mask>.

Static retrieval (prediction: positive). Retrieved example: The internet speed is spectacular. The sentiment polarity of internet speed was <mask>.

Joint retrieval (prediction: positive). Retrieved example: That included the extra Sony Sonic Stage software, the speakers and the subwoofer I got (that WAS worth the money), the bluetooth mouse for my supposedly bluetooth enabled computer, the extended life battery and the docking port. The sentiment polarity of docking port was <mask>.

EM-L (prediction: negative). Retrieved example: Its not just slow on the internet, its slow in general. The sentiment polarity of internet was <mask>.

R-L (prediction: negative). Retrieved example: Another thing is that after only a month the keyboard broke and it costed $175 to send it in to fix it. The sentiment polarity of keyboard was <mask>.
Limitations
There are three primary limitations to our methods. Firstly, EM-L and R-L require additional training time compared to existing retrieval methods, owing to the alternation between the E-step and the M-step in EM-L and the optimization of an additional loss in R-L. Specifically, the training time of EM-L per epoch is approximately 1.5 times that of static retrieval and 1.2 times that of joint learning-based retrieval. Similarly, the training time of R-L per epoch is about 1.8 times that of static retrieval and 1.5 times that of joint learning-based retrieval.
Although our proposed methods require more time, the overhead still falls within an acceptable range. Secondly, we did not focus on designing more sophisticated templates for prompt engineering, as our main emphasis was on exploring different retrieval methods. Thirdly, we evaluate our methods in few-shot settings constructed from widely used datasets rather than in real-world scenarios, which could limit the generalizability of our findings to practical applications.
A Proof of the Rationality of Eq. (6)

The last step follows from Jensen's inequality, and it holds with equality if and only if Q(z_j) is proportional to P_{θ,ϕ}(y, z_j|x_i), i.e.,

$$Q(z_j) = \frac{P_{\theta,\phi}(y, z_j \mid x_i)}{c}$$

where c is a constant. Since Σ_j Q(z_j) = 1, summing over z_j on both sides of the equation gives

$$c = \sum_j P_{\theta,\phi}(y, z_j \mid x_i) = P_{\theta,\phi}(y \mid x_i)$$

so that

$$Q(z_j) = \frac{P_{\theta,\phi}(y, z_j \mid x_i)}{P_{\theta,\phi}(y \mid x_i)} = P_{\theta,\phi}(z_j \mid x_i, y)$$

which is exactly the posterior estimated in the E-step. Substituting c and Q(z_j) back yields a lower bound of Σ_i^n log P_{θ,ϕ}(y|x_i). Since P_{θ,ϕ}(y, z_j|x_i) = P_θ(y|x_i, z_j) P_ϕ(z_j|x_i), the bound can be further simplified to the expected log-likelihood maximized in Eq. (7). Further proof of the convergence and of the equivalence of the two original optimizations follows the standard proof of the EM algorithm and is omitted here.

B.2 Few-shot Datasets
Following the few-shot setting of Gao et al. (2021), we randomly select 16 or 8 examples from the training set to create the 16-shot or 8-shot experiments. Specifically, we generate five distinct few-shot datasets using different seeds and train models on each of them. It is noted that we use the same five seeds across different datasets and retrieval methods to conduct a fair comparison. The best model is chosen based on the validation results, and the average evaluation scores on the original test set are reported.

C.1 Hyper-parameter Selection
We adopt grid search to choose the hyper-parameters of the different methods. Specifically, the learning rates are taken from {1e-5, 2e-5, 5e-5}, the batch sizes from {4, 8, 16}, and the numbers of retrieved examples from {5, 10, 15}. The parameter t, which determines the update frequency of the loss L_R, is searched from {5, 10, 15}.
The loss coefficient λ in the ranking-based loss is selected from {0.5, 1, 2}. For each dataset, we set the maximum number of training steps to 800 and use early stopping to avoid over-fitting. In each trial, we validate the model every epoch and save the best checkpoint. We adopt the AdamW optimizer and accumulate gradients for each batch. The code is imple-
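For reference, the search space described above can be enumerated as follows; this is only a sketch of the grid, and the key names are our own.

```python
from itertools import product

grid = {
    "learning_rate": [1e-5, 2e-5, 5e-5],
    "batch_size": [4, 8, 16],
    "m": [5, 10, 15],       # number of retrieved examples
    "t": [5, 10, 15],       # update frequency of the ranking loss L_R
    "lambda": [0.5, 1, 2],  # coefficient of L_R (only used by R-L)
}
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
```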

Figure 1: Effects of the number m of retrieved examples. The results are the average Accuracy on the validation set.

6.3 Few-shot Text Classification

Few-shot text classification trains a classifier with limited data for each class, which can also predict unseen classes. Existing studies on few-shot text classification encompass various approaches, such as prototypical networks (Jake et al., 2017), XLNet-based methods (Zhilin et al., 2019), (Ro)BERT(a)-based methods (Chen et al., 2020, 2022a), pattern-exploiting training (Schick and Schütze, 2021), prompt tuning (Lester et al., 2021; Gao et al., 2021), etc. Common sub-tasks in text classification include intention classification, topic classification, sentiment classification, etc. We evaluate our methods on different text classification tasks, with a focus on adapting the idea of retrieval-augmented methods to few-shot scenarios through the design of new training objectives.

Figure 3: Case study. "Input" denotes an input sentence from LAP, "Predictions" represents the predicted sentiment polarities of the different methods, and "Retrieved Examples" shows the fetched examples with the highest metric in the training set. "Labels for Retrieved Example" denotes the sentiment labels of the fetched examples.

Table 5: Dataset details. The column labeled "Train" represents the number of instances in the original training set, while "Test" denotes the number of instances in the test set. The "Type" column describes the task type associated with each dataset.