Few-Shot Text Ranking with Meta Adapted Synthetic Weak Supervision

The effectiveness of Neural Information Retrieval (Neu-IR) often depends on large-scale in-domain relevance training signals, which are not always available in real-world ranking scenarios. To democratize the benefits of Neu-IR, this paper presents MetaAdaptRank, a domain adaptive learning method that generalizes Neu-IR models from label-rich source domains to few-shot target domains. Drawing on massive source-domain relevance supervision, MetaAdaptRank contrastively synthesizes a large number of weak supervision signals for target domains and meta-learns to reweight these synthetic "weak" data based on their benefits to the target-domain ranking accuracy of Neu-IR models. Experiments on three TREC benchmarks in the web, news, and biomedical domains show that MetaAdaptRank significantly improves the few-shot ranking accuracy of Neu-IR models. Further analyses indicate that MetaAdaptRank thrives on both its contrastive weak data synthesis and meta-reweighted data selection. The code and data of this paper can be obtained from https://github.com/thunlp/MetaAdaptRank.


Introduction
Text retrieval aims to rank documents to either directly satisfy users' search needs or find textual information for later processing components, e.g., question answering (Chen et al., 2017) and fact verification (Liu et al., 2020). Neural information retrieval (Neu-IR) models have recently shown advanced results in many ranking scenarios where massive relevance labels or clickthrough data are available (Mitra et al., 2018; Craswell et al., 2020).
The flip side is that the "data-hungry" nature of Neu-IR models yields mixed results in few-shot ranking scenarios that suffer from the shortage of labeled data and implicit user feedback (Lin, 2019; Yang et al., 2019). On ranking benchmarks with only hundreds of labeled queries, there have been debates about whether Neu-IR, even with billions of pre-trained parameters (Zhang et al., 2020a), really outperforms traditional IR techniques such as feature-based models and latent semantic indexing (Yang et al., 2019; Roberts et al., 2020). In fact, many real-world ranking scenarios are few-shot, e.g., tail web queries that innately lack large supervision (Downey et al., 2007), applications with strong privacy constraints like personal and enterprise search (Chirita et al., 2005; Hawking, 2004), and domains where labeling requires professional expertise such as biomedical and legal search (Roberts et al., 2020; Arora et al., 2018).
To broaden the benefits of Neu-IR to few-shot scenarios, we present MetaAdaptRank, an adaptive learning method that meta-learns to adapt Neu-IR models to target domains with synthetic weak supervision. For synthesizing weak supervision, we take inspiration from Ma et al. (2021), who generate related queries for unlabeled documents in a zero-shot way, but we generate discriminative queries based on contrastive pairs of relevant (positive) and irrelevant (negative) documents. By introducing the negative contrast, MetaAdaptRank can subtly capture the differences between documents and synthesize more ranking-aware weak supervision signals. Given that synthetic weak supervision inevitably contains noise, MetaAdaptRank meta-learns to reweight these synthetic weak data and trains Neu-IR models to achieve the best accuracy on a small volume of target data. In this way, neural rankers can distinguish more useful synthetic weak supervision based on the similarity between the gradient directions of synthetic data and target data (Ren et al., 2018), instead of relying on manual heuristics or trial-and-error data selection (Zhang et al., 2020b).
We conduct experiments on three TREC benchmarks, ClueWeb09, Robust04, and TREC-COVID, which come from the web, news, and biomedical domains, respectively. MetaAdaptRank significantly improves the few-shot ranking accuracy of Neu-IR models across all benchmarks. We also empirically show that both contrastive weak data synthesis and meta-reweighted data selection contribute to MetaAdaptRank's effectiveness. Compared to prior work (Ma et al., 2021; Zhang et al., 2020b), MetaAdaptRank not only synthesizes more informative queries and effective weak relevance signals but also customizes more diverse and fine-grained weights on synthetic weak data to better adapt neural rankers to few-shot target domains.
Related Work

The effectiveness of Neu-IR methods relies heavily on end-to-end training with a large number of relevance supervision signals, e.g., relevance labels or user clicks. Nevertheless, such supervision signals are often insufficient in many ranking scenarios. The scarcity of relevance supervision pushes some Neu-IR methods to freeze their embeddings to avoid overfitting (Yates et al., 2020). Powerful deep pre-trained language models, such as BERT (Devlin et al., 2019), also do not effectively alleviate the dependence of Neu-IR on large-scale relevance training signals. Recent research even observes that BERT-based neural rankers might require more training data than shallow neural ranking models (Hofstätter et al., 2020; Craswell et al., 2020). Moreover, they may often be overly confident and more unstable in the learning process (Qiao et al., 2019).
A promising direction to alleviate the dependence of Neu-IR models on large-scale relevance supervision is to leverage weak supervision signals that are noisy but available in mass quantity (Zheng et al., 2019b; Dehghani et al., 2017; Yu et al., 2020). Throughout IR history, various weak supervision sources have been used to approximate query-document relevance signals, e.g., pseudo relevance labels generated by unsupervised retrieval methods (Dehghani et al., 2017; Zheng et al., 2019b) and title-document pairs (MacAvaney et al., 2019). Recently, Zhang et al. (2020b) treat paired anchor texts and linked pages as weak relevance signals and propose a reinforcement-based data selection method, ReInfoSelect, which learns to filter noisy anchor signals with trial-and-error policy gradients. Despite their convincing results, anchor signals are only available in web domains; directly applying them to non-web domains may yield suboptimal outcomes due to domain gaps. To obtain weak supervision that adapts to arbitrary domains, Ma et al. (2021) present a synthetic query generation method, which can be trained with source-domain relevance signals and applied to target-domain documents to generate related queries.
More recently, meta-learning techniques have shown encouraging progress in mitigating data noise and label bias in computer vision (Ren et al., 2018; Shu et al., 2019; Zheng et al., 2019a) and some NLP tasks (Zheng et al., 2019a; Wang et al., 2020b). To the best of our knowledge, these techniques have not been well utilized in information retrieval and synthetic supervision settings.

Methodology
This section first recaps the preliminary of Neu-IR and then introduces our proposed MetaAdaptRank. The framework of our method is shown in Figure 1.

Preliminary of Neu-IR
The ad-hoc retrieval task is to calculate a ranking score f(q, d; θ) for a query q and a document d from a document set. In Neu-IR, the ranking score f(·; θ) is calculated by a neural model, e.g., BERT, with parameters θ. The query q and the document d are encoded to the token-level representations H:

H = BERT([CLS] • q • [SEP] • d • [SEP]),    (1)

where • represents the concatenation operation and [CLS] and [SEP] are special tokens. The first token ([CLS]) representation H_0 is regarded as the representation of the q-d pair. The ranking score f(q, d; θ) of the pair is then calculated as:

f(q, d; θ) = tanh(Linear(H_0)).    (2)
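As a concrete sketch of the scoring head in Eq. 2 (not the authors' implementation): the 768-dimensional vector here is a random stand-in for BERT's [CLS] representation, which in the real model comes from the encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

def ranking_score(h_cls: np.ndarray, W: np.ndarray, b: float = 0.0) -> float:
    """f(q, d; theta) = tanh(Linear(H_0)): a linear layer with tanh
    squashing applied to the [CLS] representation."""
    return float(np.tanh(h_cls @ W + b))

# Hypothetical 768-dim [CLS] vector standing in for the BERT encoder output.
h_cls = rng.standard_normal(768)
W = rng.standard_normal(768) * 0.01
score = ranking_score(h_cls, W)
assert -1.0 <= score <= 1.0  # tanh bounds the score to (-1, 1)
```

The tanh keeps scores in a bounded range, which plays well with margin-based pairwise losses.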
The standard learning-to-rank loss l_i(θ) (Liu, 2009), e.g., the pairwise loss, can be used to optimize the neural model with relevance supervision signals:

l_i(θ) = max(0, 1 − f(q_i, d_i^+; θ) + f(q_i, d_i^-; θ)),    (3)

Figure 1: The illustration of MetaAdaptRank, which first synthesizes massive weak supervision signals for target domains, and then meta-learns to reweight these synthetic data based on small target-domain relevance labels.
where d_i^+ and d_i^- denote the relevant (positive) and irrelevant (negative) documents of the query q_i. In few-shot ranking scenarios, the number of relevance supervision signals (M) is limited, making it difficult to train an accurate Neu-IR model.
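The pairwise loss can be sketched in a few lines; the margin of 1 is an assumption here, since the text only names "pairwise loss" without fixing the margin.

```python
def pairwise_hinge_loss(score_pos: float, score_neg: float,
                        margin: float = 1.0) -> float:
    """l_i(theta) = max(0, margin - f(q_i, d_i^+; theta) + f(q_i, d_i^-; theta)).

    The loss is zero once the positive document outscores the negative
    one by at least the margin; otherwise it grows linearly.
    """
    return max(0.0, margin - score_pos + score_neg)

# Positive already outscores negative by more than the margin -> zero loss.
assert pairwise_hinge_loss(0.9, -0.8) == 0.0
# Negative outscores positive -> loss 1.0 - 0.1 + 0.3 = 1.2.
assert abs(pairwise_hinge_loss(0.1, 0.3) - 1.2) < 1e-9
```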
To mitigate the few-shot challenge in Neu-IR, MetaAdaptRank first transfers source-domain supervision signals to target-domain weak supervision signals (Sec 3.2); then meta-learns to reweight the synthetic weak supervision (Sec 3.3) for selectively training Neu-IR models (Sec 3.4).

Contrastive Synthetic Supervision
MetaAdaptRank transfers relevance supervision signals from source domains to few-shot target domains in a zero-shot way: a natural language generation (NLG) model is trained on source-domain relevance signals (Source-domain NLG Training) and is then employed in target domains to synthesize weak supervision signals (Target-domain NLG Inference). We first recap the previous synthetic method (Ma et al., 2021) and then introduce our contrastive synthetic approach.
Preliminary of Synthetic Supervision. Given a large volume of source-domain relevance pairs (q, d^+), the previous synthetic method (Ma et al., 2021) trains an NLG model such as T5 (Raffel et al., 2020) that learns to generate a query q based on its relevant document d^+:

q ← NLG([POS] • d^+ • [SEP]),    (4)

where [POS] and [SEP] are special tokens. In inference, the trained query generator is directly used to generate new queries q* for target-domain documents d*, where d* is regarded as the related (positive) document of q*, while the unrelated (negative) document can be sampled from the target corpus. Despite some promising results, this vanilla training strategy may cause the NLG model to prefer broad and general queries that are likely related to a crowd of documents in the target corpus. As a consequence, the synthetic relevance supervision does not have enough ranking awareness to train robust Neu-IR models.
Source-domain NLG Training. To synthesize ranking-aware weak supervision, MetaAdaptRank trains the NLG model to capture the difference between the contrastive document pair (d^+, d^-) and generate a discriminative query q:

q ← NLG([POS] • d^+ • [SEP] • [NEG] • d^- • [SEP]),    (5)

where [NEG] is another special token. The training instances (q, d^+, d^-) can be obtained from source domains in which d^+ and d^- are annotated as the relevant and irrelevant documents for the query q.
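The two input formats (Eqs. 4-5) can be sketched as plain string templates; the exact whitespace and tokenizer handling of the special tokens are assumptions here, not the authors' exact preprocessing.

```python
def vanilla_input(d_pos: str) -> str:
    """Encoder input of the vanilla method (Eq. 4): [POS] d+ [SEP];
    the generation target is the query q."""
    return f"[POS] {d_pos} [SEP]"

def contrastive_input(d_pos: str, d_neg: str) -> str:
    """Contrastive encoder input (Eq. 5): [POS] d+ [SEP] [NEG] d- [SEP];
    the NLG model must attend to the d+/d- contrast to generate q."""
    return f"[POS] {d_pos} [SEP] [NEG] {d_neg} [SEP]"

src = contrastive_input("solar panels cut energy bills",
                        "coal plants raise emissions")
assert src.count("[SEP]") == 2 and "[NEG]" in src
assert vanilla_input("doc text") == "[POS] doc text [SEP]"
```

The only structural change from Eq. 4 to Eq. 5 is the appended [NEG] segment, so the same T5 encoder-decoder can be trained on either format.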
Target-domain NLG Inference. During inference, we first pick out a mass of confusable document pairs from target domains and then feed them into our trained contrastive query generator (Eq. 5) to synthesize more valuable weak supervision data.
To get confusable document pairs, we first generate a seed query q* for each target-domain document d* using the trained query generator (Eq. 4). Then the seed query is used to retrieve a subset of documents with BM25, where other retrieval methods can also be utilized. The confusable document pairs (d^+, d^-) are pairwise sampled from the retrieved subset without considering their rankings. Given a confusable document pair, we leverage our trained contrastive query generator to generate a new query q':

q' ← NLG([POS] • d^+ • [SEP] • [NEG] • d^- • [SEP]),    (6)

where d^+ and d^- are regarded as the related (positive) and unrelated (negative) documents of q'. In this way, we can synthesize massive target-domain weak supervision signals.
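The pair-mining step above can be sketched as follows. The term-overlap retriever is only a stand-in for BM25 (which the paper actually uses), and the corpus is a toy example.

```python
import itertools

def term_overlap_retrieve(seed_query: str, corpus: list[str],
                          k: int = 4) -> list[int]:
    """Stand-in first-stage retriever (the paper uses BM25): rank
    documents by how many seed-query terms they contain, keep top k."""
    q_terms = set(seed_query.lower().split())
    scored = [(len(q_terms & set(doc.lower().split())), i)
              for i, doc in enumerate(corpus)]
    return [i for _, i in sorted(scored, reverse=True)[:k]]

def confusable_pairs(doc_ids: list[int]) -> list[tuple[int, int]]:
    """Pairwise-sample (d+, d-) candidates from the retrieved subset,
    ignoring the retrieval ranking, as described above."""
    return list(itertools.permutations(doc_ids, 2))

corpus = ["history of bermuda tourism",
          "bermuda travel guide",
          "covid lung symptoms"]
subset = term_overlap_retrieve("bermuda tourism", corpus, k=2)
pairs = confusable_pairs(subset)
assert len(pairs) == 2  # 2 ordered pairs from 2 retrieved documents
```

Each (d+, d-) pair is then fed through the contrastive generator (Eq. 6) to produce a new query.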

Meta Learning to Reweight
The synthetic weak data inevitably contain noise. To distinguish more useful training data for neural rankers, MetaAdaptRank meta-learns to reweight these synthetic data, following Ren et al. (2018). Given the synthetic weak supervision data and a small amount of target-domain relevance data, our meta-learning objective is to find the optimal weights w* on the synthetic data to better train neural rankers. The learning of w* involves two nested loops of optimization: the initially-weighted synthetic data are used to pseudo-optimize the neural ranker; the weights are then optimized by minimizing the neural ranking loss on the target data.
To be specific, the first loop (Meta-forward Update) incorporates the initial weights w into the learning parameters θ(w) instead of truly optimizing the neural ranker:

θ(w) = argmin_θ Σ_j w_j · l_j(θ),    (7)

where l_j(θ) is the ranking loss on a synthetic instance (q_j, d_j^+, d_j^-). In the second loop (Meta-backward Update), the optimal weights w* can be obtained by minimizing the target ranking loss:

w* = argmin_w Σ_i l_i(θ(w)),    (8)

where l_i(θ) is the ranking loss on a target instance (q_i, d_i^+, d_i^-). The exact calculation of each loop can be very expensive. In practice, we only perform one-step optimization in the two loops with mini-batch data, consistent with prior work (Ren et al., 2018).
Meta-forward Update. Taking the t-th training step as an example, we first assign a set of initial weights w = {w_j}_{j=1}^n to the synthetic training data batch and then pseudo-update the neural ranker's parameters to θ_{t+1}(w):

θ_{t+1}(w) = θ_t − α ∇_θ Σ_{j=1}^n w_j · l_j(θ_t),    (9)

where α is the learning rate. The description here uses vanilla SGD; other optimizers can also be used.

Meta-backward Update. We leverage the neural ranker θ_{t+1}(w) to calculate the ranking loss on the target data batch and obtain the optimal weights w* = {w*_j}_{j=1}^n through a single optimization step:

w* = w − η ∇_w Σ_{i=1}^m l_i(θ_{t+1}(w)),    (10)

where η is the learning rate for optimizing the weights. The weights are further normalized for stable training. More details are shown in Appendix A.1.
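The two updates can be made concrete with a toy scalar model where the second-order gradient of Eq. 10 is analytic. This is only an illustrative analogue of the mechanism, not the BERT-ranker implementation (which relies on automatic differentiation); the squared losses and all numbers below are assumptions for the sketch.

```python
import numpy as np

def meta_reweight(theta0: float, y_syn: np.ndarray, y_tgt: float,
                  alpha: float = 0.1, eta: float = 1.0) -> np.ndarray:
    """One meta-forward/meta-backward step (Ren et al., 2018) for a toy
    scalar model with squared losses l(theta) = (theta - y)^2.

    Forward (Eq. 9):  theta(w) = theta0 - alpha * sum_j w_j * dl_j/dtheta.
    Backward (Eq. 10): w_j = -eta * dL_tgt(theta(w))/dw_j at w = 0, which
    here reduces analytically to 4*alpha*eta*(theta0 - y_tgt)*(theta0 - y_j).
    """
    g_syn = 2 * (theta0 - y_syn)   # per-example synthetic gradients
    g_tgt = 2 * (theta0 - y_tgt)   # target gradient at theta0
    # Weight ~ alignment between synthetic and target gradient directions.
    return eta * alpha * g_tgt * g_syn

y_syn = np.array([1.0, -1.0])  # one "useful" and one "noisy" synthetic label
w = meta_reweight(theta0=0.0, y_syn=y_syn, y_tgt=1.0)
# The synthetic example whose gradient points toward the target
# gets positive weight; the conflicting one is down-weighted.
assert w[0] > 0 and w[1] < 0
```

The key property carried over from the full method: a synthetic example receives a large weight exactly when its gradient direction agrees with the target data's gradient. Negative raw weights are clipped to zero during normalization (Appendix A.1).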

Training with Meta-Weights
After obtaining the optimal weights w*, the optimization of the neural ranker is a standard back-propagation on the weighted loss of the synthetic data:

θ_{t+1} = θ_t − α ∇_θ Σ_{j=1}^n w*_j · l_j(θ_t).    (11)

In each training step, MetaAdaptRank first learns to reweight the synthetic batch based on its meta-impact on the target batch and then updates the neural ranker with the weighted synthetic batch. In this way, the few-shot target data serve more as a "regularizer" that helps the neural ranker generalize with synthetic data, rather than as direct supervision, which would require more labels (Ren et al., 2018).

Experimental Methodology
This section describes our experimental settings and implementation details.
Datasets. As shown in Table 1, three standard TREC datasets from different domains are used in our experiments: ClueWeb09-B (Callan et al., 2009), Robust04 (Kwok et al., 2004), and TREC-COVID (Roberts et al., 2020). They are all few-shot ad-hoc retrieval datasets where the number of labeled queries is limited. We leverage the "Complete" version of TREC-COVID, whose retrieval document set is the July 16, 2020 release of CORD-19 (Wang et al., 2020a), a growing collection of scientific papers on COVID-19 and related research.
Evaluation Settings. We evaluate supervised IR methods through re-ranking the top 100 documents from the first-stage retrieval with five-fold cross-validation, consistent with prior work (Xiong et al., 2017a;Dai and Callan, 2019;Zhang et al., 2020b). The first-stage retrieval for ClueWeb09-B and Robust04 is the sequential dependence model (SDM) (Metzler and Croft, 2005) released by Dai and Callan (2019), and the first-stage retrieval for TREC-COVID is BM25 (Robertson and Zaragoza, 2009) well-tuned by Anserini (Yang et al., 2017).
Metrics. NDCG@20 is used as the primary metric for all datasets. We also report ERR@20 for ClueWeb09-B and Robust04, the same as prior work (Zhang et al., 2020b), and P@20 for TREC-COVID. Statistical significance is examined with a permutation test at p < 0.05.
Baselines. Two groups of baselines are compared in our experiments, including Traditional IR Baselines and Neural IR Baselines.
Neural IR Baselines. We also compare seven Neu-IR baselines that utilize different methodologies to train neural rankers. In our experiments, all Neu-IR methods adopt the widely-used BERT ranker (Nogueira and Cho, 2019), BERT-FirstP, which only uses the first paragraph of documents.
The vanilla neural baseline only leverages the existing small-scale relevance labels of target datasets to train BERT rankers, which is named Few-shot Supervision. We also compare BERT rankers trained with two large-scale supervision sources: Bing User Click and MS MARCO. Dai and Callan (2019) train BERT rankers with 5 million user click logs in Bing. We borrow their reported results because commercial logs are not publicly available. MS MARCO is a human supervision source (Nguyen et al., 2016), which provides over one million Bing queries with relevance labels.
Four weak supervision methods are also compared. One baseline is Title Filter, which treats filtered title-document pairs as weak supervision signals (MacAvaney et al., 2019) for training BERT rankers (Zhang et al., 2020b). Another two baselines are Anchor and ReInfoSelect. Anchor leverages 100k pairs of anchor texts and web pages to train BERT rankers (Zhang et al., 2020b). ReInfoSelect first employs reinforcement learning to select these anchor signals (Zhang et al., 2020b) and then trains BERT rankers. The last baseline is SyncSup, which trains BERT rankers on synthetic supervision generated by the vanilla query generation method (Ma et al., 2021). All BERT rankers are implemented with the Transformers library (Wolf et al., 2020).
For all Neu-IR methods, we first use additional supervision sources, such as weak supervision signals, to train BERT rankers (except for Few-shot Supervision); we then fine-tune the BERT rankers with the training folds of the target datasets in the cross-validation. Following prior work (Dai and Callan, 2019; Zhang et al., 2020b), the ranking features ([CLS] embeddings) of BERT are combined with the first-stage retrieval scores using Coor-Ascent for ClueWeb09-B and Robust04. We set the max input length to 512 and use the Adam optimizer with a learning rate of 2e-5 and a batch size of 8.
Contrastive Supervision Synthesis. We use the small version of T5 (60 million parameters) as the NLG models in MetaAdaptRank, and leverage MS MARCO as the training data for T5-NLG models. We set the maximum input length to 512 and use Adam to optimize the T5-NLG models with a learning rate of 2e-5 and a batch size of 4. In inference, the T5-NLG models are applied on target datasets with greedy search. Additionally, we consider CTSyncSup as our ablation baseline, which directly trains BERT rankers on contrastive synthetic supervision data without meta-reweighting.
Meta Learning to Reweight. The training folds of the target dataset are used as target data to guide the meta-reweighting to synthetic data. We set the batch size of synthetic data (n) and target data (m) to 8. The second-order gradient of the target ranking loss with regard to the initial weight (Eq. 10) is implemented using the automatic differentiation in PyTorch (Paszke et al., 2017).

Evaluation Results
In this section, we present the evaluation results of MetaAdaptRank and conduct a series of analyses and case studies to study its effectiveness.

Overall Accuracy
The ranking results of MetaAdaptRank and the baselines are presented in Table 2. On all benchmarks and metrics, MetaAdaptRank stably outperforms all baselines. It outperforms the best feature-based LeToR method, Coor-Ascent, by more than 15%. MetaAdaptRank even outperforms the strong Neu-IR baselines supervised with Bing User Click and MS MARCO, which demonstrates its effectiveness.
Specifically, CTSyncSup alone improves the few-shot ranking accuracy of BERT rankers by 3% on all benchmarks. In comparison to the other weak supervision sources (filtered title-document relations, Anchor, and SyncSup), CTSyncSup shows more stable effectiveness across different benchmarks, revealing its domain-adaptation advantages. Moreover, meta-reweighting CTSyncSup brings further improvement and helps MetaAdaptRank outperform the latest selective Neu-IR method, ReInfoSelect.
Next, we go ahead to analyze MetaAdaptRank's contrastive synthesis and meta-reweighting.

Effectiveness of Contrastive Synthesis
We analyze the effectiveness of contrastive synthesis through its impact on ranking results and on the quality of synthetic queries. Table 3 presents the ranking accuracy based on our CTSyncSup and four other supervision sources. CTSyncSup stably outperforms Anchor and SyncSup across all datasets. On Robust04, CTSyncSup even shows better performance than MS MARCO human labels. Besides, combining the sources of MS MARCO and CTSyncSup can further improve the ranking accuracy on ClueWeb09-B and TREC-COVID, revealing that CTSyncSup provides useful supervision signals applicable to various domains.
We further evaluate the quality of the queries generated by SyncSup and our CTSyncSup, which are both synthetic methods for generating queries based on target documents. Following previous research (Ma et al., 2021; Yu et al., 2020; Celikyilmaz et al., 2020), eight automatic evaluation metrics are used in our evaluation. As shown in Table 4, CTSyncSup wins on all metrics, indicating that contrastive training helps the NLG model better approximate the golden queries. In addition, reversing the encoding order of the contrastive document pair causes a dramatic decrease in all evaluation scores of the generated queries. This further shows that our contrastive query generator can extract more specific and representative information from the positive documents, thereby generating more discriminative queries.

Figure 2: The state of learned weights on CTSyncSup data from ReInfoSelect and MetaAdaptRank. We use a ClueWeb09 few-shot fold as target data. Training steps are marked on the X-axes. The mean and 95% Confidence Interval (CI) of data weights in the same batch are plotted. A 95% CI is an interval that will contain the true mean of weights with 95% probability; its width is proportional to the standard deviation of data weights.

Effectiveness of Meta Reweighting
To analyze the effectiveness of meta reweighting, we apply MetaAdaptRank to different supervision sources and study its data weighting behaviors during the learning process. The reinforcement-based data selector ReInfoSelect, which uses a trial-and-error weighting mechanism, serves as the comparison. The ranking accuracy of MetaAdaptRank and ReInfoSelect trained with MS MARCO, Anchor, and CTSyncSup is presented in Table 5. For all supervision sources, MetaAdaptRank outperforms ReInfoSelect on all benchmarks. The results show that the meta-reweighting mechanism can more effectively explore the potential of different supervision sources than the trial-and-error weighting mechanism. Moreover, the advantages of meta reweighting extend to the hybrid supervision source of MS MARCO and CTSyncSup.
To further understand the behaviors of meta reweighting, we compare the state of the weights assigned to synthetic supervision by MetaAdaptRank and ReInfoSelect during the learning process, using CTSyncSup as synthetic data and ClueWeb09 as target data. The results are shown in Figure 2. Even though each synthetic batch is likely to include both useful and noisy data points, ReInfoSelect always assigns very high weights at the beginning and discards almost all synthetic data points later. Besides, its tight confidence interval reveals that data points in the same batch receive almost identical weights.

Table 7: Cases of meta-reweighted contrastive synthetic data targeting ClueWeb09. The weights are marked in parentheses with ↑ (more important) and ↓ (down-weighted). The red texts are specific contents of the positive documents and the blue texts are shared by both positive and negative documents. The document snippets are manually selected.
These observations indicate that ReInfoSelect does not effectively distinguish useful synthetic data points from noisy ones during the learning process. By contrast, MetaAdaptRank assigns higher weights initially and steadily reduces them as training goes on. More importantly, its wide confidence interval reveals that the data weights within the same synthetic batch vary significantly, and are thus expected to be more diverse and fine-grained.

Effectiveness of Hybrid Supervision
We also analyze MetaAdaptRank's advantages on the hybrid supervision source of MS MARCO and CTSyncSup. The impact of the hybrid source on its ranking accuracy and meta-reweighting behavior is studied. Besides, we evaluate MetaAdaptRank trained with the hybrid source in Round 5 of the TREC-COVID shared task in which many strong baselines have been well-tuned for four rounds. Figure 3a shows the Win/Tie ranking accuracy of MetaAdaptRank trained with MS MARCO and the hybrid supervision source. Compared to the single MS MARCO, the hybrid source has more advantages across all benchmarks. Besides, the hybrid advantage seems to be more evident in non-web domain benchmarks, especially on TREC-COVID.
We further investigate the weighting behavior of MetaAdaptRank on MS MARCO and the hybrid source, using the same ClueWeb09 target data in previous analyses. Figure 3b illustrates the changes in meta-learned weights of randomly sampled 2k MS MARCO data points before and after merging CTSyncSup source. There are significant weight variations on most MS MARCO data points before and after merging CTSyncSup. Additionally, merging CTSyncSup reduces the weight of more MS MARCO data points, revealing that CTSyncSup data are assigned higher weights. This also reveals that MetaAdaptRank can tailor diversified weights for the same data points in different sources and up-weights more useful training data flexibly.
Lastly, we report the TREC-COVID R5 ranking results of MetaAdaptRank trained with the hybrid source. The top 2 automatic search systems on the R5 leaderboard are compared; they outperform other systems on the newly added queries in R5. The evaluation on these new queries is fair to both our method and the systems that underwent the previous rounds (R1-R4). As shown in Table 6, our single model outperforms the top 2 fusion-based systems on the evaluation of the new, old, and all queries, further showing the effectiveness of MetaAdaptRank with the hybrid supervision source. More details and ranking results are shown in Appendix A.2.

Case Studies

Table 7 exhibits some cases of contrastive synthetic data for ClueWeb09 and their meta-learned weights; more cases are shown in Appendix A.3. CTSyncSup can extract more specific contents from the positive documents, e.g., "shopping with the planet" and "make a big difference" in the first case, whereas SyncSup captures more general information, e.g., "green energy". Compared to SyncSup's queries such as "where is jamestown beach" in the second case, the synthetic queries in CTSyncSup are more informative and discriminative. Notably, the second case also exhibits synthetic noise: the positive document is actually related to "bermuda's tourism" rather than the query "history of bermuda". MetaAdaptRank effectively filters this noisy instance by assigning it a zero weight.

Conclusion
This paper presents MetaAdaptRank, a domain adaptation method for few-shot Neu-IR with contrastive weak data synthesis and meta-reweighted data selection. Contrastive synthesis generates informative queries and useful synthetic supervision signals. Meta-learned weights form high-resolution channels between target labels and synthetic signals, providing robust and fine-grained data selection for synthetic weak supervision. The two components collaborate to significantly improve neural ranking accuracy in various few-shot search scenarios.

A.1 Batch Normalization of Meta-Weights
This part elaborates the batch normalization process for the meta-learned weights. Following prior research (Ren et al., 2018), we first set the initial weights w to zeros and obtain the new weights w̃ through the meta-backward update (Eq. 10). Then we clip w̃ to get non-negative weights ŵ:

ŵ_j = max(w̃_j, 0),    (12)

and further normalize them within the batch to obtain the final weights w*:

w*_j = ŵ_j / (Σ_{p=1}^n ŵ_p + δ(Σ_{p=1}^n ŵ_p)),    (13)

where δ(Σ_{p=1}^n ŵ_p) = 1 when Σ_{p=1}^n ŵ_p = 0, to prevent division errors, and 0 otherwise. With this batch-normalization process, the hyperparameter η can be effectively eliminated. The normalization method is not constrained, and other approaches can also be used (Shu et al., 2019; Hu et al., 2019).
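The clip-and-normalize step of Eqs. 12-13 is a few lines in practice; this sketch assumes the weights arrive as a single batch array.

```python
import numpy as np

def normalize_weights(w_tilde: np.ndarray) -> np.ndarray:
    """Eqs. 12-13: clip meta-learned weights to be non-negative, then
    normalize them within the batch; if every clipped weight is zero,
    return zeros instead of dividing by zero (the delta term)."""
    w_hat = np.clip(w_tilde, 0.0, None)   # Eq. 12
    total = w_hat.sum()
    return w_hat / total if total > 0 else w_hat  # Eq. 13

w = normalize_weights(np.array([0.4, -0.4, 0.6]))
assert abs(w.sum() - 1.0) < 1e-9 and w[1] == 0.0
# All-negative batch: every weight is clipped, so the output is all zeros.
z = normalize_weights(np.array([-1.0, -2.0]))
assert z.sum() == 0.0
```

Because the output is scale-invariant in the batch total, the meta learning rate η cancels out, which is why the text says it can be eliminated.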

A.2 Supplementary Results of TREC-COVID R5
This part supplements our evaluation results in the TREC-COVID R5 shared task. We will first recap the shared task and then present more evaluation results and our implementation details. TREC-COVID R5. The TREC-COVID Challenge is an ad-hoc ranking task for COVID-19 literature, consisting of five rounds. TREC-COVID R5 is the last round of this challenge, where the document set is the July 16, 2020 version of CORD-19, and the query set contains 50 testing queries. The first 45 queries have been used in previous rounds (R1-R4), and the last five queries are newly added in R5. As in previous rounds, TREC-COVID R5 adopts residual collection evaluation (Salton and Buckley, 1997). In residual collection evaluation, the relevance labels from previous rounds can be used, but any document that has been annotated for a query will be removed before the evaluation. We focus more on the evaluation of R5's new queries because these queries have no prior relevance labels, which is fairer to our models and those search systems that underwent previous rounds.
Evaluation Results. Table 8 shows the evaluation results on the new queries of TREC-COVID R5, including three variants of our MetaAdaptRank and the top 10 feedback systems in the R5 leaderboard. Compared with the top 10 feedback systems (many are fusion-based systems), our single model MetaAdaptRank (rerank fusion.2) outperforms all baselines, demonstrating the generalization ability of our method on new queries.
Additionally, we note that the best and worst of the top 10 feedback systems differ by only 5.1% in NDCG@20 on all queries, while their NDCG@20 scores on the new queries differ by 13.4%. This discrepancy indicates that the residual collection evaluation may be biased between seen and unseen queries.
Implementation Details. We next describe the implementation details of the three variants of our MetaAdaptRank in TREC-COVID R5. Consistent with the implementation described in Section 4, we rerank the top 100 documents from the first-stage retrieval. We first borrow two retrieval results with different settings provided by Anserini BM25 (Rows 7 and 8 of Anserini's Round 5 table). Then PubMedBERT (Base) is used to rerank these two retrieval results to obtain MetaAdaptRank (rerank fusion.1) and MetaAdaptRank (rerank fusion.2), respectively. MetaAdaptRank (RRF) is the reciprocal rank fusion (RRF) of these two models. We utilize the open-source library trec-tools (Palotti et al., 2019) to implement RRF and set the fusion weight k to 1.
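The RRF combination can be sketched directly from its formula; this is a self-contained illustration with the paper's k = 1, not the trec-tools API (the toy run contents are made up).

```python
from collections import defaultdict

def reciprocal_rank_fusion(runs: list[list[str]], k: int = 1) -> list[str]:
    """Reciprocal rank fusion: score(d) = sum over runs of 1 / (k + rank_d),
    with 1-based ranks; documents are returned by descending fused score."""
    scores: dict[str, float] = defaultdict(float)
    for run in runs:
        for rank, doc_id in enumerate(run, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Toy example: "a" is ranked first in both runs, so it fuses to the top.
fused = reciprocal_rank_fusion([["a", "b", "c"], ["a", "c", "b"]])
assert fused[0] == "a"
```

A small k (such as the paper's k = 1) sharpens the reward for top ranks, while larger values flatten the contribution across the list.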
To train MetaAdaptRank, we first synthesize CTSyncSup data based on R5's document set and leverage the hybrid source of CTSyncSup and MS MARCO as the additional supervision signals. The training process contains two stages. We first train MetaAdaptRank with the hybrid source and regard the labeled data from the previous rounds (R1-R4) as target data in meta-reweighting. Then we continuously train MetaAdaptRank using the labeled data from the previous rounds. In both training stages, we utilize the Adam optimizer with a learning rate of 2e-5. Both the batch size and the accumulation steps are set to 8. In addition, to ensure a fair comparison with the submitted search systems, we post-process our results according to the official guidelines.

Table 9: The contrastive synthetic data reweighted by MetaAdaptRank, where the top 3 cases are from Robust04 (News) and the last 3 cases come from TREC-COVID (BioMed). Their meta-weights are marked in parentheses with ↑ (more important) and ↓ (down-weighted). The red texts are the specific contents of the positive documents, and the blue texts are mentioned in both positive and negative documents. The document snippets are manually selected.

A.3 Supplementary Case Studies

Table 9 shows more cases from the other two datasets, Robust04 (News) and TREC-COVID (BioMed), to verify the effectiveness of MetaAdaptRank in different domains. The first three cases are from Robust04, and the rest come from TREC-COVID.
For the first synthetic case, our CTSyncSup extracts characteristic keywords, e.g., "radars" and "colombia", from the positive documents to generate more informative queries, while SyncSup tends to capture general keywords and create broad queries, which may lack the ability to distinguish between different documents. Besides, CTSyncSup can extract necessary themes from specific documents, such as the particular time "1993" and the adjective "increased", as shown in the second case.
Moreover, cases 4 and 5 show the effectiveness of our contrastive synthesis for biomedical domains. CTSyncSup can capture "lung" and "quarantine prevent" instead of general keywords, such as "sars cov" and "symptoms" often mentioned in COVIDrelated documents. These observations show that CTSyncSup can extract more specific information to generate more informative and discriminative queries for different target domains.
We further explore the synthetic instances that are assigned zero weights by MetaAdaptRank, such as the third and sixth cases. In the third case, although CTSyncSup captures the two keywords "language" and "osvaldo rodrigrez" from the positive document, its synthetic query is actually less relevant to the main topic of the positive document. In the sixth case, CTSyncSup fails to exclude the phrase "covid-19 pandemic", which relates to both the positive and negative documents and leaves the synthetic query unable to distinguish between them. Fortunately, MetaAdaptRank can effectively identify synthetic instances whose relevance matching patterns between synthetic queries and positive documents are unclear or non-unique, and precludes such misleading synthetic supervision by assigning them zero weights.