Noisy Pair Corrector for Dense Retrieval

Most dense retrieval models contain an implicit assumption: the training query-document pairs are exactly matched. Since it is expensive to annotate the corpus manually, training pairs in real-world applications are usually collected automatically, which inevitably introduces mismatched-pair noise. In this paper, we explore an interesting and challenging problem in dense retrieval: how to train an effective model with mismatched-pair noise. To solve this problem, we propose a novel approach called Noisy Pair Corrector (NPC), which consists of a detection module and a correction module. The detection module estimates noisy pairs by calculating the perplexity between annotated positive and easy negative documents. The correction module utilizes an exponential moving average (EMA) model to provide a soft supervised signal, aiding in mitigating the effects of noise. We conduct experiments on the text-retrieval benchmarks Natural Questions and TriviaQA, and the code-search benchmarks StaQC and SO-DS. Experimental results show that NPC achieves excellent performance in handling both synthetic and realistic noise.


Introduction
With the advancement of pre-trained language models (Devlin et al., 2019; Liu et al., 2019), dense retrieval has developed rapidly in recent years. It is essential to many applications, including search engines (Brickley et al., 2019), open-domain question answering (Karpukhin et al., 2020a; Zhang et al., 2021), and code intelligence (Guo et al., 2021). A typical dense retrieval model maps both queries and documents into a low-dimensional vector space and measures their relevance by the similarity between their representations (Shen et al., 2014). During training, the model uses query-document pairs as labelled data (Xiong et al., 2021) and samples negative documents for each pair. The model then learns to minimize a contrastive loss to obtain good representation ability (Zhang et al., 2022b; Qu et al., 2021).
Recent studies on dense retrieval have achieved promising results with hard negative mining (Xiong et al., 2021), pre-training (Gao and Callan, 2021a), distillation (Yang and Seo, 2020), and adversarial training (Zhang et al., 2022a). All of these methods contain an implicit assumption: each query is precisely aligned with its positive documents in the training set. In practical applications, this assumption is hard to satisfy, particularly when the corpus is automatically collected from the internet. In such scenarios, the training data inevitably contains mismatched pairs, such as user mis-click noise in search engines or low-quality-reply noise in Q&A communities. Fig. 1 shows examples from the StaQC benchmark (Yao et al., 2018), which is automatically collected from Stack Overflow. The document, i.e., the code solution, cannot answer the query but is incorrectly annotated as a positive document. Such noisy pairs are widely present in automatically constructed datasets and ultimately impact the performance of dense retrievers.
To train robust dense retrievers, previous works have explored addressing various types of noise. For example, RocketQA (Qu et al., 2021) and AR2 (Zhang et al., 2022a) mitigate false-negative noise with a cross-encoder filter and distillation, respectively; coCondenser (Gao and Callan, 2021b) reduces noise during fine-tuning with a pre-training technique; RoDR (Chen et al., 2022) deals with query spelling noise via local ranking alignment. However, mismatched-pair noise (the false-positive problem) in dense retrieval has not been well studied. As shown in Fig. 2, mismatched-pair noise misleads the retriever to update in the opposite direction.
Based on these observations, we propose the Noisy Pair Corrector (NPC) framework to address the false-positive problem. NPC consists of a noise detection module and a noise correction module. At each epoch, the detection module estimates whether a query-document pair is mismatched via the perplexity between the annotated document and easy negative documents. The correction module then provides a soft supervised signal for both the estimated noisy data and the clean data via an exponential moving average (EMA) model. Both modules are plug-and-play, so NPC is a general training paradigm that can be easily applied to almost all retrieval models.
The contributions of this paper are as follows: (1) We reveal and extensively explore a long-neglected problem in dense retrieval, i.e., mismatched-pair noise, which is ubiquitous in the real world. (2) We propose a simple yet effective method for training dense retrievers with mismatched-pair noise.
(3) Extensive experiments on four datasets and comprehensive analyses verify the effectiveness of our method against synthetic and realistic noise.

Preliminary
Before describing our model in detail, we first introduce the basic elements of dense retrieval, including problem definition, model architecture, and model training.
Given a query q and a document collection D, dense retrieval aims to find the documents d^+ relevant to q in D. The training set consists of a collection of query-document pairs, denoted as C = {(q_i, d_i^+)}_{i=1}^N, where N is the data size. Typical dense retrieval models adopt a dual-encoder architecture to map queries and documents into a dense representation space. The relevance score f(q, d) of a query q and a document d is then calculated from their dense representations:

f(q, d) = sim(E(q; θ), E(d; θ)),    (1)

where E(·; θ) denotes the encoder module parameterized by θ, and sim is a similarity function, e.g., Euclidean distance, cosine distance, or inner product. Existing methods generally leverage approximate nearest neighbor (ANN) search (Johnson et al., 2019) for efficiency.
For training dense retrievers, conventional approaches leverage contrastive learning (Karpukhin et al., 2020a; Zhang et al., 2022b). Given a training pair (q_i, d_i^+) ∈ C, these methods sample m negative documents {d_{i,1}^-, ..., d_{i,m}^-} from the large document collection D. The retriever's objective is to minimize a contrastive loss, pushing the similarity of positive pairs above that of negative pairs. Previous work (Xiong et al., 2021) has verified the importance of the negative sampling strategy; two commonly employed strategies are "In-Batch Negative" and "Hard Negative" (Karpukhin et al., 2020a; Qu et al., 2021).
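As a concrete illustration, the dual-encoder scoring and "In-Batch Negative" contrastive loss described above can be sketched in numpy. The encoder E(·; θ) is abstracted away here: precomputed embeddings stand in for its outputs, and the toy data is purely illustrative.

```python
import numpy as np

def inner_product_scores(q_emb, d_emb):
    """Relevance f(q, d) = sim(E(q; th), E(d; th)); here sim is the inner product."""
    return q_emb @ d_emb.T

def in_batch_contrastive_loss(scores, tau=1.0):
    """In-batch contrastive loss: row i's positive is column i;
    the other documents in the batch serve as negatives."""
    logits = scores / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))  # mean of -log p(d_i^+ | q_i)

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))              # query embeddings (batch of 4)
d = q + 0.01 * rng.normal(size=(4, 8))   # near-duplicate positive documents
loss = in_batch_contrastive_loss(inner_product_scores(q, d))
```

Because each positive is nearly identical to its query embedding, the loss is close to zero; with random pairings it would approach log of the batch size.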
The above training paradigm assumes that each query-document pair (q_i, d_i^+) in the training set C is correctly aligned. However, this assumption is difficult to satisfy in real-world applications (Qu et al., 2021; Li et al., 2022; Wang et al., 2022). In practice, most training pairs are collected automatically without manual inspection, inevitably leading to the inclusion of some mismatched pairs.

Method
We propose the NPC framework to learn retrievers under mismatched-pair noise. As shown in Fig. 3, NPC consists of two parts: (a) the noise detection module described in Sec. 3.1, and (b) the noise correction module described in Sec. 3.2.

Noise Detection
The noise detection module is meant to detect mismatched pairs in the training set. We hypothesize that dense retrievers first learn to distinguish correctly matched pairs from easy negatives, and only then gradually overfit the mismatched pairs. Therefore, we determine whether a training pair is mismatched by the perplexity between the annotated document and easy negative documents.
Specifically, given a retriever θ equipped with preliminary retrieval capabilities and an uncertain pair (q_i, d_i), we calculate the perplexity as follows:

PPL(q_i, d_i, θ) = -log [ exp(f(q_i, d_i)/τ) / ( exp(f(q_i, d_i)/τ) + Σ_j exp(f(q_i, d_{i,j}^-)/τ) ) ],    (2)

where τ is a temperature hyper-parameter and d_{i,j}^- is a negative document randomly sampled from the document collection D. Note that d_{i,j}^- is a randomly selected negative document, not a hard negative; we discuss this further in Appendix C. In practice, we adopt the "In-Batch Negative" strategy for efficiency.
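A minimal numpy sketch of this in-batch perplexity computation follows. Embeddings again stand in for the encoder, the exact functional form of Eq. 2 is a reconstruction, and the "flipped" document simulating a mismatched pair is a toy assumption.

```python
import numpy as np

def pair_perplexity(q_emb, d_emb, tau=0.05):
    """Per-pair perplexity in the spirit of Eq. 2: the negative
    log-likelihood of the annotated document against the other
    in-batch documents, which act as random easy negatives."""
    logits = (q_emb @ d_emb.T) / tau           # (B, B) similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.diag(log_probs)                 # one PPL value per pair

rng = np.random.default_rng(1)
q = rng.normal(size=(6, 32))
d = q.copy()
d[0] = -q[0]          # anti-align one document to simulate a mismatched pair
ppl = pair_perplexity(q, d)
```

The mismatched pair receives a far larger PPL than the matched pairs, which is exactly the signal the detection module thresholds.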
After obtaining the perplexity of each pair, an automated method is needed to differentiate the noise from the clean data. We note a bimodal effect between the distribution of clean samples and the distribution of noisy samples; an example can be seen in Figure 4(b). Motivated by this, we fit the perplexity distribution over all training pairs with a two-component Gaussian Mixture Model (GMM):

p(PPL) = Σ_{k=1}^{2} π_k φ(PPL | k),    (3)

where π_k and φ(PPL | k) are the mixture coefficient and the probability density of the k-th component, respectively. We optimize the GMM with the Expectation-Maximization algorithm (Dempster et al., 1977). Based on the above hypothesis, we treat training pairs with higher PPL as noise and those with lower PPL as clean data. The estimated clean flag is then calculated as follows:

ŷ_i = 1[ p(κ | PPL(q_i, d_i, θ)) > λ ],    (4)

where ŷ_i ∈ {1, 0} denotes whether we estimate the pair (q_i, d_i) to be correctly matched, κ is the GMM component with the lower mean, and λ is a threshold. p(κ | PPL(q_i, d_i, θ)) is the posterior probability over the component κ, which can be intuitively understood as the confidence that the pair is correctly annotated. We set λ to 0.5 in all experiments. Note that before noise detection, the retriever should be equipped with preliminary retrieval capabilities. This can be achieved by initializing it with a strong unsupervised retriever or by pre-training it on the entire noisy dataset.
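The GMM fitting and thresholding step can be sketched as follows. This is a hand-rolled 1-D two-component EM purely for self-containment (a library GMM such as scikit-learn's GaussianMixture would serve equally well); the bimodal toy PPL values are an assumption of the sketch.

```python
import numpy as np

def fit_gmm_1d(x, iters=100):
    """Minimal 1-D two-component GMM fit by Expectation-Maximization."""
    mu = np.array([x.min(), x.max()], dtype=float)   # spread-out init
    var = np.full(2, x.var() + 1e-6)
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: posterior responsibility of each component per point
        ll = -0.5 * (x[:, None] - mu) ** 2 / var - 0.5 * np.log(2 * np.pi * var)
        post = pi * np.exp(ll)
        post /= post.sum(axis=1, keepdims=True)
        # M-step: re-estimate means, variances, mixture weights
        nk = post.sum(axis=0) + 1e-12
        mu = (post * x[:, None]).sum(axis=0) / nk
        var = (post * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
        pi = nk / len(x)
    return mu, var, pi, post

def clean_flags(ppl, lam=0.5):
    """Eq. 4: y_hat_i = 1 iff the posterior of the lower-mean ('clean')
    component exceeds the threshold lambda = 0.5."""
    mu, var, pi, post = fit_gmm_1d(np.asarray(ppl, dtype=float))
    kappa = int(np.argmin(mu))          # component with the lower mean
    return (post[:, kappa] > lam).astype(int)

# bimodal toy PPLs: low mode = clean pairs, high mode = noisy pairs
rng = np.random.default_rng(2)
ppl = np.concatenate([rng.normal(0.1, 0.05, 80), rng.normal(3.0, 0.5, 20)])
flags = clean_flags(ppl)
```

With well-separated modes, the posterior cleanly splits the 80 clean pairs (flag 1) from the 20 noisy pairs (flag 0).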

Noise Correction
Next, we introduce how to reduce the impact of noisy pairs after obtaining the estimated flag set {ŷ_i}_{i=1}^N. One quick fix is to discard the noisy data directly, but this is sub-optimal since it wastes the queries in noisy pairs. In this work, we adopt a self-ensemble teacher to provide rectified soft labels for noisy pairs. The teacher is an exponential moving average (EMA) of the retriever, and the retriever is trained with a weight-averaged consistency target on noisy data.
Specifically, given a retriever θ, the teacher θ* is updated with an exponential moving average strategy:

θ* ← α θ* + (1 − α) θ,    (5)

where α is a momentum coefficient. Only the parameters θ are updated by back-propagation. For a query q_i and its candidate document set D_{q_i} = {d_{i,j}}_{j=1}^m, which may consist of annotated documents, hard negatives, and in-batch negatives, we first obtain the teacher's and the retriever's similarity scores. The retriever θ is then expected to stay consistent with its smooth teacher θ*. To achieve this, we update the retriever θ by minimizing the KL divergence between the student's distribution and the teacher's distribution.
To be concrete, the similarity scores between q_i and D_{q_i} are normalized into distributions:

p_θ(d_{i,j} | q_i; D_{q_i}) = exp(f_θ(q_i, d_{i,j})/τ) / Σ_{j'} exp(f_θ(q_i, d_{i,j'})/τ),    (6)

and analogously for the teacher p_{θ*}. The consistency loss L_cons can then be written as:

L_cons = KL( p_{θ*}(· | q_i; D_{q_i}) || p_θ(· | q_i; D_{q_i}) ),    (7)

where KL(·||·) is the KL divergence, and p_θ(· | q_i; D_{q_i}) and p_{θ*}(· | q_i; D_{q_i}) denote the conditional probabilities over the candidate documents D_{q_i} under the retriever θ and the teacher θ*, respectively.
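A numpy sketch of the EMA update and the consistency loss follows. The teacher-to-student direction of the KL term is the common mean-teacher convention and an assumption of this sketch, and the toy score matrices are illustrative only.

```python
import numpy as np

def ema_update(theta, theta_star, alpha=0.999):
    """Teacher update in the spirit of Eq. 5: th* <- a*th* + (1-a)*th.
    Only the student parameters th receive gradients."""
    return {k: alpha * theta_star[k] + (1 - alpha) * theta[k] for k in theta}

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def consistency_loss(student_scores, teacher_scores):
    """L_cons: KL divergence from the teacher's distribution over
    candidate documents to the student's, averaged over queries."""
    p_t = softmax(teacher_scores)   # rectified soft labels
    p_s = softmax(student_scores)
    return float(np.mean((p_t * (np.log(p_t) - np.log(p_s))).sum(axis=-1)))

# toy check: identical distributions give zero loss; EMA blends parameters
rng = np.random.default_rng(3)
s = rng.normal(size=(4, 8))                       # student similarity scores
loss_same = consistency_loss(s, s)
loss_diff = consistency_loss(s, s + rng.normal(size=(4, 8)))
theta_star = ema_update({"w": np.ones(3)}, {"w": np.zeros(3)}, alpha=0.9)
```

Since the KL divergence is non-negative and zero only for identical distributions, the retriever is pulled toward the teacher's smoother predictions rather than the hard (possibly wrong) labels.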
For an estimated noisy pair, the teacher corrects the supervised signal into a soft label. For an estimated clean pair, we calculate both the contrastive loss and the consistency loss. The overall loss is thus formalized as:

L = ŷ_i · L_cont + L_cons,    (8)

where ŷ_i ∈ {1, 0} is estimated by the noise detection module.

Algorithm 1 (training loop; warmup omitted):
  for each epoch:
    Calculate the PPL of training pairs with random negatives using Eq. 2;
    Fit the PPL distribution with a GMM;
    Get the estimated flag set {ŷ_i} using Eq. 4;
    for i = 1 : num_batch do
      Sample negatives with the "In-Batch Negative" or "Hard Negative" strategy;
      Calculate rectified soft labels with the EMA model θ*;
      Train θ by optimizing Eq. 8;
      Update the EMA model θ* using Eq. 5;
    end for
  end for
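The gating behaviour of the overall loss can be sketched in a few lines; the per-pair loss values below are illustrative placeholders, and the exact weighting of the consistency term is assumed to be uniform as the text describes.

```python
import numpy as np

def overall_loss(l_cont, l_cons, y_hat):
    """Overall loss as described in the text: estimated-clean pairs
    (y_hat = 1) get both the contrastive and consistency terms, while
    estimated-noisy pairs (y_hat = 0) keep only the teacher's
    consistency signal."""
    y = np.asarray(y_hat, dtype=float)
    return float(np.mean(y * np.asarray(l_cont) + np.asarray(l_cons)))

# pair 0 is estimated clean, pair 1 estimated noisy: its (likely wrong)
# contrastive term is masked out, leaving only the soft consistency term
loss = overall_loss(l_cont=[2.0, 3.0], l_cons=[0.5, 0.5], y_hat=[1, 0])
```

This masking is what prevents the gradient of a mismatched pair from pushing the retriever in the wrong direction.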

Overall Procedure
NPC is a general training framework that can be easily applied to most retrieval methods. Under the classical training process of dense retrieval, we first warm up the retriever with the typical contrastive learning method to give it basic retrieval abilities, then run the noise detection module before each training epoch and the noise correction module during training. The detail is presented in Algorithm 1.
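The control flow of Algorithm 1 can be sketched as a skeleton; `detect_fn` and `step_fn` are hypothetical hooks standing in for the PPL/GMM detection steps and the batch-level optimization of Eq. 8 with the EMA teacher.

```python
import numpy as np

def train_npc(num_epochs, num_batches, detect_fn, step_fn):
    """Skeleton of the NPC loop (warmup omitted): re-detect noise at the
    start of every epoch, then train batch by batch with the refreshed
    flag set. detect_fn returns the estimated flags {y_hat_i}; step_fn
    performs one correction/optimization step and returns a record."""
    history = []
    for epoch in range(num_epochs):
        y_hat = detect_fn()                             # PPL -> GMM -> flags
        for b in range(num_batches):
            history.append(step_fn(epoch, b, y_hat))    # train + EMA update
    return history

# toy hooks just to exercise the control flow
hist = train_npc(
    num_epochs=2, num_batches=3,
    detect_fn=lambda: np.ones(5, dtype=int),
    step_fn=lambda e, b, y: (e, b, int(y.sum())),
)
```

Because detection runs inside the epoch loop, the flag set is refreshed as the retriever improves, which is exactly the iterative-detection behaviour ablated in Table 5.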
Datasets

StaQC is a large dataset of real query-code pairs collected from Stack Overflow * . It has been widely used for code summarization (Peddamail et al., 2018) and code search (Heyman and Van Cutsem, 2020). SO-DS mines query-code pairs from the most upvoted Stack Overflow posts and mainly focuses on the data science domain. Following previous works (Heyman and Van Cutsem, 2020; Li et al., 2022), we adopt Recall at top-k (R@k) and Mean Reciprocal Rank (MRR) as evaluation metrics. StaQC and SO-DS are constructed automatically without human annotation; consequently, their training data contains numerous mismatched pairs.
Natural Questions (NQ) collects real queries from the Google search engine; each question is paired with an answer span and golden passages from Wikipedia. Trivia QA (TQ) is a reading comprehension dataset authored by trivia enthusiasts. During the retrieval stage for both datasets, the objective is to identify positive passages from a large collection. Positive pairs in these datasets are determined by a strict rule, i.e., whether a passage contains the answer (Karpukhin et al., 2020a); consequently, we consider these datasets to be of high quality. We therefore leverage them for simulation experiments to quantitatively analyze the impact of varying noise proportions. Drawing on the setup of the noisy-label classification task (Han et al., 2018), we simulate mismatched-pair noise by randomly pairing queries with unrelated documents.
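The noise-simulation protocol described above (randomly re-pairing a fraction of queries with unrelated documents) can be sketched as follows; the roll-based re-pairing is one possible implementation choice that guarantees every selected pair actually changes.

```python
import numpy as np

def inject_mismatch_noise(doc_ids, noise_ratio, seed=0):
    """Simulate mismatched-pair noise: pick a noise_ratio fraction of
    query indices and rotate their documents among themselves, so each
    chosen query is re-paired with a different (unrelated) document."""
    rng = np.random.default_rng(seed)
    doc_ids = list(doc_ids)
    n = len(doc_ids)
    idx = rng.choice(n, size=int(noise_ratio * n), replace=False)
    shifted = np.roll(idx, 1)           # each chosen slot gets another's doc
    noisy = doc_ids.copy()
    for i, j in zip(idx, shifted):
        noisy[i] = doc_ids[j]
    return noisy, set(idx.tolist())

# 20% of 100 query-document pairs become mismatched
noisy_docs, noisy_set = inject_mismatch_noise(list(range(100)), 0.2)
```

Keeping the index set of corrupted pairs makes it possible to evaluate detection accuracy against ground truth in simulation.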

Implementation Details
NPC is a general training paradigm that can be directly applied to almost all retrieval models. For StaQC and SO-DS, we adopt UniXcoder (Guo et al., 2022) as our backbone, which is the SoTA model for code representation. Following Guo et al. (2022), we adopt cosine distance as the similarity function and set the temperature τ to 20. We update model parameters with the Adam optimizer and perform early stopping on the development set. The learning rate, batch size, warmup epochs, and training epochs are set to 2e-5, 256, 5, and 10, respectively. In the "Hard Negative" setting, we adopt the same strategy as Li et al. (2022).
For NQ and TQ, we adopt BERT (Devlin et al., 2019) as our initial model. Following Karpukhin et al. (2020a), we adopt the inner product as the similarity function and set the temperature τ to 1. The maximum sequence length is 16 for queries and 128 for passages. The learning rate, batch size, warmup epochs, and training epochs are set to 2e-5, 512, 10, and 40, respectively. We adopt the "BM25 Negative" and "Hard Negative" strategies as described in the DPR toolkit † . For a fair comparison, we implement DPR (Karpukhin et al., 2020a) with the same hyperparameters. All experiments are run on 8 NVIDIA Tesla A100 GPUs. The implementation of NPC is based on Huggingface (Wolf et al., 2020).

Results
Results on StaQC and SO-DS: Table 1 shows the results on the realistic-noisy datasets StaQC and SO-DS. Both datasets contain a large number of real noisy pairs. The first block shows the results of previous SoTA methods. BM25_desc is a traditional sparse retriever based on exact term matching between queries and code descriptions. NBOW is an unsupervised retriever that leverages pre-trained word embeddings of queries and code descriptions. USE is a simple Transformer-based dense retriever. CodeBERT and GraphCodeBERT are models pre-trained for code understanding on large-scale code corpora. CodeRetriever is a pre-trained model dedicated to code retrieval, pre-trained with unimodal and bimodal contrastive learning on a large-scale corpus. UniXcoder is also a pre-trained model that utilizes multi-modal data, including code, comments, and ASTs, for better code representation; we reproduce its results ourselves for a fair comparison with NPC. The bottom block shows the results of NPC under the two negative sampling strategies.
From the results, we can see that our proposed NPC consistently performs better than the evaluated models across all metrics. Compared with the strong baseline UniXcoder, which ignores the mismatched-pair problem, NPC achieves a significant improvement with both the "In-Batch Negative" and "Hard Negative" sampling strategies. This indicates that mismatched-pair noise greatly limits the performance of dense retrieval models, and that NPC can mitigate this negative effect. We also show some noisy examples detected by NPC in Appendix A.
Results on NQ and TQ: Table 2 shows the results on the synthetic-noisy datasets NQ and TQ under noise ratios of 20% and 50%. We compare NPC with BM25 (Yang et al., 2017) and DPR (Karpukhin et al., 2020a). BM25 is an unsupervised sparse retriever that is not affected by noisy data. DPR is a widely used method for training dense retrievers. coCondenser (Gao and Callan, 2021b) leverages pre-training to enhance the model's robustness. RocketQA (Qu et al., 2021) adopts a cross-encoder to filter false negatives in the "Hard Negative" strategy. Co-teaching (Han et al., 2018) iteratively trains two networks on each other's small-loss samples and is widely used in noisy-label classification. We implement all baselines with both negative sampling strategies. In addition, we evaluate DPR on clean datasets obtained by discarding the synthetic noisy pairs, denoted DPR-C. DPR-C is a strong baseline that is not affected by mismatched pairs. We observe that: (1) as the noise ratio increases, DPR, coCondenser, and RocketQA suffer a significant drop in performance; at a noise rate of 50%, they perform worse than unsupervised BM25. (2) Although Co-teaching shows good noise resistance, its performance is still low, indicating that methods for label noise in classification are not effective for retrieval. (3) NPC outperforms the baselines by a large margin, with only a slight performance drop as the noise increases. Even compared with DPR-C, NPC still achieves competitive results.

Analysis
Ablations of Noise Detection and Noise Correction: To gain better insight into NPC, we conduct ablation studies on the realistic-noisy dataset StaQC and the synthetic-noisy dataset NQ under a noise ratio of 50%. The results are shown in Table 3. "De" and "Co" refer to noise detection and noise correction, respectively; "HN" indicates whether the "Hard Negative" strategy is used. For both synthetic and realistic noise, the noise detection module brings a significant gain regardless of the negative sampling strategy. Correction also enhances the robustness of the retriever, since the rectified soft labels it provides smooth the model output. Combining the two obtains better performance than using either the detection or the correction module alone.
Impact of Warmup Epoch: As noted above, NPC starts with a warmup stage. In Table 4, we pre-train the retriever on the noisy dataset for warmup and show the performance of NPC with various warmup epoch numbers n. In this experiment, we adopt the "Hard Negative" sampling strategy. NPC achieves good results when the warmup epoch number is relatively small (1-10). However, when the warmup epoch number is too large, performance degrades; we believe a prolonged warmup causes overfitting to the noisy samples.
Impact of Iterative Detection: During training, NPC performs iterative noise detection every epoch. A straightforward alternative is to detect the noise only once after warmup and fix the estimated flag set {ŷ_i}. To study the effectiveness of iterative detection, we conduct an ablation study; the results are shown in Table 5. Model performance degrades after removing iterative detection.
Ablations of PPL: We distinguish noisy pairs by the perplexity between the annotated positive document and easy negatives. When calculating the perplexity, using "Hard Negative" samples would hinder detection. We conduct ablation experiments to verify this; as shown in Table 5, computing the perplexity with "Hard Negative" samples results in performance degradation.
Visualization of Perplexity Distribution: In Fig. 4, we illustrate the perplexity distribution of training pairs before and after warmup, after training with DPR, and after training with NPC. The experiment is on NQ under a noise ratio of 50%. The perplexity of most noisy pairs is larger than that of the clean pairs after warmup, which verifies our hypothesis in Sec. 3.1. Comparing Fig. 4(c) and Fig. 4(d), we find that a retriever trained with DPR overfits the noisy pairs, whereas NPC enables the retriever to correctly distinguish clean and noisy pairs because it avoids the dominant effect of noise during network optimization.
Analysis of Generalizability: Fig. 5 shows the performance of DPR and NPC under noise ratios ranging from 0% to 80%. As the noise ratio increases, the performance degradation of DPR is much larger than that of NPC, which demonstrates the generalizability of NPC. Furthermore, even though NPC is designed to deal with mismatched-pair noise, it achieves competitive results in the noise-free setting.
Related Work

Dense Retrieval
Dense retrieval has shown better performance than traditional sparse retrieval methods (Lee et al., 2019; Karpukhin et al., 2020a). Studies of dense retrieval fall into two categories: (1) unsupervised pre-training for better initialization and (2) more effective fine-tuning on labeled data. In the first category, some researchers focus on automatically generating contrastive pairs from a large unsupervised corpus (Lee et al., 2019; Chang et al., 2019; Ma et al., 2022; Li et al., 2022); another line of research enforces the model to produce an information-rich CLS representation (Gao and Callan, 2021a,b; Lu et al., 2021). As for effective fine-tuning strategies (He et al., 2022b), recent studies show that negative sampling techniques are critical to the performance of dense retrievers. DPR (Karpukhin et al., 2020b) adopts in-batch negatives and BM25 negatives; ANCE (Xiong et al., 2021), RocketQA (Qu et al., 2021), and AR2 (Zhang et al., 2022a) improve hard negative sampling via iterative replacement, denoising, and an adversarial framework, respectively. Several works distill knowledge from a ranker to the retriever (Izacard and Grave, 2020; Yang and Seo, 2020; Ren et al., 2021; Zeng et al., 2022). Other studies incorporate lexical-aware sparse retrievers to convey lexical-matching knowledge to dense retrievers (Shen et al., 2023; Zhang et al., 2023).
Although the above methods have achieved promising results, they depend heavily on correctly matched data, which is difficult to guarantee in real scenarios; the mismatched-pair noise problem has seldom been considered. Besides, some studies utilize large generative models (He et al., 2023) to guide retrievers and achieve impressive performance without paired data (Sachan et al., 2021, 2022; Gao et al., 2022; He et al., 2022a). Although these models exhibit some robustness to noisy data, their success depends on the availability of strong generative models, and their applicability is limited in domains where generative models do not perform well.

Denoising Techniques
One task related to our work is learning with noisy labels. Numerous methods have been proposed for this problem, most focusing on the classification task (Han et al., 2020). Some works design robust loss functions to mitigate label noise (Ghosh et al., 2017; Ma et al., 2020); another line of work identifies noise in the training set using the memorization effect of neural networks (Silva et al., 2022; Liang et al., 2022; Bai et al., 2021).
These studies mainly focus on classification. NPC instead studies the mismatched-pair noise problem in dense retrieval, rather than noise in category annotations, which is more complex to handle. Several pre-training approaches have noticed the problem of mismatched noisy pairs. ALIGN (Jia et al., 2021) and CLIP (Radford et al., 2021) claim that utilizing large-scale image-text pairs lets them ignore the existence of noise. E5 (Wang et al., 2022) employs a consistency-based rule to filter the pre-training data. Although these works acknowledge the existence of noisy pairs during pre-training, none of them offers a specialized solution or extensively explores the characteristics of noisy text pairs. Some recent works (Huang et al., 2021; Han et al., 2023) study the noisy correspondence problem in cross-modal retrieval. Although the mismatched-pair noise problem in cross-modal retrieval shares similarities with that in text retrieval, the settings and methods in the two areas are notably distinct, and it is challenging to directly apply these cross-modal works to document and code retrieval. NPC is the first systematic work to explore mismatched-pair noise in document/code retrieval.

Conclusion
This paper explores a neglected problem in dense retrieval, i.e., mismatched-pair noise. To solve this problem, we propose a generalized Noisy Pair Corrector (NPC) framework, which iteratively detects noisy pairs each epoch based on perplexity and then provides rectified soft labels via an EMA model. Experimental results and analysis demonstrate that NPC effectively handles both synthetic and realistic mismatched-pair noise.

Limitations
This work mainly focuses on training the dense retrieval models with mismatched noise.There may be two possible limitations in our study.
1) Due to limited computing infrastructure, we only verified the robustness of NPC within the classical retriever training framework. We leave combining NPC with more effective retriever training methods, such as distillation (Ren et al., 2021) and AR2 (Zhang et al., 2022a), as future work.
2) Mismatched-pair noise may also exist in other tasks, such as recommender systems.We will consider extending NPC to more tasks.

Figure 1 :
Figure 1: Two examples from the StaQC training set. In the bottom example, the given code is mismatched with the query, since it cannot answer the query.

Figure 2 :
Figure 2: Effect of matched & mismatched pairs in training. Green objects refer to annotated pairs, while the pentagram and triangle are the actually aligned pairs. In the left case, retrieval models are required to pull the query and the true-positive document (TP Doc) together and push the query and the true-negative documents (TN Doc) apart. In the right case, the retrieval models are misled by the mismatched data pair, where the false-positive document (FP Doc) and the false-negative document (FN Doc) are wrongly pulled and pushed, respectively.

Figure 3 :
Figure 3: Overview of noise detection and noise correction. (a) Procedure of noise detection. At each epoch, we first calculate the perplexity of all training query-document pairs using the retriever θ; next, we fit the perplexity distribution with a Gaussian Mixture Model to get the correctly-matched probability of each pair; finally, we estimate the flag set {ŷ_i}_{i=1}^N by thresholding. (b) Framework of noise correction. Given a batch of data pairs, where d_{i,1}^- is the hard negative of q_i and (q_3, d_3^+) is the estimated noisy pair, the retriever θ and teacher θ* compute similarity matrices S_θ and S_θ* for all queries and documents, respectively. The retriever learns to minimize (1) L_cont, the negative log-likelihood of the true positive documents, and (2) L_cons, the KL divergence between S_θ and the rectified soft label S_θ* after normalization.

Figure 4 :
Figure 4: Perplexity distribution of training pairs under different settings.

Figure 5 :
Figure 5: Retrieval performance of DPR and NPC on NQ dev set under different noise ratios.

Table 1 :
Retrieval performance on StaQC and SO-DS, which are realistic-noisy datasets. The results of the first block are mainly borrowed from published papers.

Table 2 :
Retrieval performance on Natural Questions and Trivia QA under noise ratios of 20% and 50%, respectively. The results of BM25 * and DPR * are borrowed from Karpukhin et al. (2020a). If the results are not provided, we mark them as "-".

Table 3 :
Ablation studies on StaQC dev set and NQ dev set under noise ratio of 50%.

Table 4 :
Performance of NPC on NQ dev set with different warmup epoch number n.

Table 5 :
Ablation studies of iterative noise detection and perplexity variants