QADYNAMICS: Training Dynamics-Driven Synthetic QA Diagnostic for Zero-Shot Commonsense Question Answering

Zero-shot commonsense Question-Answering (QA) requires models to reason about general situations beyond specific benchmarks. State-of-the-art approaches fine-tune language models on QA pairs constructed from CommonSense Knowledge Bases (CSKBs) to equip the models with more commonsense knowledge in a QA context. However, current QA synthesis protocols may introduce noise from the CSKBs and generate ungrammatical questions and false-negative options, which impede the model's ability to generalize. To address these issues, we propose QADYNAMICS, a training dynamics-driven framework for QA diagnostics and refinement. Our approach analyzes the training dynamics of each QA pair at both the question level and the option level, discarding machine-detectable artifacts by removing uninformative QA pairs and mislabeled or false-negative options. Extensive experiments demonstrate the effectiveness of our approach, which outperforms all baselines, including LLMs such as ChatGPT, while using only 33% of the synthetic data. Moreover, expert evaluations confirm that our framework significantly improves the quality of QA synthesis. Our code and model checkpoints are available at https://github.com/HKUST-KnowComp/QaDynamics.


Introduction
The advent of various commonsense Question-Answering (QA) benchmarks (Talmor et al., 2021; Huang et al., 2019) has demonstrated that Pre-Trained Language Models (PTLMs) (Devlin et al., 2019; Lan et al., 2020) can achieve extraordinary performance when fine-tuned on these benchmarks. However, these neural systems have been criticized for only learning surface-level correlations and lacking the general semantic reasoning abilities that often require implicit commonsense knowledge (Branco et al., 2021; Zhou et al., 2021). To reliably assess the resilience of QA models across diverse domains, the zero-shot commonsense QA task has been proposed to evaluate the generalizable reasoning ability of a QA model (Li et al., 2020; Shwartz et al., 2020) without supervision signals from any QA benchmarks. Ma et al. (2021) introduced a technique for tackling this task by fine-tuning a PTLM on QA pairs synthesized from knowledge triples in CommonSense Knowledge Bases (CSKBs). The head and relation of a triple are transformed into a question using natural language templates, with the tail serving as the answer. Distractors, or negative examples, are tails of triples sampled from the same CSKB using pre-defined strategies, such as keyword or embedding-proximity filtering. However, the primary obstacle hindering further progress with this method is the quality of the synthetic QA dataset. Manually curated CSKBs often contain subtle but strong annotation artifacts (Zellers et al., 2019; Sakaguchi et al., 2021), which provide easy backdoors for the model to perform exceptionally well on synthetic test sets yet fail to generalize on held-out QA benchmarks. Additionally, the current QA synthesis process yields a significant number of ungrammatical questions, and the negative sampling strategy used to create distractors is not entirely effective at preventing false-negative options, as evidenced by Ma et al. (2021).
Despite the existence of dataset filtering algorithms, such as adversarial filtering (Zellers et al., 2018) for negative option selection, they have been shown to be less effective than random-selection baselines (Ma et al., 2021). This is because they focus only on model uncertainty in the final predictions, which is not effective enough for synthetic data that contains a plethora of noise and imbalanced examples (Appx. §A.1).
Instead of filtering data based on model uncertainty in the final predictions, we draw inspiration from Swayamdipta et al. (2020) and employ training dynamics as a more precise indicator that studies instance learnability across all training steps. While vanilla training dynamics regard each data instance as a whole without considering the learnability of each option, we propose QADYNAMICS, a training dynamics-driven framework for synthetic QA diagnosis and refinement that favors choice-level diagnosis. Specifically, our approach proposes a novel schema that offers greater flexibility in deriving the training dynamics of multiple-choice QA with an arbitrary number of options, thus accommodating the varying number of choices in different commonsense QA benchmarks. QADYNAMICS then analyzes the training dynamics of each option, greedily drops the easy distractor to reduce the impact of CSKB artifacts, and eliminates QA pairs containing mislabeled or false-negative options according to the confidence gap between options (§3). Extensive experiments showcase the efficacy and data efficiency of our proposed framework, which surpasses all previous zero-shot commonsense QA baselines while leveraging only 33% of the training data, and even outperforms GPT3.5 (Ouyang et al., 2022) and ChatGPT (§4.4). Further expert evaluations confirm the effectiveness of our proposed method in enhancing the quality of the synthetic QA set (§4.5).

Related Works
Zero-shot Commonsense QA

The task of zero-shot commonsense QA requires a QA model to answer commonsense questions from held-out benchmarks whose training data is inaccessible to the model. Existing approaches either leverage off-the-shelf language models in an unsupervised manner, unlocking their commonsense capability with inference-time mechanisms such as self-talk (Shwartz et al., 2020), cloze translation (Dou and Peng, 2022), and dynamic graph reasoning (Bosselut et al., 2021), or inject commonsense knowledge into PLMs by fine-tuning them on synthetic QA pairs constructed from CSKBs (Ma et al., 2021; Kim et al., 2022; Wang et al., 2023a; Zhang et al., 2022). While unsupervised approaches achieve only moderate performance, existing works following the fine-tuning regime have shown exceptional performance on various commonsense QA benchmarks. However, fine-tuning relies heavily on the quality of the training data, which is limited by both the knowledge quality and coverage of the CSKBs and the protocol for synthesizing them into QA pairs, as discussed in §1.

Dataset Diagnostic
Diagnosing individual data instances within a large dataset has long been an important aspect of machine learning for NLP (Deng et al., 2023). Various data attribution methods have been proposed to retrieve the training instances that may have led to a particular prediction (Pezeshkpour et al., 2021; Xie et al., 2023). Building on this, Pezeshkpour et al. (2022) proposed a method to efficiently detect dataset artifacts in the training data using data attribution methods when a challenging validation set is available. While these methods focus on the impact of individual instances on specific predictions, more generalized and precise dataset diagnostic approaches have also been proposed (Swayamdipta et al., 2020; Ethayarajh et al., 2022). These approaches aim to understand the difficulty of learning specific instances and can detect annotation artifacts and perform automatic data corrections, such as mislabeling detection. However, none of these methods explicitly considers QA benchmarks, where each QA pair contains more than one piece of knowledge: to fairly evaluate the attribution of a QA pair, all of its options must be considered.

QADYNAMICS
This section outlines our proposed framework, QADYNAMICS, which consists of four steps: (1) calculate the training dynamics for each option in a QA pair; (2) refine the QA pair by eliminating the easy distractor; (3) filter out QA pairs that may be mislabeled or contain false-negative distractors; (4) train the model using a marginal ranking loss.

Preliminary
We follow the pipeline and task definition formulated by Ma et al. (2021). Each knowledge triple (h, r, t) in a CSKB D is transformed into a (Q, A) pair, where the question Q is synthesized from the head h and relation r via natural language templates, and the option set A = {A_1, ..., A_m} contains the tail t as the ground-truth answer A_1 alongside m − 1 sampled distractors. The goal is to train a QA model θ on the synthetic dataset D_Q = {(Q, A) | (h, r, t) ∈ D} and test θ on held-out QA benchmarks.
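For illustration, below is a minimal sketch of this synthesis protocol. The template wording, the example triple, and the function names are hypothetical; the actual templates and distractor-sampling strategies follow Ma et al. (2021).

```python
# Hypothetical template for the ATOMIC relation "xEffect"; the real
# natural-language templates follow Ma et al. (2021).
TEMPLATES = {"xEffect": "{head}. What happens as a result?"}

def synthesize_qa(head, relation, tail, distractor_tails):
    """Turn a CSKB triple (head, relation, tail) into a (Q, A) pair.

    The tail serves as the ground-truth answer A_1; tails sampled from
    other triples in the same CSKB serve as distractors.
    """
    question = TEMPLATES[relation].format(head=head)
    options = [tail] + distractor_tails  # index 0 is the ground truth
    return {"question": question, "options": options, "label": 0}

qa_pair = synthesize_qa(
    head="PersonX skips lunch",
    relation="xEffect",
    tail="PersonX feels hungry",
    distractor_tails=["PersonX buys a new car", "PersonX gets promoted"],
)
print(qa_pair["question"], qa_pair["options"])
```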

Training Dynamics of QA Pairs
Following Ma et al. (2021), the QA model is trained by fine-tuning a pre-trained masked language model. For a given (Q, A) pair, Q is first concatenated with every option A_i ∈ A to obtain the input sequence T_i. We then repeatedly mask out one token of T_i at a time and calculate the model's masked loss. The logit score of T_i with n tokens t_1, ..., t_n is calculated by:

$$S(T_i) = -\frac{1}{n}\sum_{k=1}^{n}\log P\big(t_k \mid T_i \setminus \{t_k\}\big) \tag{1}$$

Intuitively, the option with the lowest logit score is selected as the answer. Based on this, we introduce our proposed schema for calculating the training dynamics of (Q, A) at both the pair level and the option level. Following Swayamdipta et al. (2020), we train a QA model θ′ on D_Q and save E checkpoints during training. Denoting T_j as the input sequence with the second-lowest logit score among those containing a distractor, the model's confidence of T_1 (the concatenation of Q and A_1) being correct is:

$$\text{conf}(T_1) = \sigma\big(S(T_j) - S(T_1)\big) \tag{2}$$

where σ is the sigmoid function. Similarly, the confidence of a distractor's input sequence T_i being wrong is defined as:

$$\text{conf}(T_i) = \sigma\big(S(T_i) - S(T_1)\big), \quad 2 \le i \le m \tag{3}$$

Based on the confidences of all options, we formulate the confidence of a (Q, A) pair as their average:

$$\text{conf}(Q, A) = \frac{1}{m}\sum_{i=1}^{m}\text{conf}(T_i) \tag{4}$$

Finally, following Swayamdipta et al. (2020), we derive scores for each option and QA pair at each of the E checkpoints using the equations defined above. The final confidence and variability scores are obtained by calculating the average and standard deviation of these scores across the E checkpoints (more detailed explanations in Appx. §A.1).
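The sketch below illustrates how these quantities might be computed, assuming the per-option logit scores of Equation (1) have already been collected at each of the E checkpoints; all function and variable names are ours, not part of a released implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def option_confidences(scores):
    """Per-checkpoint confidences for one QA pair.

    scores: logit scores [S(T_1), ..., S(T_m)] from one checkpoint, with
    the ground-truth option at index 0; lower scores are preferred.
    """
    correct, distractors = scores[0], scores[1:]
    # T_j: the distractor with the second-lowest logit score (Eq. 2),
    # which is less likely to be a false negative than the hardest one.
    second_lowest = np.sort(distractors)[1]
    conf = np.empty_like(scores)
    conf[0] = sigmoid(second_lowest - correct)  # correct option (Eq. 2)
    conf[1:] = sigmoid(distractors - correct)   # distractors (Eq. 3)
    return conf

def training_dynamics(checkpoint_scores):
    """checkpoint_scores: (E, m) array of logit scores over E checkpoints."""
    per_ckpt = np.stack([option_confidences(s) for s in checkpoint_scores])
    option_conf = per_ckpt.mean(axis=0)        # per-option confidence
    option_var = per_ckpt.std(axis=0)          # per-option variability
    pair_conf = per_ckpt.mean(axis=1).mean()   # QA-pair confidence (Eq. 4)
    return option_conf, option_var, pair_conf

# Example: 4 checkpoints, 3 options (ground truth first; scores are masked losses).
scores = np.array([[1.0, 1.4, 2.5],
                   [0.8, 1.5, 2.8],
                   [0.6, 1.6, 3.0],
                   [0.5, 1.7, 3.2]])
print(training_dynamics(scores))
```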

Option Selection
To reduce artifacts in the synthetic QA set that may have originated from the CSKBs, we adopt an approach similar to AFLite (Bras et al., 2020) and remove negative knowledge that the model can easily identify. We achieve this by discarding the distractor with the highest confidence score, which indicates that the model may be exploiting potential biases, as it consistently assigns this option a high score. We then concatenate the modified option set A′, containing the original ground-truth answer and the m − 2 distractors that are more challenging to distinguish, with the original question Q to yield a more challenging (Q, A′) pair. This option-level selection strategy is termed Difficult Choice.
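A minimal sketch of Difficult Choice, assuming per-option confidence scores averaged over checkpoints as above (the names and example values are ours):

```python
def difficult_choice(options, option_conf):
    """Drop the distractor with the highest confidence score.

    options: list of option strings with the ground truth at index 0;
    option_conf: per-option confidence averaged over checkpoints.
    A consistently high confidence means the model finds the distractor
    trivially wrong, hinting at an exploitable artifact.
    """
    distractor_ids = range(1, len(options))
    easiest = max(distractor_ids, key=lambda i: option_conf[i])
    return [opt for i, opt in enumerate(options) if i != easiest]

options = ["feels hungry", "buys a new car", "gets promoted"]
print(difficult_choice(options, option_conf=[0.62, 0.71, 0.93]))
```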

QA Pair Selection
Next, to improve the quality of the synthetic QA set, we remove poor-quality QA pairs that contain either of the following two types of options:

Mislabeled Ground-Truth Option. We remove QA pairs whose correct answer is associated with very low confidence, indicating that it is potentially mislabeled (Swayamdipta et al., 2020).
False-Negative Distractor. We remove QA pairs where the difference in confidence score between the ground-truth answer and the distractor with the highest confidence score is insignificant, indicating a potential false negative (see the sketch below).
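Both filters reduce to thresholding on confidence scores; a minimal sketch follows, where the threshold values are illustrative placeholders rather than the values tuned in our experiments:

```python
def keep_qa_pair(option_conf, mislabel_thresh=0.2, false_neg_margin=0.05):
    """Apply both QA-pair filters; returns False if the pair is dropped.

    option_conf: per-option confidence with the ground truth at index 0.
    The two threshold values here are illustrative placeholders.
    """
    correct_conf = option_conf[0]
    top_distractor_conf = max(option_conf[1:])
    if correct_conf < mislabel_thresh:
        return False  # Mislabeled: ground-truth answer has very low confidence
    if abs(correct_conf - top_distractor_conf) < false_neg_margin:
        return False  # False negative: the confidence gap is insignificant
    return True

print(keep_qa_pair([0.15, 0.80, 0.90]))  # dropped as mislabeled
print(keep_qa_pair([0.88, 0.60, 0.90]))  # dropped as a potential false negative
```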

Model Training
Finally, we fine-tune θ on our cleaned synthetic QA set using a marginal ranking loss. With the score of each option defined in Equation (1), the marginal ranking loss, with η being the margin, is:

$$\mathcal{L} = \frac{1}{m-1}\sum_{i=2}^{m}\max\big(0,\ \eta + S(T_1) - S(T_i)\big) \tag{5}$$
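A PyTorch sketch of this loss under our notation (the margin value is illustrative; `scores` holds the logit scores of Equation (1) with the ground truth at index 0):

```python
import torch

def marginal_ranking_loss(scores, eta=1.0):
    """Marginal ranking loss over one QA pair (Eq. 5).

    scores: tensor [S(T_1), ..., S(T_m)] with the ground truth at index 0;
    lower scores are preferred, so the loss pushes S(T_1) below every
    distractor's score by at least the margin eta.
    """
    correct, distractors = scores[0], scores[1:]
    return torch.clamp(eta + correct - distractors, min=0.0).mean()

loss = marginal_ranking_loss(torch.tensor([0.6, 1.4, 2.9]), eta=1.0)
print(loss)  # 0.5 * (max(0, 1.0 + 0.6 - 1.4) + max(0, 1.0 + 0.6 - 2.9)) = 0.1
```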

Experiments

Dataset Statistics
In our method, we set a threshold to filter mislabeled and false-negative data out of the entire dataset.
Intuitively, it is essential to establish the accuracy and reliability of the data before proceeding with any further division or analysis. The threshold is decided based on rough observations of the QADYNAMICS score distributions, emphasizing the balance between quantity and quality.
The specific statistics are shown in Tab. 2. As mentioned by Ma et al. (2021), human accuracy on the ATOMIC and CWWV synthetic data is 78.0% and 80.7%, respectively, implying that roughly 20% of the data is flawed. The data automatically flagged by our strategy amounts to 4.74% of the total data, which is close to 25% of this poor-quality or grammatically incorrect data. Most of it lies in the low-confidence region, indicating our framework's contribution to purifying low-quality data.

Experiment Setup and Baselines
We use accuracy as the evaluation metric. To derive the QADYNAMICS of the synthetic QA entries, we use RoBERTa-large (Liu et al., 2019) as the backbone of θ′, and for our final QA model θ, we use DeBERTa-v3-large (He et al., 2023). We use different models because RoBERTa-large offers faster training and inference, and intuitively, a model can hardly be expected to learn from data that is itself difficult to learn. We compare our results against several baselines to demonstrate the effectiveness of our training dynamics-driven data selection. First, we include baselines trained on 33%, 66%, and 100% of the synthetic QA pairs generated with keyword filtering or AFLite for distractor selection. We also report the performance of Large Language Models (LLMs), including GPT3.5 (Brown et al., 2020; Ouyang et al., 2022) and ChatGPT (OpenAI, 2022), as competitive baselines. For a fair comparison, we also compare our framework with the original training dynamics-based data selection (Swayamdipta et al., 2020) using equal amounts of training data (33% and 66%). We select QA pairs that are easy-to-learn, ambiguous, or hard-to-learn according to their confidence and variability distribution, and perform mislabeling correction on the hard-to-learn data, as done by Swayamdipta et al. (2020). For our framework, we utilize our proposed Difficult Choice selection (§3.3) in combination with the QA pair selection strategies (§3.4). Furthermore, we run our framework on the 50% of QA pairs with the lowest confidence to show its effectiveness on hard-to-learn data. More explanations are provided in Appx. §A.1.

Results
As discussed in §1, our framework surpasses all baselines, including GPT3.5 and ChatGPT, while using only 33% of the synthetic data, demonstrating the reliability of combining all proposed techniques in QADYNAMICS. Ablation studies are provided in Appx. §A.3.

The Effect of Option Selection
To verify the effectiveness of our option selection strategies (§3.3), we recruited five graduate students specializing in machine commonsense to evaluate the quality of 100 randomly sampled synthetic QA pairs selected by various strategies. The experts were asked to annotate whether a QA pair is plausible (the question and answer form a plausible piece of commonsense knowledge), mislabeled (the ground-truth answer is incorrect), or contains a false-negative distractor (a distractor that is semantically correct). Our results, presented in Tab. 3, are consistent with the intended effect of both strategies, which successfully reduce the ratio of mislabeled and false-negative examples. We also observe that jointly adopting both strategies benefits all three metrics, which supports the success of our best system in §4.4. Case studies are provided in Appx. §B.

Conclusions
In this paper, we propose QADYNAMICS, a training dynamics-empowered framework for data-efficient zero-shot commonsense QA that jointly considers learning difficulty at both the QA-pair and option levels. Our framework, on average, achieves state-of-the-art performance, significantly surpassing large language models and all baselines with only 33% of the training data. Further expert evaluations showcase that our proposed method effectively eliminates poor-quality QA entries from the synthetic dataset.

Limitations
The major limitation of QADYNAMICS is that our improved schema for assessing the training dynamics of a QA pair requires at least three options. This is because we consider all distractors when evaluating the confidence of the ground-truth answer and the entire QA pair, requiring more than one distractor to ensure precision. While most synthetic QA sets satisfy this requirement, some QA benchmarks have only two options per question, such as WinoGrande (Sakaguchi et al., 2021) and aNLI (Nie et al., 2020). In such cases, the original training dynamics proposed by Swayamdipta et al. (2020) can be leveraged to deal with binary questions. We believe this limitation is minor compared with the data-cleaning effect of QADYNAMICS.

Ethics Statement
This paper uses datasets and benchmarks solely for research purposes, consistent with their intended usage. The expert student annotators recruited for this study were well-trained and agreed to participate voluntarily without payment.
Since QADYNAMICS is a QA model rather than a generative model, it does not produce additional biased content. Therefore, to the best of our knowledge, this paper does not involve any ethical concerns.
A.1 Details of Training Dynamics

Suppose the logit score of the ground-truth option is -1. In this case, the confidence assigned to the correct choice is 0.65, while the confidence assigned to the distractors is uniformly 0.91, indicating that the logit of the ground-truth answer is relatively low. Moreover, a model in the early training stage may make random guesses toward the answer, with a probability of approximately 1/m for each candidate. The probability of the correct choice should gradually approach 1, resulting in lower confidence in the ground-truth answer than in the distractors. Additionally, affected by false-negative distractors, the confidence in the correct option may be underestimated relative to its true value. To alleviate the effect of data imbalance and false-negative choices, as defined in Equation (2), we compute the confidence by comparing the logit score of the correct answer only with that of the easier distractor, which is less likely to be a false negative. To verify the above statements, we compute the density of the difference between the logits of ground-truth answers and distractors. As shown in Fig. 1, compared to Softmax, our method has a higher density in the vicinity of 0, indicating that the difference between logit scores is decreased. Our method thus narrows the distribution gap between positive and negative options. With the above definition, high confidence in the correct choice indicates a high probability of it being chosen, and low confidence may indicate that the question is mislabeled.
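The following sketch contrasts the two formulations on hypothetical logit scores, where a near-tied (hard, possibly false-negative) distractor suppresses the softmax confidence of the correct option while our pairwise sigmoid does not:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical logit scores (lower = preferred), ground-truth option first.
# The second option (0.6) is a hard, possibly false-negative distractor.
scores = np.array([0.5, 0.6, 3.0])

# Vanilla training dynamics: softmax over negated scores. The near-tied
# distractor keeps the correct option's probability close to 0.5.
print(softmax(-scores)[0])          # ~0.50

# Ours (Eq. 2): compare the correct answer only against the easier
# distractor (second-lowest logit among the distractors).
easier = np.sort(scores[1:])[1]
print(sigmoid(easier - scores[0]))  # ~0.92
```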

Figure 1: The density of the difference between the confidence of ground-truth answers and distractors.
Unlike the natural language inference data used in Dataset Cartography (Swayamdipta et al., 2020), evaluating the confidence of a given QA pair requires considering the confidence of all available options. As a result, we define the confidence of a QA pair as in Equation (4). A higher confidence for a QA pair indicates that the positive choice is more likely to be selected, while the distractors are less likely to be chosen. To implement the Difficult Choice selection method, we remove the distractor with the higher confidence. When we apply this method to the synthetic QA dataset, which has three candidates per question, 33% of the data is discarded, leaving 66% of the total data. For the Hard-to-learn subset containing 50% of the total data, the amount of data becomes 33%.
As stated by Ma et al. (2021), the synthetic QA dataset includes ungrammatical questions as well as false-negative distractors that appear plausible within the QA pair. Moreover, Dataset Cartography (Swayamdipta et al., 2020) suggests that confidence can also be used as a flag to identify mislabeled instances in the dataset. Thus, to deal with these two issues, we propose two strategies: Mislabeled removal and False-Neg. removal (§3.4). Mislabeled removal excludes QA pairs with a low-confidence ground-truth answer, while False-Neg. removal excludes QA pairs whose correct answer and a distractor have similar logits.

A.2 Implementation Details
In this section, we introduce the implementation of our system. For hyperparameter tuning, following Ma et al. (2021), we set the batch size to 32, the max sequence length to 128, the weight decay to 0.01, and the warmup proportion to 0.05. We use the AdamW optimizer (Loshchilov and Hutter, 2019) with the learning rate set to 5e-6 in all experiments. We evaluate our models on the validation set of the synthetic datasets every 1000 steps and save the checkpoint with the highest validation accuracy. Each experiment is repeated three times with different random seeds, and the average performance is reported. For computing resources, all of our experiments are conducted on 4 NVIDIA RTX A6000 GPUs, each with 48GB of memory. Our code for zero-shot commonsense QA is mainly based on the code repository provided by Ma et al. (2021), and all pre-trained language models are from the Huggingface Transformers library (Wolf et al., 2020).
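For reference, the sketch below gathers these settings into a single training configuration; the field names are ours, while the values come from the text above.

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    # Hyperparameters follow Ma et al. (2021); the field names are ours.
    batch_size: int = 32
    max_seq_length: int = 128
    weight_decay: float = 0.01
    warmup_proportion: float = 0.05
    learning_rate: float = 5e-6   # AdamW (Loshchilov and Hutter, 2019)
    eval_every_steps: int = 1000  # validate on the synthetic dev set
    num_random_seeds: int = 3     # repeat runs and report the average

config = TrainConfig()
print(config)
```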

A.3 Ablation Study
In this section, we ablate the different components of our framework to determine the impact of the different data selection methods and strategies. Four critical components refine the QA dataset: Low-Confidence (hard-to-learn) Selection, Difficult Choice Selection, Mislabeled removal, and False-Neg. removal. The data selection details include adopting Mislabeled removal and False-Neg. removal on the total data, selecting the 50% of the data with the lowest confidence, and discarding the distractor with the higher confidence.
To study the effect of each component, we train DeBERTa-v3-Large as the backbone while sequentially dropping the four components mentioned above, one at a time. Their out-of-distribution performances on five different tasks are shown in Tab. 5. The results show that Difficult Choice Selection and Low-Confidence Selection are effective strategies for improving the generalization ability of the model, and that eliminating mislabeled examples and false-negative distractors also enhances overall performance.

A.4 Experiments on non-synthesized datasets
To assess the efficacy of our approach on non-synthesized datasets, we perform supplementary experiments using the training set of CommonsenseQA (Talmor et al., 2019). We then evaluate the model on the validation set of CommonsenseQA as well as on other datasets, such as SocialIQA (Sap et al., 2019b) and PIQA (Bisk et al., 2020), which serve as in-domain and out-of-domain evaluations, respectively. We observe that the Hard-to-learn, Ambiguous, and Difficult Choice selections boost out-of-domain performance compared to the baseline, indicating that dropping easy choices contributes to a better generalization ability of the model.

B Case Study
To further validate our framework, we present case studies by showcasing instances selected by the various strategies, as illustrated in Tab. 7.

False-Neg. Removal. Ma et al. (2021) noted the presence of false-negative distractors that are also plausible in the QA pair. To address this issue, we implemented False-Neg. removal, which is designed to detect such distractors; we provide three examples of this strategy in action in Tab. 7. As shown there, the confidence of a false-negative distractor is consistently close to 0.5, suggesting that its logit value is always in proximity to that of the correct choice during training, in line with the definition of False-Neg. removal. Moreover, based on these examples, we can infer that false-negative distractors are often caused by insufficient context, resulting in multiple correct choices.

Drop Easy Distractors. The Difficult Choice method discards the option with the higher confidence. To better understand the effectiveness of this method, we analyze the easy choices and identify two commonly occurring phenomena. First, the easier distractor is more likely to contain grammatical errors, as demonstrated by the Easy Distractors: Grammatical Error examples. Second, the easy choice frequently has poor contextual relevance, as shown in the Easy Distractors: Irrelevant to Context examples; both phenomena can also be found in other instances in Tab. 7. Removing options that exhibit these features improves the overall quality of the dataset, which can lead to better performance and more reliable results.

Mislabeled Removal. Following Swayamdipta et al. (2020), we developed Mislabeled removal to detect mislabeled examples by extracting QA pairs whose ground-truth answer has low confidence. We list three mislabeled examples in Tab. 7, where the confidence of the mislabeled option is relatively low compared to the other examples.

Figure 2: Demonstration of the synthetic QA dataset in the Zero-shot Commonsense QA task.

Table 2: Statistics of the number of QA pairs dropped by each strategy.

Table 4: Statistics of the validation set of each benchmark.