Masked Language Models Know Which are Popular: A Simple Ranking Strategy for Commonsense Question Answering



Introduction
Commonsense reasoning has made progress over recent years (Rajani et al., 2019; Tamborrino et al., 2020; Lin et al., 2021; Liang et al., 2021, 2022), arising from the advent and wide application of pre-trained language models (PLMs). Most current commonsense reasoning studies focus on multiple-choice question answering (QA), such as CommonsenseQA (Talmor et al., 2019) and Social IQa (Sap et al., 2019b), for which a well-designed model is required to determine which of the candidate choices best answers the question. However, such multiple-choice QA models may not be helpful in practical scenarios where candidate answers are not provided (e.g., answering a question asked in a search engine or during a conversation).
Towards this problem, Boratko et al. (2020) present a novel question/answer dataset, ProtoQA, for generative QA, in which several plausible answers are generated as a ranked list rather than selected from candidates. For example, as shown in Figure 1, given the question "Name something that people often remember for a long time, even when they get old", a QA model is expected to generate commonsensical and typical answers that cover as many of the most common clusters as possible. In this case, a combination of "first love", "friends", and "name" would receive the highest score when three answers are allowed. In this setting, generative language models (GLMs) are apt to generate plausible answers (Ma et al., 2021; Chang and McCallum, 2022). However, in our preliminary experiments, we observe that GLMs such as GPT-2 (Radford et al., 2019), T5 (Raffel et al., 2020), and BART (Lewis et al., 2020) have difficulty distinguishing the most typical answers from rare ones. Meanwhile, Zhou et al. (2020) find that masked language models (MLMs) such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), which utilize bidirectional contexts, are more capable of learning commonsense knowledge than unidirectional LMs (UniLMs) such as GPT-2. Based on this observation, we pose a question: can MLMs be utilized to promote the typicality of answers generated by GLMs?
Figure 2: The pipeline framework and the end-to-end framework. In the pipeline, the ranker reranks the generated answers to yield the final results. In end-to-end, the trained agent (GLM) directly yields the final results; during the training phase, the agent passes the generated answers to the environment and receives feedback rewards to update its parameters.
To this end, we propose a simple ranking strategy for MLMs to model the typicality of answers. In this strategy, an MLM is trained purely on the dataset, without extra knowledge. To further increase the discrimination of the MLM, it is trained with the original answers as positive samples and negative samples gleaned from WordNet (Miller, 1994). After training, it serves as a ranker to find the most popular answers among the generations of a fine-tuned GLM.
On top of that, we attempt to take advantage of MLMs' discrimination to improve the generation probability of typical answers by GLMs. Inspired by reinforcement learning (RL, Kaelbling et al., 1996), we construct a network with two PLMs: an agent (GLM) and an environment (MLM). We apply policy gradient (Sutton et al., 1999) to train the agent in three steps. First, the agent samples one answer for each question. Second, the environment, which is trained in advance with our ranking strategy, calculates a reward for every generated answer. Third, the agent updates its parameters according to both the ground truth answers and the rewards from the environment. During inference, the final answers are generated by the trained agent without post-processing. The pipeline and the end-to-end frameworks are illustrated in Figure 2.
We design a series of experiments on ProtoQA to comprehensively examine our proposed ranking strategy. We develop our trials from the UniLM GPT-2 to T5 and BART, which are sequence-to-sequence GLMs. For MLMs, we investigate BERT, RoBERTa, and DeBERTaV3 (He et al., 2021). The effectiveness of our strategy is evidenced by the experimental results: a leap (over 11 points) in the pipeline framework and a modest improvement (around 3 points) in the end-to-end framework.
Our research reveals that MLMs can learn to tell which answers are more popular, with little or even no external knowledge. Moreover, they can guide the training of GLMs by providing higher rewards for popular answers, and consequently the trained GLMs assign higher generation probabilities to typical answers.

Preliminary
We introduce the concept of reinforcement learning and the essentials of policy gradient in this section.
Reinforcement Learning. Along with supervised learning and unsupervised learning, reinforcement learning (RL) is one of the three basic machine learning paradigms. It is composed of five elements: agent, environment, state, action, and reward. The agent takes actions within the environment and its state changes accordingly. In return, the environment feeds back a reward to the agent.
Policy Gradient. In RL, the actor does not know whether an action is correct or not; it can only judge the quality of the action by the rewards. If an action gets more rewards, the actor increases the probability of its occurrence; if fewer, the probability decreases. Given a neural network with parameters $\theta$ and its state-action sequence $\tau$, the expected reward of this network is the sum of the product of the likelihood of each sequence $p_\theta(\tau)$ and its corresponding reward $R(\tau)$. The objective function is $\max_\theta \bar{R}_\theta$, where $\bar{R}_\theta = \sum_\tau R(\tau)\, p_\theta(\tau)$.
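Written out, this is the standard REINFORCE formulation; the sampled-trajectory estimate below is the textbook form rather than a quotation from this paper:

```latex
\bar{R}_\theta = \sum_{\tau} R(\tau)\, p_\theta(\tau),
\qquad
\nabla_\theta \bar{R}_\theta
  = \sum_{\tau} R(\tau)\, p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)
  \approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^{(n)})\, \nabla_\theta \log p_\theta(\tau^{(n)}),
\quad \tau^{(n)} \sim p_\theta .
```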
In our task, an answer may not be absolutely right or wrong, so a reward R(·) is introduced to indicate the typicality of the answer. In our end-to-end framework, the agent, environment, and action are a GLM, an MLM, and the answers generated by the GLM, respectively. The GLM is trained to receive the maximum expected reward, with policy gradient applied to its training in the end-to-end framework.

Ranking Strategy
We train MLMs with (question||answer, typicality) pairs to model the distribution of typical answers.
We denote a question with $s$ words as $q = \{q_1, q_2, \ldots, q_s\}$ and its original answer set with descending typicality as $\tilde{A}^q = \{(\tilde{A}^q_1, c^q_1), \ldots, (\tilde{A}^q_k, c^q_k)\}$, where each answer $\tilde{A}^q_i$ is composed of $u$ words $\{a^q_{i1}, a^q_{i2}, \ldots, a^q_{iu}\}$ and $c$ is the typicality. We set the typicality of negative answers to zero, so the negative set is $\bar{A}^q = \{(\bar{A}^q_1, 0), \ldots, (\bar{A}^q_n, 0)\}$ and the compound answer set is $A^q = \tilde{A}^q \cup \bar{A}^q$. We assume that the frequency can depict the distribution of typical answers, where $freq^q_i = c^q_i / \sum_{j=1}^{k} c^q_j$. The typicality predicted by the ranker is the score $score_{q, A^q_i}$ that the MLM assigns to the concatenated sequence $q \,\|\, A^q_i$ (Eq 2). The scores are converted to an estimated probability of being typical by the softmax function, so that negative values (scores) can be assigned to negative samples:

$$\sigma(q, A^q_i) = \mathrm{softmax}\big(score_{q, A^q_i} / t\big), \qquad (4)$$

where $t$ is the temperature hyperparameter. Then, we compute the Kullback-Leibler divergence $L_{kl}(q)$ between this probability and the target distribution $freq^q$.
The above formulas only determine the relative numeric relationship between positive and negative answers. To ensure answers' class labels (positive or negative), we introduce a binary cross-entropy term $L_{bce}$ to constrain their value ranges. We use the least typical positive answer $\tilde{A}^q_k$ and one negative answer $\bar{A}^q_1$ to calculate the loss:

$$L_{bce}(q) = -\log \mathrm{sigmoid}\big(score_{q, \tilde{A}^q_k}\big) - \log\Big(1 - \mathrm{sigmoid}\big(score_{q, \bar{A}^q_1}\big)\Big).$$

The parameters of the ranker are updated with the objective function $\min_q \big(L_{kl}(q) + L_{bce}(q)\big)$.
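A minimal PyTorch sketch of this per-question ranker loss is shown below. The flat scoring interface, the temperature value, and the presence of at least one negative are illustrative assumptions; only the loss structure (temperature softmax over all candidates, KL divergence against the frequency targets, plus a BCE term on the least typical positive and one negative) follows the description above.

```python
import torch
import torch.nn.functional as F

def ranker_loss(scores, counts, num_pos, temperature=1.0):
    """Sketch of the ranker loss for one question.

    scores: tensor [k + n], MLM scores of all (question || answer) pairs,
            positives first (descending typicality), then negatives.
    counts: tensor [k], typicality counts c_i of the positive answers.
    num_pos: k, the number of positive (ground-truth) answers.
    """
    # Target distribution: answer frequencies, zero mass on negatives.
    freq = torch.zeros_like(scores)
    freq[:num_pos] = counts / counts.sum()

    # Predicted distribution: temperature softmax over all candidates (log-probs for kl_div).
    log_pred = F.log_softmax(scores / temperature, dim=-1)
    l_kl = F.kl_div(log_pred, freq, reduction="sum")

    # BCE on the least typical positive and the first negative, pinning absolute value ranges.
    pair = torch.stack([scores[num_pos - 1], scores[num_pos]])
    l_bce = F.binary_cross_entropy_with_logits(pair, torch.tensor([1.0, 0.0]))

    return l_kl + l_bce
```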
This strategy is also applicable without negative answers, and it then becomes a knowledge-free ranking strategy, where the objective function is $\min_q L_{kl}(q)$.
In the pipeline framework, a GLM needs a standard fine-tuning. After training, the fine-tuned GLM generates an answer set for every input question. Then, the ranker estimates the typicality of every question||answer pair with Eq 2-Eq 3. The score is finally converted to the range (0, 1) with the sigmoid function. The most popular answers are regarded as the final results.
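A sketch of this inference step, assuming hypothetical helpers generate_answers (sampling from the fine-tuned GLM) and ranker_score (the trained MLM ranker applied to a question||answer pair):

```python
import torch

def rerank(question, glm, ranker, top_k=10):
    """Pipeline inference: sample answers with the GLM, rerank them with the MLM ranker."""
    candidates = generate_answers(glm, question)                             # hypothetical sampling helper
    typicality = [torch.sigmoid(ranker_score(ranker, question, a)).item()    # map score to (0, 1)
                  for a in candidates]
    ranked = sorted(zip(candidates, typicality), key=lambda pair: -pair[1])
    return [answer for answer, _ in ranked[:top_k]]
```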

GLM's Training with Policy Gradient
Inspired by reinforcement learning, we set an agent and an environment in our end-to-end framework. The agent is a GLM which is responsible for answer generation. The environment is an MLM discerning how typical the generated answers are. The parameters of the environment are well-trained with the ranking strategy (Section 3.1) and fixed during the training of the agent.
Specifically, given a question $q$, the GLM samples several answers $\hat{A}^q$. Then the MLM calculates a reward indicating the typicality of each answer $\hat{A}^q_i$ and feeds it back to the agent. We optimize the GLM by maximizing the overall expected reward, i.e., minimizing

$$L_1(q) = -\sum_{i} R(q, \hat{A}^q_i)\, \log P(\hat{A}^q_i \mid q),$$

where $R$ denotes the reward yielded by the environment. It is the score (Eq 3) normalized by the sigmoid function:

$$R(q, \hat{A}^q_i) = \mathrm{sigmoid}\big(score_{q, \hat{A}^q_i}\big).$$

In addition, we utilize the ground truth answers $\tilde{A}^q_i$ to supervise the training of the GLM, with the cross-entropy loss

$$L_2(q) = -\sum_{i} \log P(\tilde{A}^q_i \mid q),$$

where $P(\tilde{A}^q_i \mid q)$ is computed as in Eq 8. As a whole, the objective function of the GLM is $\min_q (\alpha \cdot L_1(q) + \beta \cdot L_2(q))$, where $\alpha$ and $\beta$ are hyperparameters.
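One agent update can be sketched as follows. The helpers sample_answer, environment_score, and answer_log_prob are hypothetical stand-ins for the GLM's sampling routine, the frozen MLM's scoring head, and the GLM's log-likelihood of a given answer; the structure (reward-weighted policy-gradient term plus supervised cross-entropy, mixed by α and β) mirrors the objective above.

```python
import torch

def agent_step(glm, mlm_env, optimizer, question, gold_answers, alpha=1.0, beta=1.0):
    """One policy-gradient update of the agent (GLM); the environment (MLM) stays frozen."""
    # 1) The agent samples an answer for the question.
    sampled, log_prob = sample_answer(glm, question)   # hypothetical: text + log P(answer | question)

    # 2) The frozen environment scores the sample; sigmoid maps the score to a reward in (0, 1).
    with torch.no_grad():
        reward = torch.sigmoid(environment_score(mlm_env, question, sampled))

    # 3) Policy-gradient term: raise the log-probability of highly rewarded answers.
    l1 = -(reward * log_prob)

    # 4) Supervised term: cross-entropy on the ground-truth answers.
    l2 = -sum(answer_log_prob(glm, question, answer) for answer in gold_answers)

    loss = alpha * l1 + beta * l2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```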

Dataset
We evaluate our methods on a generative commonsense QA dataset, ProtoQA (Boratko et al., 2020), instead of multiple-choice benchmarks. It consists of around 9k commonsense reasoning questions over prototypical situations. The dataset splits used in our experiments follow the partition of Boratko et al. (2020): 8,782 / 52 / 102 questions for train, dev, and test, respectively. The average number of answers to each question in the train set is 5.
The answers for the test set of ProtoQA are held by the AllenAI Leaderboards and are not public. As a result, the traps of test data leakage and parameter overfitting are eliminated from our results.

Negative Samples Preparation
From preliminary experiments, we have found that MLMs possess the ability to tell whether a sentence is grounded in commonsense or not. To refine the discriminative ability of the ranker/environment, we construct negative samples with the following three strategies. The gleaned negatives are displayed in Table 1.
Synset. For each answer in ProtoQA, from its "brothers" (hyponyms of its hypernym), we choose the furthest "brother" as the negative, according to the jcn-similarity (Jiang and Conrath, 1997) between synsets in WordNet (Miller, 1994).
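A rough sketch of this step with NLTK's WordNet interface is given below; restricting the lookup to the first noun synset and using the Brown information-content corpus for jcn similarity are simplifying assumptions on our part.

```python
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

brown_ic = wordnet_ic.ic("ic-brown.dat")   # information-content corpus required by jcn similarity

def synset_negative(answer_word):
    """Pick the 'brother' (co-hyponym) that is furthest from the answer by jcn similarity."""
    synsets = wn.synsets(answer_word, pos=wn.NOUN)
    if not synsets:
        return None
    target = synsets[0]                                     # first sense only (simplification)
    brothers = [h for hyper in target.hypernyms() for h in hyper.hyponyms() if h != target]
    scored = []
    for brother in brothers:
        try:
            scored.append((target.jcn_similarity(brother, brown_ic), brother))
        except Exception:                                   # some synset pairs lack IC values
            continue
    if not scored:
        return None
    _, furthest = min(scored, key=lambda pair: pair[0])     # lowest similarity = furthest brother
    return furthest.lemmas()[0].name().replace("_", " ")
```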
Definition. For each question in ProtoQA, we collect a set of negatives from among the "brothers" and the "father" (hypernym) of every answer, according to their definitions in WordNet. A word whose definition embedding has less than 0.5 cosine similarity with those of all answers to the question is regarded as a negative sample. The sentence embedding of a definition is obtained with bert-base-nli-mean-tokens (Reimers and Gurevych, 2019).
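A possible implementation with sentence-transformers is sketched below. The encoder name and the 0.5 threshold follow the text above, while taking only the first synset of each word and the exact candidate pool are simplifications.

```python
from nltk.corpus import wordnet as wn
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("bert-base-nli-mean-tokens")

def definition_negatives(answer_words, threshold=0.5):
    """Collect 'brothers' and 'fathers' whose WordNet definitions are dissimilar to all answers."""
    answer_defs = [wn.synsets(w)[0].definition() for w in answer_words if wn.synsets(w)]
    if not answer_defs:
        return []
    answer_emb = encoder.encode(answer_defs, convert_to_tensor=True)

    negatives = []
    for word in answer_words:
        for synset in wn.synsets(word)[:1]:                         # first sense only (simplification)
            for hypernym in synset.hypernyms():
                for candidate in [hypernym] + hypernym.hyponyms():  # the 'father' and the 'brothers'
                    cand_emb = encoder.encode(candidate.definition(), convert_to_tensor=True)
                    if util.cos_sim(cand_emb, answer_emb).max().item() < threshold:
                        negatives.append(candidate.lemmas()[0].name().replace("_", " "))
    return negatives
```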
Echo. We have observed that GLMs would answer the questions with words from the question stems, which are definitely not the expected answers in most cases. So we select words (nouns, verbs, and adjectives) appearing in the questions, and their antonyms (only for adjectives), as negatives.
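A sketch of this strategy with NLTK part-of-speech tagging; the Penn Treebank tag prefixes and the antonym lookup via WordNet are our assumptions.

```python
import nltk
from nltk.corpus import wordnet as wn

def echo_negatives(question):
    """Use content words from the question stem, plus antonyms of its adjectives, as negatives."""
    tagged = nltk.pos_tag(nltk.word_tokenize(question))
    negatives = []
    for word, tag in tagged:
        if tag.startswith(("NN", "VB", "JJ")):           # nouns, verbs, adjectives
            negatives.append(word.lower())
        if tag.startswith("JJ"):                         # antonyms only for adjectives
            for synset in wn.synsets(word, pos=wn.ADJ):
                for lemma in synset.lemmas():
                    negatives.extend(ant.name() for ant in lemma.antonyms())
    return list(dict.fromkeys(negatives))                # deduplicate, preserve order
```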

Baselines
Due to the generative requirement of this task, old-fashioned classifier models are not applicable. We compare our methods with all the baselines reported by Boratko et al. (2020): Human, QA Model, GPT-2, and GPT-2 FT. We also report the results of fine-tuning T5 (Raffel et al., 2020) and BART (Lewis et al., 2020) on the original dataset. In addition, two studies report results only on the dev set (Ma et al., 2021; Chang and McCallum, 2022), and we compare our results on the dev set with theirs in Appendix D.

Evaluation
We follow the metrics proposed for ProtoQA by Boratko et al. (2020): Max Answers @ k and Max Incorrect @ k. Employing the Hungarian matching algorithm (Kuhn, 1955; Munkres, 1957), the metrics compute the optimal matching between the answers and the clusters based on the reward matrix, where the rewards are equal to the sizes of the clusters. The WordNet Similarity scores are given by the AllenAI Leaderboards.
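The assignment step behind these metrics can be illustrated with SciPy's Hungarian solver. This sketch assumes the answer-cluster matches have already been decided (the official scorer additionally handles exact-match and WordNet-based answer matching) and uses a simplified normalization, so it is only meant to show the optimal-matching idea.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def max_answers_at_k(matched_clusters, cluster_sizes, k):
    """Illustrative Max Answers @ k.

    matched_clusters: list (one entry per ranked answer) of sets of cluster ids the answer matches.
    cluster_sizes: dict mapping cluster id -> cluster size (the reward for covering it).
    """
    predictions = matched_clusters[:k]
    clusters = list(cluster_sizes)
    reward = np.zeros((len(predictions), len(clusters)))
    for i, matches in enumerate(predictions):
        for j, cluster in enumerate(clusters):
            if cluster in matches:
                reward[i, j] = cluster_sizes[cluster]
    rows, cols = linear_sum_assignment(reward, maximize=True)      # optimal answer-cluster matching
    return reward[rows, cols].sum() / sum(cluster_sizes.values())  # simplified normalization
```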

Parameters
The experimental results are mainly produced with the following parameters. AdamW is the optimizer for all models.
GLMs. We fine-tune the GPT-2 model with a batch size of 8 and a gradient accumulation of 1, with the other settings following the parameters of the best-performing model of Boratko et al. (2020). GLMs are trained for 1 epoch, with learning rates of 1e-5 and 1e-3 for BART and T5, respectively. For a fair comparison, we follow their generation settings.
Agents. The environment is actually a ranker trained with the above parameters. The agent is trained with a weight decay of 1e-5. The epochs and the coefficients (α, β) in the overall loss function vary with the GLM and MLM combinations; they are listed in Appendix B.

Results
In this section, we report results in both the pipeline and end-to-end frameworks. The ablation study, further analyses of the two frameworks, and the case study are located in Sections 5.2, 5.3, 5.4, and 5.5. Results on the dev set are listed in Appendices D and E. In addition, we test our method on the multiple-choice dataset CommonsenseQA (Talmor et al., 2019) in Section 5.6.

Main Results
The main and best results are listed in Table 2. The GPT-2 RL model applies policy gradient to train the agent with rewards calculated by the environment. In the pipeline framework, we rerank the sampled answers generated by GLMs, denoted with "+rerank". We report reranked results of GPT-2 (vanilla), GPT-2 FT (fine-tuned with the train set), and GPT-2 RL. It should be noted that the ranker in the pipeline is DeBERTaV3 trained for 1 epoch with the train set plus 10 negatives per question.
The ranker raises the scores of fine-tuned models by more than 11 points on average, a more pronounced gain than that of GPT-2 RL. It indicates that the typicality of answers is beneficial to the training of rankers; nevertheless, the environment in end-to-end provides a relatively weak influence on the training of the agent. The biggest leap is in the reranked results of vanilla GPT-2, especially Max Answers @ 1 (41.9, even better than GPT-2 FT's 36.4). It corroborates that GPT-2 is an implicit source of world knowledge and thus, even without fine-tuning, the more popular answers can be excavated by the ranker. From the fact that the increases of GPT-2 FT+rerank and GPT-2 RL+rerank are almost on a par, we conjecture that the policy gradient method increases the probability of typical answers and does not have deleterious effects on the overall diversity of generated answers.

Ablation Study
We conduct three ablation experiments (Table 3), and the corresponding findings are in bold.
A filter or binary ranker still works. For the pipeline framework, we degrade our softmax model to a filter, which filters GPT-2 FT's outputs ranked by their occurrence. The results in the second line show that it can cross out less popular answers to increase typicality. Moreover, we train a hard-label ranker, where 1 is assigned to the ground truth and 0 to the negatives. The reranked results with this binary ranker in the third line prove that our ranking strategy also works for a binary ranker.
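The hard-label variant can be sketched as a plain BCE over all candidates, using the same hypothetical flat scoring interface as in the earlier ranker sketch:

```python
import torch
import torch.nn.functional as F

def binary_ranker_loss(scores, num_pos):
    """Hard-label ablation: label 1 for ground-truth answers, 0 for negatives."""
    labels = torch.zeros_like(scores)
    labels[:num_pos] = 1.0
    return F.binary_cross_entropy_with_logits(scores, labels)
```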
The ranking strategy and policy gradient are reliable. For the end-to-end framework, we disregard the ground truth and eliminate $L_2$ (Eq 10) from the objective function. The results in the last line are much better than GPT-2's. It indicates that the reward is helpful to a vanilla PLM and therefore verifies the effectiveness of both our ranking strategy and the application of policy gradient. Nevertheless, the result is worse than GPT-2 FT's, since that baseline was fine-tuned with ground truth. Therefore, the ground truth answers are still more valuable for PLM training than the pure reward.

Analysis on Ranking Strategy
The effectiveness of our simple ranking strategy is evaluated from two aspects: we experiment with 1) different numbers of epochs and negatives on GPT-2; and 2) different GLMs and MLMs.

Epochs and Negative Samples
To investigate the effect of the number of epochs and negatives, we train the ranker with different (# epochs, # negatives) combinations and test them in the pipeline framework. We experiment with (1, 10), (1, 15), (5, 10), (5, 15), and (10, 5). Figure 3 depicts the scores and increments on each metric. Rankers of different combinations all significantly surpass the baseline on all metrics. Among them, there is a clear trend over the metrics, except for Max Answers @ 1.
From the overall perspective, the ranker learns the most important weights in the early epochs, especially the first epoch. The rankers trained for one epoch gain the greatest average increase (the top three in the rightmost bar chart). On one hand, the increase falls as the number of training epochs rises, though there is still an apparent improvement in the worst case (an average increase of 8.3 points). On the other hand, the appropriate number of negatives is correlated with the number of epochs. For few epochs, too many negatives may distract attention from the positive answers. For moderate numbers of epochs, which may lead to overfitting, a shortage of negatives would intensify the imbalance. This explains why (1, 10) outperforms (1, 15) while (5, 15) exceeds (5, 10).
To verify the effect of negative samples, we compare the one-epoch rankers trained with and without negatives. The solid line represents the ranker trained without negatives, and it generally lies under (1, 10) and (1, 15). Although its overall increment is slightly inferior to (1, 10) and (1, 15), it exceeds the other combinations. So the negatives are not a decisive factor in our ranking strategy, but they are valuable for boosting performance. Therefore, the negatives are the icing on the cake, and our ranking strategy can be knowledge-free, i.e., trained without negatives.
The wide ranges of epochs and negatives demonstrate the robustness of our ranking strategy.

Choices of MLMs
Reranked results of different GLM+MLM combinations are displayed in Table 4. Table 5 lists the average scores and the standard deviations of the same MLM against different GLMs. Among the MLMs, there are large differences between their average scores, but each model has a small variance. It indicates that, although MLMs have their own upper bounds, their discrimination of answers generated by different GLMs is fairly stable. Taking both the mean and the standard deviation into consideration, DeBERTaV3 is the best ranker, followed by RoBERTa. The discriminative ability of BERT is insufficient compared with the other two MLMs, and it even holds back the scores of GPT-2 FT.
The consistent reranked results of different GLMs further demonstrate that they can obtain the ability to answer commonsense questions after simple fine-tuning (the sampled answers can cover typical ones for the rankers to choose from, so similar reranked results are achieved). A suitable MLM can be utilized as a post-processor to effectively improve the typicality of the generations.

Analysis on Policy Gradient
In the end-to-end mode, the typicality of the answers generated by the agents trained with policy gradient is shown in Table 6. The majority of models improve the typicality of the first answer most (Max Answers @ 1), which also occurs in the pipeline.
However, the relative strength among MLMs is obscure. For GPT-2, DeBERTaV3 makes improvements on all metrics; RoBERTa has more merits than faults; BERT is deleterious on the whole, which may be the result of error propagation from the environment to the agent through the supervision signal. For T5, RoBERTa is the most helpful environment, followed by DeBERTaV3 and BERT. For BART, RoBERTa is still the best environment; BERT improves most on average but has one metric lagging behind; DeBERTaV3 hardly plays a role.

Table 4: The results of the pipeline framework on test set. FT denotes a fine-tuned GLM without reranking. The underlined scores are lower than the first row within the section. We also report our fine-tuned GPT-2 (in the second line) to show that it is our ranking strategy that contributes to the increase, rather than fine-tuning tricks.
We conjecture that the typicality of answers is difficult for GLMs to digest, so their generation probability of popular answers increases modestly and similarly after RL, no matter which MLM serves as the environment. Due to the more elaborate selection of hyperparameters, the larger time consumption (Appendices B and C), and the smaller gains of end-to-end models compared with reranking, the pipeline framework is much more economical.

Case Study
Since the answers of the test set are not public, we gather different models' outputs for a question in the dev set (Figure 4), taking GPT-2 as an example. GPT-2 FT generates 4 typical answers which cover 3 clusters. All models in the pipeline pick out at least 6 popular answers, two more than the baseline, and they also cover more unique clusters. So do the end-to-end models, but they cover fewer clusters on average than the pipeline models. The priority of typical answers among models in the pipeline varies more widely than among those in end-to-end. In addition, the order of answers generated by the end-to-end models is similar to the baseline's (+DeBERTaV3 vs GPT-2 FT) as well as to each other's (+RoBERTa vs +BERT). These observations are consistent with the previous experimental results: 1) pipeline models perform better than end-to-end models; 2) the increases of pipeline models differ significantly while end-to-end models have similar improvements; 3) the parameters of GLMs are only slightly steered by policy gradient.

Results on CommonsenseQA
Although the number of answers varies with the question in this generative task, our method can be easily adapted to multiple-choice QA (most existing methods are tailored to multiple-choice QA, but they are not adaptable to our answer generation task). We assign a typicality value of 1 to the true answer and 0 to the wrong answers. The multiple-choice ranker is trained with the same process as in Section 3.1. During inference, for each question, the ranker calculates the probability of each (question, answer) pair. The candidate with the highest probability is the predicted answer.
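The inference step thus reduces to an argmax over the candidates; ranker_score below is the same hypothetical scoring helper as in the earlier sketches.

```python
import torch

def predict_choice(question, candidates, ranker):
    """Multiple-choice inference: the candidate with the highest ranker probability is predicted."""
    scores = torch.tensor([ranker_score(ranker, question, c) for c in candidates])  # hypothetical helper
    probabilities = torch.softmax(scores, dim=-1)
    return candidates[int(probabilities.argmax())]
```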
DeBERTaV3 is trained with a learning rate of 1e-5 for 3 epochs. The accuracies on the dev and test sets of CommonsenseQA (Talmor et al., 2019) are 83.8 and 77.4, respectively. We rank 9th among the single models; the Top-15 single models on the CommonsenseQA leaderboard are listed in Appendix G.

Related Work
Prior works evaluate LMs against commonsense benchmarks to probe the commonsense knowledge learned by LMs. Recent studies have shown that PLMs are implicit sources of world knowledge (Davison et al., 2019; Petroni et al., 2019). Ettinger (2020) finds that BERT struggles with challenging commonsense inferences but is robust on within-category distinctions. Meanwhile, Zhou et al. (2020) find that MLMs are better at learning commonsense knowledge than unidirectional LMs. Based on these findings, we assume that PLMs can generate commonsensical answers, and we utilize ProtoQA (Boratko et al., 2020) to explore the discriminative ability of MLMs. ProtoQA has been used by Ma et al. (2021) to study the effect of fine-tuning and prompt methods (Li and Liang, 2021; Shin et al., 2020) on PLMs' learning process, and by Chang and McCallum (2022) to compare the quality of the distributions generated by different LMs for answering ambiguous questions.

Goodfellow et al. (2014) proposed generative adversarial networks (GANs), a training method for generative models. To make it feasible for discrete probabilistic models, Yu et al. (2017) proposed SeqGAN, an extended GAN with a reinforcement learning-based generator (Sutton and Barto, 2018), to solve the sequence generation problem. Inspired by SeqGAN, we apply policy gradient methods (Sutton et al., 1999) to optimize the agent with a reward function to guide the policy, where an MLM serves as the environment.

Conclusion
In this work, we propose a simple ranking strategy for masked language models (MLMs) to find typical answers among the generations of generative language models (GLMs). It is a knowledge-free strategy when training without negatives. In addition, we apply policy gradient to the training of GLMs to improve their generations in the end-to-end mode. Comprehensive experiments demonstrate the effectiveness of our proposed strategy and the discriminative ability of MLMs.

Limitations
Our research investigates the discriminative ability of MLMs in the pipeline framework and the end-to-end framework, and these two paradigms both increase the typicality of generated answers. However, there still exists a large gap between the performance of well-designed models and that of humans. In addition, the reasoning of neural models lacks transparency and interpretability. We argue that it would be more beneficial to investigate reasoners that can utilize commonsense knowledge bases such as ConceptNet (Liu and Singh, 2004) and ATOMIC (Sap et al., 2019a). Knowledge-aware models can incorporate external knowledge as relational inductive biases in an explicit way. This would enhance their reasoning capability and improve the transparency of model behaviors for more interpretable predictions.

D Reranked Results of Pipeline Models on Dev Set

We report the reranked results of different GLM+MLM combinations on the dev set in Table 11.

E Results of End-to-end Models on Dev Set
We report the results of GPT-2, BART, and T5 on dev set in Table 12.
F Metadata of the Example in Figure 4

Table 13 and Table 14 show the raw answers and the clustered answers of dev question r2q15, respectively.

G CommonsenseQA Leaderboard
Top-15 single models are listed in Table 15.
Data split: dev.crowdsourced | ID: r2q15 | Question: Name something that people often remember for a long time, even when they get old.

Figure 1 :
Figure 1: An example from the ProtoQA dataset. The reasonable answers are collected and categorized into clusters, with the numbers indicating their typicality.

Figure 3 :
Figure 3: Effects of the number of negatives and epochs. The dotted line with crosses is the baseline GPT-2 FT*; the solid line with crosses is the ranker trained without negatives; the dashed lines with various markers are rankers trained with different (# epochs, # negatives) combinations. On the right side, the values in the blue bars are the increases on each metric compared with the baseline, along with the average increase over all metrics.

Figure 4 :
Figure 4: An example of answer generation in the pipeline and end-to-end frameworks. The first row is a question and its answers in clusters. The following rows are the generations of GPT-2 FT, the reranked outputs of GPT-2 FT, and the generations of GPT-2 RL. The third column is the number of unique clusters covered by the generations. Typical answers are in green, with superscripts indicating the clusters they belong to. Full data is listed in Appendix F.

Table 1 :
The statistics and instances of negative samples under different strategies. Negatives are chosen for the answer sunscreen and from the question stem. Average denotes the average number of negatives for each question in the train set.

Table 2 :
The main results on test set. Rows with * are reported by Boratko et al. (2020). The rest are our methods. The △ columns are the max, min, and average increments over the metrics, compared with the first row within the section. All scores are evaluated by the AllenAI Leaderboards.

Table 3 :
The results of the ablation study on test set. The underlined scores are lower than the baseline GPT-2 FT*.

Table 5 :
The discrimination of different MLMs. The values are derived from Table 4. AVE and STDE are the average and the standard deviation.

Table 6 :
The results of the end-to-end framework on test set. FT denotes a fine-tuned GLM without RL.

Table 9 :
Time consumption on average per epoch and the number of parameters.

Table 14 :
Answer clusters of dev question r2q15.