A Semantic-based Method for Unsupervised Commonsense Question Answering

Unsupervised commonsense question answering is appealing since it does not rely on any labeled task data. Among existing work, a popular solution is to use pre-trained language models to score candidate choices directly conditioned on the question or context. However, such scores from language models can be easily affected by irrelevant factors, such as word frequencies, sentence structures, etc. These distracting factors may not only mislead the model to choose a wrong answer but also make it oversensitive to lexical perturbations in candidate answers. In this paper, we present a novel SEmantic-based Question Answering method (SEQA) for unsupervised commonsense question answering. Instead of directly scoring each answer choice, our method first generates a set of plausible answers with generative models (e.g., GPT-2), and then uses these plausible answers to select the correct choice by considering the semantic similarity between each plausible answer and each choice. We devise a simple, yet sound formalism for this idea and verify its effectiveness and robustness with extensive experiments. We evaluate the proposed method on four benchmark datasets, and our method achieves the best results in unsupervised settings. Moreover, when attacked by TextFooler with synonym replacement, SEQA demonstrates much smaller performance drops than the baselines, thereby indicating stronger robustness.


Introduction
* Equal contribution. † Corresponding author: Minlie Huang.

Figure 1: Two examples of commonsense question answering, where the baseline (Pro-A) is oversensitive to lexical perturbations (SR for synonym replacement and ST for sentence structure transformation). The scores from Pro-A and our method for each answer choice are shown in the right columns. The underlined score indicates the answer choice selected by a method.

Pre-trained language models have been widely used for commonsense question answering. Fine-tuning pre-trained models on task-specific data produces many state-of-the-art results (Wang et al., 2020; Khashabi et al., 2020; Lin et al., 2019). However, this requires large amounts of labeled task data. Therefore, it is vital to study unsupervised commonsense question answering without relying on any labeled downstream task data. In this paper, we investigate multiple-choice commonsense question answering tasks in an unsupervised setting: given a question and a set of answer choices, a model is required to predict the most reasonable answer choice for the question, but without access to any labeled task data.
Many existing unsupervised methods tackle these tasks by scoring each answer choice using a language model, e.g., estimating the generative probability of the answer choice conditioned on the question (Trinh and Le, 2018; Shwartz et al., 2020; Tamborrino et al., 2020). Table 1 lists several typical score functions. However, these scores can be easily influenced by word frequencies, sentence structures, and other factors, which can mislead the models and make existing methods oversensitive to lexical perturbations (Abdou et al., 2020; Tamborrino et al., 2020). Figure 1 shows two examples. The correct choices are paraphrased via synonym replacement or structure transformation. In these examples, the baseline (Pro-A) produces much lower scores for the paraphrased choices and chooses the wrong choices.
Since existing methods can be easily distracted by irrelevant factors such as lexical perturbations, we argue that a commonsense question answering method should focus on the answers' semantics and assign similar scores to synonymous choices. To this end, we introduce a novel SEmantic-based Question Answering model, SEQA, which aims to robustly select correct answers in multiple-choice commonsense question answering in an unsupervised setting. Instead of directly scoring an answer choice, we calculate the probability of observing the choice's semantics. A choice's semantic score can be obtained by summing the generative probabilities of sentences that have the same semantic meaning as the choice, where these sentences are called the choice's supporters. However, it is hard to obtain supporters whose semantics exactly match the choice, so we reformulate the semantic score into a soft version, as explained in Section 3.2. Each supporter is weighted by its semantic similarity to the answer choice, which can be computed with off-the-shelf models such as SentenceBERT (Reimers and Gurevych, 2019). Since the supporters and their weights depend on the semantics rather than the surface form of the answer choice, the effects of the distracting factors can be largely suppressed. Moreover, synonymous choices are likely to share the same set of supporters, so their scores are expected to be stably close.

Our contributions in this paper are summarized as follows:

• We propose a semantic-based question answering model (SEQA) for robust commonsense question answering in an unsupervised setting. Instead of directly scoring the answer choices, our method first generates some plausible answers and then uses them to select the correct choice by considering the semantic similarity between each plausible answer and each choice.

• We conduct experiments on four commonsense question answering datasets, where SEQA achieves the best performance compared with strong baselines. When attacked by TextFooler (Jin et al., 2020) with synonym replacement, our method performs remarkably more robustly.

Related Work
Previous work has explored pre-trained language models (LMs) for unsupervised commonsense question answering. In general, these approaches treat LMs as question answering modules. Table 1 shows three representative methods, which do not use external knowledge and rely fully on the implicit knowledge encoded in LMs for reasoning. Probability-A (Pro-A) considers the generative probability of the choice conditioned on the question. However, it suffers from the statistical bias of choices, such as word frequency and sentence length (Abdou et al., 2020). To alleviate this, MutualInfo-QA (MI-QA) calculates the mutual information between the question and the choice. Another way to reduce the impact of statistical bias is to score each choice using the conditional probability of the question rather than the choice (Trinh and Le, 2018; Tamborrino et al., 2020), denoted as Probability-Q (Pro-Q) in Table 1.
Some recent work claims that external knowledge can benefit commonsense reasoning. Besides static knowledge bases (KBs), such as ConceptNet (Speer et al., 2017) and Atomic (Sap et al., 2019a), there are also numerous studies treating LMs as dynamic KBs. Petroni et al. (2019) show that LMs can be used for KB completion, and Davison et al. (2019) show that BERT can distinguish true from fake ConceptNet triplets. Further, the extracted knowledge can work as complementary information for answering a question. One line of work trains a generator on CommonSenseQA (Talmor et al., 2019) that produces explanations for questions, which are then used as additional inputs. The shortcoming of this approach is that it requires collecting human explanations for each new dataset to fine-tune LMs. Some subsequent studies explore unsupervised explanation/knowledge generators. CGA (Bosselut and Choi, 2019) employs COMET to generate intermediate inferences, which are then used to score the choice. However, COMET is limited to a small set of question types, so CGA is difficult to generalize to different domains. Self-Talk (Shwartz et al., 2020) removes this limit by extracting knowledge from GPT-2 (Radford et al., 2019), which has no restriction on the query types; thus, Self-Talk can be applied to a wide range of domains. Despite the introduction of auxiliary information, these methods essentially depend on language model scores, so they are still sensitive to lexical perturbations.
Besides directly using pre-trained LMs, some recent efforts have been dedicated to automatically constructing task-specific data to train commonsense reasoners in zero-shot settings. Wang et al. (2019) and Kocijan et al. (2019) provide rules to construct labeled training data from large corpora for pronoun disambiguation. Banerjee and Baral (2020), Moghimifar et al. (2020) and Ma et al. (2020) collect training data based on knowledge bases, such as Atomic (Sap et al., 2019a). Though effective, these methods are limited by specific task settings or are highly dependent on task-related knowledge bases, which makes them difficult to transfer to other commonsense reasoning tasks.

Method
In this paper, we focus on unsupervised multiple-choice commonsense question answering, which is formalized as follows: given a question Q and a set of choices {A_1, ..., A_n}, models should select the correct choice:

A* = argmax_{A_i} s(A_i | Q),    (1)

where s refers to a score function. Note that we have no access to any labeled task data.
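The selection rule above can be sketched in a few lines. The score function here is a hypothetical word-overlap stand-in, used only to keep the snippet self-contained; the actual methods in this paper score choices with language-model probabilities:

```python
def select_choice(question, choices, score):
    """Return argmax_{A_i} s(A_i | Q) over the candidate choices."""
    return max(choices, key=lambda a: score(a, question))

# Hypothetical toy score: word overlap with the question (NOT a method
# from this paper; real score functions use language-model probabilities).
def toy_score(answer, question):
    return len(set(answer.split()) & set(question.split()))

best = select_choice(
    "the glass fell off the table",
    ["the glass broke", "the man smiled"],
    toy_score,
)
print(best)  # "the glass broke" — it shares more words with the question
```

Any of the score functions in Table 1 can be dropped in for `toy_score` without changing the selection rule.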

Motivation
In existing unsupervised methods, the score functions are usually defined based on language model scores. Taking Pro-A (Table 1) as an example, it first converts the question into a declarative statement, and then takes the statement as a prompt to calculate the generative probability of each choice. Note that the templates for rewriting are not the focus of this paper; hence we directly use the templates of previous work (Shwartz et al., 2020; Tamborrino et al., 2020) for our method and all the baselines in this paper (see Appendix for details).
Though successful, language model scores can be affected by many distracting factors, such as word frequency and sentence structure, etc. These factors can disturb the score functions to a large extent, as shown in Figure 1. Our goal is to alleviate the influence of these distracting factors. Hence we propose a new method for unsupervised commonsense question answering, which achieves better results and performs more robustly.

SEQA
SEQA is designed to predict the semantic score of an answer choice A. Instead of directly estimating the probability P(A|Q) of the single choice A, the semantic score focuses on the probability P(M_A|Q), where M_A represents A's semantics. Ideally, we decompose P(M_A|Q) into the summation of the conditional probabilities of A's supporters, where the supporters are all possible answers that have exactly the same semantics M_A. Formally, the semantic score is defined as

s(A|Q) = P(M_A|Q) = Σ_{S ∈ 𝒜} I(S ∈ S_A) · P_LM(S|Q),    (2)

where S_A is the set of supporters of choice A, 𝒜 is the set of all possible answers, and I(S ∈ S_A) is an indicator function indicating whether S is a supporter of A. To obtain the supporter set S_A, we adopt a model to extract sentence-level semantic features. Ideally, the indicator function is defined as

I(S ∈ S_A) = 1 if cos(h_S, h_A) = 1, and 0 otherwise,    (3)

where h_A denotes the semantic features of sentence A, and we assume that S and A are exactly the same in semantics if h_S and h_A point in the same direction. However, Eq. (3) uses a hard constraint that cos(h_S, h_A) exactly equals 1, which can be too strict to find acceptable supporters. Therefore, we reformulate Eq. (2) into a soft version:

s(A|Q) = Σ_{S ∈ 𝒜} ω(S|A) · P_LM(S|Q),    (4)

where the indicator function in Eq. (2) is replaced by a soft function ω(S|A). To emulate I(S ∈ S_A), ω(S|A) is expected to meet three requirements: it should lie in [0, 1]; it should attain its maximum ω(A|A) = 1 when S and A share the same semantic features; and it should increase monotonically with the semantic similarity cos(h_S, h_A). There are several different definitions of ω(S|A) meeting these requirements, which are explored in Section 4.7.3. In this paper, ω(S|A) is defined as:

ω(S|A) = exp(cos(h_S, h_A) / T) / Z(T),    (5)

where T is the temperature, and Z(T) = exp(1/T) is a normalization term that makes ω(A|A) = 1. If T → 0, ω(S|A) degenerates to the indicator function. If T > 0, ω(S|A) relates to the von Mises-Fisher distribution over the unit sphere in the feature space, where the acceptable feature vectors are distributed around the mean direction h_A / ||h_A||. Since it is intractable to enumerate all possible answers in 𝒜, we convert Eq. (4) to an expectation over P_LM(S|Q):

s(A|Q) = E_{S ~ P_LM(·|Q)}[ω(S|A)]    (6)
       ≈ (1/K) Σ_{i=1}^{K} ω(S_i|A),    (7)

where S_1, ..., S_K are sentences sampled from P_LM(·|Q), and K is the sample size. h_A and h_{S_i} can be extracted from a pre-trained model, e.g., SentenceBERT (Reimers and Gurevych, 2019). From Eq. (7), we can see that the semantic score s(A|Q) depends only on the semantic features h_A, regardless of A's surface form. Therefore, our method will produce similar semantic scores for synonymous choices, assuming that synonymous choices have similar semantic features.
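As a minimal numeric sketch of the weight function of Eq. (5) and the Monte Carlo estimate of Eq. (7), assuming the cosine similarities have already been computed by some feature extractor:

```python
import math

def omega(cos_sim, T=0.1):
    """Voting weight of Eq. (5): exp(cos(h_S, h_A)/T) / Z(T), with Z(T) = exp(1/T)."""
    return math.exp(cos_sim / T) / math.exp(1.0 / T)

def semantic_score(cos_sims, T=0.1):
    """Monte Carlo estimate of s(A|Q) in Eq. (7): mean weight over K voters."""
    return sum(omega(c, T) for c in cos_sims) / len(cos_sims)

# omega(A|A) = 1: a sentence fully supports itself.
print(omega(1.0))               # 1.0
# Near-synonyms keep a sizeable weight; unrelated sentences are suppressed.
print(omega(0.95), omega(0.2))  # ~0.61 vs ~3e-4
```

With small T (e.g., the paper's default 0.1), the weight decays sharply as the similarity drops, approximating the hard indicator of Eq. (3).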

The Voting View of SEQA
At the beginning of Section 3.2, we define the semantic score as the summation of the conditional probabilities over the supporters. However, in Eq. (7), the sampled sentences S_1, ..., S_K are not A's supporters, because they may not be semantically similar to A. To address this difference, we name the sampled sentences S_1, ..., S_K voters, which are plausible answers to the question Q. In this section, we show another view of our method, in which the voters vote out the correct choice.

Figure 2: Process of SEQA in the view of voting. We use the same templates as previous work (Shwartz et al., 2020; Tamborrino et al., 2020) to rewrite interrogative sentences into declarative ones, and then use GPT-2 to generate some plausible answers as voters S_i, conditioned on the rewritten question. The choices and voters are encoded via SentenceRoBERTa to obtain semantic features h_{A_j} and h_{S_i}, which are then used to calculate the voting weights ω(S_i|A_j). The choice with the largest score s(A_j|Q) is selected as the answer.
Suppose there are two candidate choices A_1 and A_2; our method finds the correct choice according to the semantic scores s(A_1|Q) and s(A_2|Q). Following Eq. (6), our method can be decomposed into two steps. First, sample some voters S_1, ..., S_K from P_LM(·|Q); this step considers only the question Q and no candidate choices. Second, each voter votes for the choices with semantic similarity weights; for example, S_i votes for A_j with the weight ω(S_i|A_j). The candidate choice that receives more votes has a higher semantic score and is selected as the final answer. Figure 2 shows the process of SEQA in the view of voting. Although the voting view is intuitive, the formalism in Section 3.2 provides more insights: (1) Our method approximates the probability of semantics, which works as the theoretical basis of SEQA.
(2) Our method can be seen as an extension of Pro-A (see Table 1), since Pro-A only calculates the language model score for a single sentence, whereas our method calculates the semantic score for a set of supporters.
(3) Eq. (4) provides guidance, namely the three requirements mentioned before, for the design of the voting weight function ω(S|A). Specifically, this guidance explains the rationality of the formulation of Eq. (5).

Table 2: Evaluation results, including the original selection accuracy before attack, the accuracy after attack, the attack success rate, the percentage of perturbed words with respect to the original sentence length in successful attacks, and the semantic similarity between the original and paraphrased choices. GPT-2, RoBERTa and SRoBERTa refer to GPT-2-xlarge, RoBERTa-large and SentenceRoBERTa-large, respectively.
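The two-step voting procedure can be sketched end to end. The bag-of-words embedding below is a hypothetical stand-in for SentenceRoBERTa features, and the fixed voter strings stand in for GPT-2 samples, both only to keep the example self-contained:

```python
import math

def embed(sentence, vocab):
    # Toy bag-of-words vector standing in for SentenceRoBERTa features h.
    words = sentence.lower().split()
    return [words.count(w) for w in vocab]

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u)) or 1e-12
    nv = math.sqrt(sum(b * b for b in v)) or 1e-12
    return dot / (nu * nv)

def vote(voters, choices, T=0.1):
    """Step 2: each voter S_i contributes weight omega(S_i|A_j) to choice A_j."""
    vocab = sorted({w for s in voters + choices for w in s.lower().split()})
    scores = []
    for a in choices:
        h_a = embed(a, vocab)
        weights = [math.exp((cos(embed(s, vocab), h_a) - 1.0) / T) for s in voters]
        scores.append(sum(weights) / len(weights))
    return choices[scores.index(max(scores))], scores

# Step 1 (sampling voters from GPT-2) is replaced by fixed strings here.
voters = ["he fell asleep quickly", "the man fell asleep"]
winner, _ = vote(voters, ["he fell asleep", "he ate dinner"])
print(winner)  # "he fell asleep" — the choice the voters cluster around
```

The choice semantically closest to the bulk of the voters accumulates the larger average weight and wins, mirroring Figure 2.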

Datasets
We conducted experiments on four multiple-choice commonsense question answering tasks: COPA (Roemmele et al., 2011), StoryClozeTest (SCT) (Mostafazadeh et al., 2016), SocialIQA (Sap et al., 2019b) and CosmosQA (Huang et al., 2019). For each instance, only one choice is correct. See Appendix for more description of the datasets. For COPA, we reported the results on its test set. As the test sets of the other three datasets are hidden, for convenience of analysis, we reported the experiment results on their development sets.

Baselines
We employed five strong baselines. Table 1 shows three of them: Pro-A, Pro-Q and MI-QA. No explicit auxiliary information is used in these three methods, while the other two baselines rely on explicit information supplementation. CGA (Bosselut and Choi, 2019) and Self-Talk (Shwartz et al., 2020) query pre-trained language models (e.g., GPT-2, COMET) for relevant knowledge, which forms part of the context. Then, similar to Pro-A, they take the generative probabilities of choices as scores.

Experiment Settings
For each method, we tried different pre-trained language models (see Appendix for details), and then selected the pre-trained LMs that maximized the accuracy on each dataset. The details of the selection of pre-trained LMs can be found in Table 2.
For SEQA, we used GPT-2 to generate voters via Nucleus Sampling (Holtzman et al., 2020) with p = 0.9. The sample size K of voters is set to 500. In Section 4.7.2, we show that a small sample size can also lead to superior performance. Self-Talk and CGA also rely on the generated answers from GPT-2 or COMET. Different from SEQA, for these two baselines, more generated answers will not always lead to better performance (see Section 4.7.2). Thus, we selected the optimal sample size for them rather than the same sample size with SEQA.
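A sketch of the top-p filtering step at the core of Nucleus Sampling (Holtzman et al., 2020), written over a toy next-token distribution rather than real GPT-2 logits:

```python
def nucleus_filter(probs, p=0.9):
    """Keep the smallest prefix of tokens (by descending probability) whose
    cumulative mass reaches p, zero out the rest, and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]

filtered = nucleus_filter([0.5, 0.3, 0.15, 0.05], p=0.9)
print(filtered)  # the 0.05 tail token is dropped; the rest is renormalized
```

Sampling each next token from the filtered distribution cuts the low-probability tail while keeping generation diverse, which is why it suits voter generation.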
When evaluating SEQA on COPA, we tuned the temperature T on its development set, and then reported the results on the test set with the tuned temperature T = 0.1. Due to the absence of test sets for the other datasets, we evaluated SEQA on their development sets without tuning the temperature, directly setting T = 0.1. Table 2 shows the evaluation results on accuracy and robustness.

Accuracy
Among all the methods, SEQA achieved the best performance on all the datasets. Especially on SCT and CosmosQA, SEQA outperformed the best baselines by more than 10 points. It can be inferred that the semantic scores are beneficial for commonsense question answering due to the reduction of distracting factors. Pro-Q performed better than the other baselines on COPA, perhaps because it suffered less from the statistical bias of choices (Tamborrino et al., 2020). However, Pro-Q lost its superiority on the other three datasets, because it is unsuitable for processing long or complex contexts.

Robustness
To test the robustness under the synonym replacement attack, we used TextFooler (Jin et al., 2020) to attack the methods by perturbing the correct choices of the correctly predicted examples. The percentage of perturbed words refers to the percentage of words in choices that are replaced in successful attacks. The semantic similarity is measured between the paraphrased choice and the original choice. Considering the attack success rate and the after-attack accuracy, SEQA is much more robust than all baselines. Specifically, the attack success rates on SEQA are at least 39 points lower than those on Pro-A, CGA, and Self-Talk on all datasets. MI-QA and Pro-Q are designed to reduce the impact of statistical bias in choices, so they can resist lexical perturbation to some extent. Even so, the attack success rates on SEQA are remarkably lower than those on MI-QA and Pro-Q on all datasets.
One observation is that the attack success rate on SEQA is higher on CosmosQA than on the other datasets. The reason is that the contexts in CosmosQA are so complex that it is harder for GPT-2 to generate high-quality answers. With a more powerful generator, the robustness of SEQA is expected to improve further.

Consistency Testing
We have claimed that a commonsense question answering method should assign close scores to synonymous choices. To verify that SEQA better meets this requirement, we conducted consistency testing for all the methods on the four datasets. For each example, the consistency testing of a method is conducted in three steps: (1) Originally, the example has one correct and several wrong answer choices. We randomly sample some choices from other examples as additional wrong choices. After that, the example has one correct choice and 19 wrong choices.

Table 3: Average standard deviation of the ranks of the correct choice and its synonymous choices.

Method / Dataset   COPA   SCT   SocialIQA   CosmosQA
Pro-A               9.1   11.0     11.7        9.4
Pro-Q               6.9    8.5     11.6       12.3
MI-QA               7.5    5.8     11.1        7.9
Self-Talk          13.3    9.5     10.7       10.1
CGA                 9.7   11.0     10.9        9.5
SEQA                4.1    3.2      5.8        4.7
(2) Leverage a commonly used automatic translation service, Baidu Translation, to translate each choice from English into an intermediate language, and then back-translate it into English. During this process, we employ three intermediate languages, Chinese, Spanish, and Russian, because the translation quality of these languages is better than others. As a result, each choice is accompanied by three synonymous choices. (3) Use the commonsense question answering method to calculate the scores for each choice as well as its synonymous choices, and then sort all the choices according to their scores. Because the scoring scales of these methods differ, we calculate the standard deviation of the ranks of the correct choice and its synonymous choices. Table 3 shows the average standard deviation of the ranks. As expected, the average standard deviation of SEQA is much lower than that of any other method on all the datasets, confirming that SEQA assigns more similar ranks and closer scores to synonymous choices. We also observed that MI-QA provided relatively stable predictions compared with the other baseline methods. A possible explanation is that the normalization term P_LM(A) helps alleviate the influence of lexical perturbations.
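The consistency metric of step (3) can be sketched as follows, assuming the scores of all candidates (the choice, its paraphrases, and the distractors) have already been computed by some method:

```python
import statistics

def rank_std(scores, group):
    """Population std. dev. of the ranks (1 = highest score) that a method
    assigns to the correct choice and its back-translated paraphrases."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    rank = {idx: r + 1 for r, idx in enumerate(order)}
    return statistics.pstdev(rank[i] for i in group)

# A consistent method ranks paraphrases next to each other (low std. dev.)...
print(rank_std([0.9, 0.85, 0.8, 0.1], group=[0, 1, 2]))
# ...while an oversensitive one scatters them across the ranking.
print(rank_std([0.9, 0.5, 0.2, 0.6], group=[0, 2]))
```

Averaging this value over all examples yields the numbers reported in Table 3; the population (rather than sample) standard deviation is an assumption of this sketch.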

Trends of Accuracy with Answer Length
Answer length is also a type of distracting factor which may mislead baseline methods. To explore to what extent answer lengths affect the performance of methods, we divided the development set of CosmosQA into four subsets according to the length of the correct choice. Table 4 shows the results of SEQA and a robust baseline, MI-QA. Compared with MI-QA, SEQA has much more stable performance as answer lengths vary. The reason is that SEQA focuses on semantic information, so it has stronger resistance to such distracting factors.

Analysis on Temperature
In the previous experiments, the temperature T of SEQA was set to 0.1 by default. To investigate the influence of T, we varied T in a wide range from 0.05 to 10 and report the results in Table 5. Considering that the temperature varies greatly, the performance of SEQA is relatively stable, indicating that SEQA is not very sensitive to the selection of T. Another observation is that, although the four datasets differ in domain and text length, the trends of performance with temperature on them are relatively similar, illustrating that the temperature selected on one task can be generalized to other tasks.

Analysis on Sample Size

Figure 3 shows the effect of the sample size K on SEQA. For comparison, Figure 3 also includes the results of baselines in the before-attack and after-attack settings. Due to space limitations, the results on the other datasets are shown in the Appendix. As expected, the before-attack and after-attack accuracy on SCT increased with the sample size. In detail, performance increased rapidly when K < 100, and the improvement slowed down when K > 100; finally, SEQA achieved a stable and relatively high performance. CGA and Self-Talk also leverage LMs to generate some plausible answers. Different from our method, they use the generated answers to form part of the question, and then calculate the generative probability of the choice based on the augmented question. We also tried different sample sizes for these two methods, and Figure 3(a) shows that their accuracy does not stably increase with a larger sample size.

Figure 3: The before-attack (a) and after-attack accuracy (b) of methods with different sample sizes on SCT. The after-attack accuracy of Pro-A, CGA and Self-Talk is below 5.0%, and thus omitted in (b).

Analysis on ω(S|A)
ω(S|A) in SEQA can be defined in different forms, as long as the three requirements mentioned in Section 3.2 are met. Besides the default definition, we explored another three forms of ω(S|A); the experiment results on COPA are shown in Table 6. Although the performance varies with ω(S|A), the before-attack accuracy of SEQA still outperforms most of the baselines under any definition of ω(S|A). Moreover, SEQA maintains its clear advantage in after-attack accuracy, which reflects the inherent robustness of SEQA.

Table 7: SEQA's accuracy with different feature extractors and language models on COPA. Avg. GloVe means the average pooling of the pre-trained word embeddings (Pennington et al., 2014) over the sentence.

Table 8: Manual evaluation of the quality of voters (generated by GPT-2-xlarge conditioned on questions). Scores 3/2/1 correspond to high, middle and low quality, respectively, in terms of grammar and logicality.

Analysis on Pre-trained Language Model and Feature Extractor
SEQA has no limit on the selection of the pre-trained language model and the feature extractor. Table 7 shows how the accuracy of SEQA on COPA varies with the language model and the feature extractor. As expected, a more powerful extractor usually led to higher accuracy under the same language model settings. A similar conclusion holds for the language model. It can be inferred that, if more powerful language models or feature extractors become available, the performance of SEQA may be further improved.

Analysis on the Quality of Voters
While the performance of SEQA serves as an extrinsic evaluation of the quality of the voters (plausible answers sampled from P_LM(·|Q), described in Section 3.3), we were also interested in evaluating it intrinsically. We sampled 125 voters from COPA. For each voter, we provided crowdsourcing workers with the original question and asked them: 1) whether the voter is grammatical, not entirely grammatical but understandable, or completely not understandable; 2) whether the voter is a reasonable answer to the question, not reasonable but relevant, or completely irrelevant. These evaluation tasks comprehensively examined the voters in terms of grammar and logicality. The annotation tasks were carried out on Amazon Mechanical Turk, and we aggregated annotations from 3 workers using majority vote. Table 8 shows the results of the human evaluation of the voters, where scores 3/2/1 correspond to high, middle and low quality, respectively. According to the grammar scores, 97.6% of the voters are grammatical or at least understandable, indicating that most of the voters lie within the natural language space. In terms of logicality, 40.8% of the voters are reasonable answers to the questions, which may not be very satisfying. However, in Section 4.9, we show that SEQA makes predictions based on a small portion of voters, and hence SEQA is robust even though some voters are irrelevant.

Voting Weight Distribution
We visualize the cumulative proportion of voters favoring the correct or the wrong choices (see Figure 4). The curves are averaged over all instances in the test set of COPA, where we sampled 500 voters for each instance and set T = 0.1.
From the curves, we can find several properties of the voters: (1) The voters favor the correct choices over the wrong choices: the curve for correct choices is consistently above the curve for wrong ones. The area between the two curves shows the difference in semantic scores s(A_C|Q) − s(A_W|Q), which is a large gap compared with the area under the bottom curve. (2) 93.5% of the voters do not strongly favor either choice (|ω(S|A_C) − ω(S|A_W)| < 0.05), indicating that they are semantically irrelevant to both candidate choices. However, Table 8 shows that 40.8% of the voters are logically reasonable, so many voters are reasonable but irrelevant to both answers. This suggests that there can be several reasonable answers to a single question, and the sampled voters are diverse in semantics. (3) Although only 5.3% of the voters strongly favor the correct choices, far fewer voters (1.2%) favor the wrong ones. This explains why our method is able to predict the correct answer.
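The breakdown of voters by preference margin can be computed as below; the weight lists are hypothetical stand-ins for real ω values:

```python
def voter_breakdown(w_correct, w_wrong, margin=0.05):
    """Count voters strongly favoring A_C, strongly favoring A_W, and those
    whose weight difference stays below the margin (semantically neutral)."""
    favor_c = sum(1 for wc, ww in zip(w_correct, w_wrong) if wc - ww > margin)
    favor_w = sum(1 for wc, ww in zip(w_correct, w_wrong) if ww - wc > margin)
    neutral = len(w_correct) - favor_c - favor_w
    return favor_c, favor_w, neutral

# Hypothetical weights omega(S_i|A_C) and omega(S_i|A_W) for five voters.
print(voter_breakdown([0.8, 0.02, 0.01, 0.30, 0.00],
                      [0.1, 0.01, 0.02, 0.00, 0.00]))  # (2, 0, 3)
```

Even with most voters neutral, the decision hinges on the small imbalance between the two strongly-favoring groups, matching the 5.3% vs. 1.2% observation above.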
To help understand the relationship between voters and choices, Table 9 provides an instance with voters and their voting weights for the choices. We show four types of voters: favoring the correct choice, favoring the wrong choice, logically reasonable but not favoring either choice, and unreasonable and irrelevant to both choices. We can see that the last two types of voters can hardly affect the method's prediction, because their voting weights are much smaller than those of the first two types.

Table 9: An example of voters as well as their voting weights. A_C is the correct choice, while A_W is wrong. S_i refers to a voter.

Conclusion
We present a semantic-based question answering method, SEQA, which can answer commonsense questions more accurately and robustly in an unsupervised setting. Instead of directly scoring each answer choice, our method focuses on the probability of observing a choice's semantics. In the view of voting, SEQA first generates some plausible answers (voters) and then utilizes them to vote for the correct choice by considering the semantic similarity between each choice and each voter. Experiment results show that SEQA achieves the best performance on four datasets, and it is remarkably more robust than all the baselines when being attacked by TextFooler.

A Datasets
The four datasets used in this work are multiple-choice commonsense question answering tasks. COPA (Roemmele et al., 2011) evaluates the ability of causal reasoning about a certain event, which is expressed in a simple sentence. Each question is accompanied by two candidate choices.
StoryClozeTest (SCT) (Mostafazadeh et al., 2016) requires models to select the reasonable story ending from two alternatives, conditioned on a description of the story context.
SocialIQA (Sap et al., 2019b) evaluates the reasoning ability on social events. In each example, the question describes a social event and asks models to make inferences based on the event, such as its cause or effect.
CosmosQA (Huang et al., 2019) is a reading comprehension task. Different from the three datasets above, the examples of CosmosQA have long and complex contexts. The original dataset contains a choice type, "None of the above", to test whether models can identify unanswerable questions. This is not the focus of our work, so we removed such choices.
For COPA, we reported the results on its test set. As the test sets of SCT, SocialIQA and CosmosQA are hidden, for convenience of analysis, we reported the experiment results on their development sets. See Table 10 for the statistics of each dataset.

B Templates for Rewriting Questions
We use the same templates for our method and all the baselines. Note that the templates for rewriting questions are not the focus of this paper, and we inherit the templates from previous work where available. Tamborrino et al. (2020) provide templates for COPA (Table 11), and Shwartz et al. (2020) provide templates for SocialIQA (Table 12). Since the instances in SCT have no questions, SCT does not need templates. There is no related work discussing templates for CosmosQA, so we designed some templates ourselves (Table 13). Source code for rewriting questions and SEQA will be made publicly available.

C Selection of Pre-trained Models
For each method, we tried different pre-trained models and chose the ones that maximized the accuracy on the development set of each dataset. Table 14 shows the set of candidate pre-trained models for each method, with the selected models in bold. Because of the nature of Pro-Q, it can only use bidirectional language models, so we only evaluated Pro-Q with RoBERTa-large and SentenceRoBERTa-large.
As shown in Table 14, for each method except CGA, the best selection of pre-trained models is consistent on all the datasets. CGA achieved its best performance with COMET on SocialIQA and with GPT2-xlarge on the other datasets.

D Hyperparameter Search
For SEQA, we only tuned the temperature T. To be more specific, we selected T from five candidate values according to the accuracy on the development set of COPA. Table 15 shows that SEQA with T = 0.1 achieved the best performance on the development set of COPA. We then evaluated SEQA with T = 0.1 on the test set of COPA as well as the development sets of SCT, SocialIQA and CosmosQA.

Figures 5, 6 and 7 show the effect of the sample size K on SEQA. For comparison, these figures also include the results of baselines in the before-attack and after-attack settings. On the overall trend, the performance of SEQA improved as the sample size increased. Another observation is that a smaller sample size can already make SEQA outperform most baseline methods.

Table 12: [SUBJ] refers to a subject. There are two groups of templates, Rewrite1 for GPT-2 and Rewrite2 for COMET. The relations in Rewrite2 are defined in Sap et al. (2019a) and used for training COMET. These templates are inherited from Shwartz et al. (2020). More details can be found in Shwartz et al. (2020) and https://github.com/vered1986/self_talk.

Figure 5: The before-attack (a) and after-attack accuracy (b) of methods with different sample sizes on COPA. The after-attack accuracy of Pro-A, CGA and Self-Talk is below 10.0%, and thus omitted in (b).

Figure 6: The before-attack (a) and after-attack accuracy (b) of methods with different sample sizes on SocialIQA. The after-attack accuracy of Pro-A, CGA and Self-Talk is below 20.0%, and thus omitted in (b).

Figure 7: The before-attack (a) and after-attack accuracy (b) of methods with different sample sizes on CosmosQA. The after-attack accuracy of Pro-A, CGA and Self-Talk is below 2.0%, and thus omitted in (b).