A Mutual Information Maximization Approach for the Spurious Solution Problem in Weakly Supervised Question Answering

Weakly supervised question answering usually has only the final answers as supervision signals, while the correct solutions to derive the answers are not provided. This setting gives rise to the spurious solution problem: there may exist many spurious solutions that coincidentally derive the correct answer, but training on such solutions can hurt model performance (e.g., producing wrong solutions or answers). For example, for discrete reasoning tasks such as DROP, there may exist many equations to derive a numeric answer, and typically only one of them is correct. Previous learning methods mostly filter out spurious solutions with heuristics or model confidence, but do not explicitly exploit the semantic correlations between a question and its solution. In this paper, to alleviate the spurious solution problem, we propose to explicitly exploit such semantic correlations by maximizing the mutual information between question-answer pairs and predicted solutions. Extensive experiments on four question answering datasets show that our method significantly outperforms previous learning methods in terms of task performance and is more effective in training models to produce correct solutions.


Introduction
Weakly supervised question answering is a common setting of question answering (QA) where only final answers are provided as supervision signals while the correct solutions to derive them are not. This setting simplifies data collection, but exposes model learning to the spurious solution problem: there may exist many spurious ways to derive the correct answer, and training a model with spurious solutions can hurt model performance (e.g., misleading the model to produce unreasonable solutions or wrong answers).

Figure 1: Examples from three weakly supervised QA tasks, i.e., multi-mention reading comprehension, discrete reasoning, and semantic parsing. Spans in dark gray and green denote semantic correlations between a question and its solution, while spans in orange are spurious information and should not be used in a solution.

As shown in Fig 1,
for multi-mention reading comprehension, many mentions of an answer in the document(s) are irrelevant to the question; for discrete reasoning tasks or text2SQL tasks, an answer can be produced by the equations or SQL queries that do not correctly match the question in logic.
Some previous works heuristically selected one possible solution per question for training, e.g., the first answer span in the document (Joshi et al., 2017; Tay et al., 2018; Talmor and Berant, 2019); some treated all possible solutions equally and maximized the sum of their likelihoods (maximum marginal likelihood, or MML) (Swayamdipta et al., 2018; Clark and Gardner, 2018); many others selected solutions according to model confidence (Liang et al., 2018; Min et al., 2019), i.e., the likelihood of the solutions being derived by the model. A drawback of these methods is that they do not explicitly consider the mutual semantic correlations between a question and its solution when selecting solutions for training.
Intuitively speaking, a question often contains vital clues about how to derive the answer, and a wrong solution together with its context often fails to align well with the question. Take the discrete reasoning case in Fig 1 as an example. To answer the question, we need to know the start year of the Battle of Powder River, which is given by the first mention of 1876; the second 1876 is irrelevant, as it is the year of an event that happened during the battle.
To exploit the semantic correlations between a question and its solution, we propose to maximize the mutual information between question-answer pairs and model-predicted solutions. As demonstrated by Min et al. (2019), for many QA tasks it is feasible to precompute a modestly sized, task-specific set of possible solutions containing the correct one. We therefore focus on handling the spurious solution problem under this circumstance. Specifically, we pair a task-specific model with a question reconstructor and repeat the following training cycle (Fig 2): (1) sample a solution from the solution set according to model confidence and train the question reconstructor to reconstruct the question from that solution; then (2) train the task-specific model on the most likely solution according to the question reconstructor. During training, the question reconstructor guides the task-specific model to predict those solutions that are consistent with the questions. For the question reconstructor, we devise an effective and unified way to encode solutions in different tasks, so that solutions with subtle differences (e.g., different spans with the same surface form) can be easily discriminated.
Our contributions are as follows: (1) We propose a mutual information maximization approach for the spurious solution problem in weakly supervised QA, which exploits the semantic correlations between a question and its solution; (2) We conducted extensive experiments on four QA datasets. Our approach significantly outperforms strong baselines in terms of task performance and is more effective in training models to produce correct solutions.

Related Work
Question answering has attracted widespread attention and achieved great progress in recent years. Many challenging datasets have been constructed to advance models' reasoning abilities, such as (1) reading comprehension datasets with extractive answer spans (Joshi et al., 2017; Dhingra et al., 2017), with free-form answers (Kociský et al., 2018), for multi-hop reasoning (Yang et al., 2018), or for discrete reasoning over paragraphs (Dua et al., 2019), and (2) datasets for semantic parsing (Pasupat and Liang, 2015; Zhong et al., 2017; Yu et al., 2018). Under the weakly supervised setting, the specific solutions to derive the final answers (e.g., the correct location of an answer text, or the correct logic whose execution yields the answer) are not provided. This setting is worth exploring as it simplifies annotation and makes it easier to collect large-scale corpora. However, it introduces the spurious solution problem and thus complicates model learning.
Most existing approaches to this learning challenge include heuristically selecting one possible solution per question for training (Joshi et al., 2017; Tay et al., 2018; Talmor and Berant, 2019), training on all possible solutions with MML (Swayamdipta et al., 2018; Clark and Gardner, 2018), reinforcement learning (Liang et al., 2017, 2018), and hard EM (Min et al., 2019). All these approaches either use heuristics to select possibly reasonable solutions, rely on model architectures to bias towards correct solutions, or use model confidence to filter out spurious solutions in a soft or hard way. They do not explicitly exploit the semantic correlations between a question and its solution. Most relevantly, Cheng and Lapata (2018) focused on text2SQL tasks; they modeled SQL queries as latent variables for question generation, and maximized the evidence lower bound of the log likelihood of questions. A few works treated solution prediction and question generation as dual tasks and introduced dual learning losses to regularize learning under the fully supervised or semi-supervised setting (Tang et al., 2017; Cao et al., 2019; Ye et al., 2019). In dual learning, a model generates intermediate outputs (e.g., the task-specific model predicts solutions from a question) while the dual model gives feedback signals (e.g., the question reconstructor computes the likelihood of the question conditioned on predicted solutions). This method has three notable characteristics. First, both models need training on fully annotated data so that they can produce reasonable intermediate outputs. Second, the intermediate outputs can introduce noise during learning, as they are sampled from models and not restricted to solutions with the correct answer or to valid questions. Third, this method typically updates both models with reinforcement learning, while the rewards provided by a dual model can be unstable or of high variance.
By contrast, we focus on the spurious solution problem under the weakly supervised setting and propose a mutual information maximization approach. Solutions used for training are restricted to those that derive the correct answer. What's more, though the task-specific model and the question reconstructor interact with each other, they do not use the likelihood from each other as rewards, which stabilizes learning.

Task Definition
For a QA task, each instance is a tuple ⟨d, q, a⟩, where q denotes a question, a is the answer, and d is reference information such as documents for reading comprehension, or table headers for semantic parsing. A solution z is a task-specific derivation of the answer, e.g., a particular span in a document, an equation, or a SQL query (as shown in Fig 1). Let f(·) be the task-specific function that maps a solution to its execution result, e.g., by returning a particular span, solving an equation, or executing a SQL query. Our goal is to train a task-specific model P_θ(z|d, q) that takes ⟨d, q⟩ as input and predicts a solution z satisfying f(z) = a.
Under the weakly supervised setting, only the answer a is provided for training while the ground-truth solution z̄ is not. We denote the set of possible solutions as Z = {z | f(z) = a}. In cases where the search space of solutions is large, we can usually approximate Z so that it contains the ground-truth solution z̄ with a high probability (Min et al., 2019). Note that Z is task-specific and will be instantiated in section 4. During training, we pair the task-specific model P_θ(z|d, q) with a question reconstructor P_φ(q|d, z) and maximize the mutual information between ⟨q, a⟩ and z. At test time, given ⟨d, q⟩, we use the task-specific model to predict a solution and return its execution result.
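Concretely, since every z ∈ Z already satisfies f(z) = a, the model's confidence restricted to Z is simply its prediction probability renormalized over the solution set. A minimal sketch (the function and variable names are our own, not the authors'):

```python
import math

def posterior_over_solutions(log_p_z):
    """Renormalize model log-probabilities log P(z|d,q) over the
    precomputed solution set Z, where f(z) = a already holds for
    every z in Z. Returns a probability distribution over Z."""
    m = max(log_p_z)                       # subtract max for stability
    exps = [math.exp(lp - m) for lp in log_p_z]
    total = sum(exps)
    return [e / total for e in exps]

# e.g., three candidate solutions with log-probabilities
post = posterior_over_solutions([-1.0, -2.0, -3.0])
```

The result sums to one over Z, so it can be used directly as a sampling distribution during training.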

Learning Method
Given an instance ⟨d, q, a⟩, the solution set Z usually contains only one solution that best fits the instance while the rest are spurious. We propose to exploit the semantic correlations between a question and its solution to alleviate the spurious solution problem via mutual information maximization.

A Case of Discrete Reasoning over Paragraphs (figure example): Question q = "How many years after the Battle of Powder River did Powerville Montana become the first establishment in the county?" Answer a = "2" Paragraph d = "… On March 17, ①1876, the Battle of Powder River occurred in the south-central part of the county ... In June ②1876 six companies of … On November 1, ③1878, Powderville, Montana became the first establishment in the county…"
Our objective is to obtain the optimal task-specific model θ* that maximizes the following conditional mutual information:

θ* = argmax_θ I_θ(⟨q, a⟩; z | d) = argmax_θ [H(⟨q, a⟩ | d) − H_θ(⟨q, a⟩ | d, z)]   (1)

where I_θ(⟨q, a⟩; z|d) denotes the conditional mutual information between ⟨q, a⟩ and z over P(d, q, a)P_θ(z|d, q, a), and H(·|·) is the conditional entropy of random variable(s). P(d, q, a) is the probability of an instance from the training distribution. P_θ(z|d, q, a) is the posterior prediction probability of z (∈ Z), which is the prediction probability P_θ(z|d, q) normalized over Z:

P_θ(z|d, q, a) = P_θ(z|d, q) / Σ_{z′∈Z} P_θ(z′|d, q),   z ∈ Z   (2)

Note that computing P_θ(q, a|d, z) is intractable. We therefore introduce a question reconstructor P_φ(q|d, z) and approximate P_θ(q, a|d, z) with I(f(z) = a)P_φ(q|d, z), where I(·) denotes the indicator function. Since H(⟨q, a⟩|d) does not depend on θ, Eq. 1 now becomes:

L_1(θ) = E_{P(d,q,a)} E_{P_θ(z|d,q,a)} [log P_φ(q|d, z)]   (3)

with the question reconstructor fit by minimizing the corresponding cross-entropy L_2(φ) = −E_{P(d,q,a)} E_{P_θ(z|d,q,a)} [log P_φ(q|d, z)] w.r.t. φ. To optimize Eq. 3, we repeat the following training cycle, which is analogous to the EM algorithm:

1. Minimize L_2 w.r.t. the question reconstructor φ to draw P_φ(q|d, z) close to P_θ(q, a|d, z): sample a solution z′ ∈ Z according to its posterior prediction probability P_θ(z|d, q, a) (see Eq. 2) and maximize log P_φ(q|d, z′).

2. Maximize L_1 w.r.t. the task-specific model θ. L_1 can be seen as a reinforcement learning objective with log P_φ(q|d, z) being the reward function. During training, the reward function is dynamically changing and may be of high variance. As we can compute the reward for all z ∈ Z, we adopt a greedy but more stable update method, i.e., maximize log P_θ(z′|d, q) where z′ = argmax_{z∈Z} log P_φ(q|d, z) is the best solution according to the question reconstructor.
We illustrate the above training cycle in Fig 2.
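The two-step cycle can be outlined in code. This is a simplified sketch, not the authors' implementation: `task_model` and `reconstructor` with `posterior`, `train_step`, and `log_likelihood` methods are hypothetical interfaces standing in for the real models.

```python
import random

def training_cycle(task_model, reconstructor, batch):
    """One EM-like cycle of the learning method. Each example carries
    reference info d, question q, and the precomputed solution set
    Z = {z | f(z) = a}."""
    for d, q, Z in batch:
        # Step 1: sample z* from the posterior P_theta(z|d,q,a), i.e. the
        # model's confidence renormalized over Z, and train the question
        # reconstructor to maximize log P_phi(q|d,z*).
        probs = random.choices(Z, weights=task_model.posterior(d, q, Z), k=1)
        z_star = probs[0]
        reconstructor.train_step(d, z_star, target=q)

        # Step 2: train the task-specific model on the solution the
        # reconstructor finds most consistent with the question
        # (greedy update instead of a high-variance RL update).
        z_best = max(Z, key=lambda z: reconstructor.log_likelihood(q, d, z))
        task_model.train_step(d, q, target=z_best)
```

Because the reward log P_φ(q|d, z) can be computed for every z ∈ Z, step 2 takes the argmax rather than a sampled policy-gradient update.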

Question Reconstructor
The question reconstructor P φ (q|d, z) takes reference information d and a solution z as input, and reconstructs the question q. We use BART base , a pre-trained Seq2Seq model, as the question reconstructor so that semantic correlations between questions and solutions can be better captured.
A solution typically consists of task-specific operation token(s) (e.g., COUNT for discrete reasoning or semantic parsing), literal(s) (e.g., numeric constants for discrete reasoning or semantic parsing), or span(s) from a question or reference information (e.g., for most QA tasks). It is problematic to simply feed the concatenation of d and the surface form of z into the BART encoder: different spans with the same surface form could no longer be discriminated, as their contextual semantics are lost. To effectively encode d and z, we devise a unified solution encoding (Fig 3) that is applicable to solutions of various types. Specifically, we leave most of the surface form of z unchanged, except that we replace any span from reference information with a placeholder token `span`. The representation of `span` is computed by forcing it to attend only to the contextual representation(s) of the referred span. To obtain disentangled and robust representations of reference information and a solution, we keep reference information and the solution (except for the token `span`) from attending to each other. Intuitively, the semantics of reference information should not be affected by a solution, and the representations of a solution should be largely determined by its internal logic.
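The attention constraints described above can be sketched as a boolean mask over the concatenated input [d; z]. This is our own simplified layout for illustration; the paper's exact implementation inside BART may differ.

```python
def solution_attention_mask(n_d, z_tokens):
    """Build an attention mask for the concatenation of n_d reference
    tokens and the solution tokens. z_tokens is a list in which ordinary
    solution tokens are None and each `span` placeholder is an (s, e)
    pair referring to positions s..e of d. mask[i][j] = True means
    position i may attend to position j."""
    n = n_d + len(z_tokens)
    mask = [[False] * n for _ in range(n)]
    for i in range(n_d):              # reference tokens attend only among
        for j in range(n_d):          # themselves: their semantics should
            mask[i][j] = True         # not be affected by the solution
    for i, ref in enumerate(z_tokens):
        row = n_d + i
        if ref is None:
            for j in range(n_d, n):   # ordinary solution tokens attend only
                mask[row][j] = True   # to the solution (its internal logic)
        else:
            s, e = ref                # a `span` placeholder attends only to
            for j in range(s, e + 1): # the referred span of d, so identical
                mask[row][j] = True   # surface forms stay distinguishable
    return mask

# 5 reference tokens; solution = [op, span->(1,2), op]
m = solution_attention_mask(5, [None, (1, 2), None])
```

With this mask, two `span` placeholders that refer to different occurrences of the same text receive different contextual representations.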

Solution Set
While our learning method and question reconstructor are task-agnostic, solutions are usually task-specific. Precomputing solution sets requires formal definitions of solutions, which determine the search space. A possible search method is to exhaustively enumerate all solutions that produce the correct answer. We will introduce the definitions of solutions for different tasks in section 4.

Following Min et al. (2019), we conducted experiments on three QA tasks, namely multi-mention reading comprehension, discrete reasoning over paragraphs, and semantic parsing. This section introduces the baselines, the definitions of solutions in different tasks, how the solution set can be precomputed, and our experimental results. Statistics of the datasets we used are presented in Table 1.

Experiments
For convenience, we denote reference information as d = [d_1, d_2, ..., d_{|d|}] and a question as q = [q_1, q_2, ..., q_{|q|}], where d_i and q_j are tokens of d and q respectively. A span from reference information and a question span are represented as (s, e)_d and (s, e)_q respectively, where s and e are the start and end indices of the span.

Baselines
First Only (Joshi et al., 2017): trains a reading comprehension model by maximizing log P_θ(z|d, q) where z is the first answer span in d.

MML (Min et al., 2019): maximizes log Σ_{z∈Z} P_θ(z|d, q).
HardEM (Min et al., 2019): maximizes log max_{z∈Z} P_θ(z|d, q).

HardEM-thres: a variant of HardEM that optimizes only on confident solutions, i.e., maximizes max_{z∈Z} I(P_θ(z|d, q) > γ) log P_θ(z|d, q), where γ is an exponentially decaying threshold. γ is initialized such that the model is trained on no less than half of the training data in the first epoch, and is halved after each epoch.

VAE (Cheng and Lapata, 2018): a method that views a solution as the latent variable for question generation and adopts the training objective of the Variational Auto-Encoder (VAE) (Kingma and Welling, 2014) to regularize the task-specific model. The overall training objective is given by:

L(θ, φ) = L_mle(θ) + λ L_vae(θ, φ)

where θ denotes a task-specific model and φ is our question reconstructor. L_mle(θ) is the total log likelihood of the set of model-predicted solutions (denoted by B) that derive the correct answer. L_vae(θ, φ) is the evidence lower bound of the log likelihood of questions, and λ is the coefficient of L_vae(θ, φ). This method needs pre-training of both θ and φ before optimizing the overall objective L(θ, φ). Notably, model θ optimizes L_vae(θ, φ) via reinforcement learning. We tried stabilizing training by reducing the variance of rewards and setting a small λ.
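For concreteness, the per-example losses of MML, HardEM, and HardEM-thres can be written down from the per-solution log-probabilities. A minimal sketch (function names are ours):

```python
import math

def mml_loss(log_probs):
    """MML: -log sum_{z in Z} P(z|d,q), computed via log-sum-exp."""
    m = max(log_probs)
    return -(m + math.log(sum(math.exp(lp - m) for lp in log_probs)))

def hard_em_loss(log_probs):
    """HardEM: -log max_{z in Z} P(z|d,q)."""
    return -max(log_probs)

def hard_em_thres_loss(log_probs, gamma):
    """HardEM-thres: train only on the most confident solution, and
    only if its probability exceeds the decaying threshold gamma."""
    best = max(log_probs)
    return -best if math.exp(best) > gamma else 0.0

lps = [math.log(0.6), math.log(0.3), math.log(0.1)]
```

MML spreads probability mass over all of Z (including spurious solutions), while the HardEM variants commit to the single most confident one.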

Multi-Mention Reading Comprehension
Multi-mention reading comprehension is a natural feature of many QA tasks. Given a document d and a question q, a task-specific model is required to locate the answer text a, which is usually mentioned many times in the document(s). A solution is defined as a document span, and the solution set Z is computed by finding exact matches of a:

Z = {(s, e)_d | [d_s, ..., d_e] = a}

We experimented on two open-domain QA datasets, i.e., Quasar-T (Dhingra et al., 2017) and WebQuestions (Berant et al., 2013). For Quasar-T, we retrieved 50 reference sentences from ClueWeb09 for each question; for WebQuestions, we used the 2016-12-21 dump of Wikipedia as the knowledge source and retrieved 50 reference paragraphs for each question using a Lucene index system. We used the same BERT base (Devlin et al., 2019) reading comprehension model and data preprocessing as Min et al. (2019).

Results: Our method outperforms all baselines on both datasets (Table 2). The improvements can be attributed to the effectiveness of our solution encoding, as solutions for this task are typically different spans with the same surface form, e.g., in Quasar-T, all z ∈ Z share the same surface form.
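Precomputing Z by exact match can be sketched at the token level as follows (a simplified version; the actual preprocessing follows Min et al. (2019) and also handles casing and tokenization):

```python
def find_answer_spans(doc_tokens, answer_tokens):
    """Enumerate all document spans whose surface form exactly matches
    the answer: the solution set Z for multi-mention RC. Spans are
    returned as inclusive (start, end) token indices."""
    n, k = len(doc_tokens), len(answer_tokens)
    return [(s, s + k - 1) for s in range(n - k + 1)
            if doc_tokens[s:s + k] == answer_tokens]

doc = "the battle began in 1876 and ended in 1877 not 1876".split()
spans = find_answer_spans(doc, ["1876"])  # two mentions of the answer
```

Every returned span executes to the same answer text, which is exactly why model confidence alone cannot tell the relevant mention from the spurious ones.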

Discrete Reasoning over Paragraphs
Some reading comprehension tasks pose the challenge of comprehensive analysis of texts by requiring discrete reasoning (e.g., arithmetic calculation, sorting, and counting) (Dua et al., 2019). In this task, given a paragraph d and a question q, the answer a can be one of four types: a numeric value, a paragraph span or a question span, a sequence of paragraph spans, or a date from the paragraph. The definitions of z depend on the answer type (Table 4), and such solutions can be found by exhaustive search following prior work. Note that some solutions involve numbers in d; we treated those numbers as spans while reconstructing q from z.

We experimented on DROP (Dua et al., 2019). As the original test set is hidden, for convenience of analysis, we used the public development set as our test set, and split the public train set into 90%/10% for training and development. We used Neural Symbolic Reader (NeRd) as the task-specific model. NeRd is a Seq2Seq model which encodes a question and a paragraph, and decodes a solution (e.g., count(paragraph_span(s_1, e_1), paragraph_span(s_2, e_2)), where paragraph_span(s_i, e_i) denotes a paragraph span starting at s_i and ending at e_i). We used the precomputed solution sets provided by prior work.¹ Data preprocessing was also kept the same.

Table 3: Evaluation on DROP. We used the public development set of DROP as our test set. We also provide a performance breakdown of different question types on our test set. Results on the overall test set marked with ‡ are significantly worse than the best one (t-test, p-value < 0.05).

Results: As shown in Table 3, our method significantly outperforms all baselines in terms of F1 score on our test set.

¹ Our implementation of NeRd has four major differences from the original: (1) Instead of choosing BERT large as the encoder, we chose the discriminator of Electra base (Clark et al., 2020), which is of a smaller size. (2) We did not use moving averages of trained parameters. (3) We did not use the full public train set for training but used 10% of it for development. (4) For some questions, it is hard to guarantee that a precomputed solution set covers the ground-truth solution; for example, the question "How many touchdowns did Brady throw?" needs counting, but the related mentions are not known. Prior work partly addressed this problem by adding model-predicted solutions (with the correct answer) into the initial solution sets as learning proceeds. In this paper, we kept the initial solution sets unchanged during training, so that different QA tasks share the same experimental setting.
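To illustrate what executing a solution f(z) means for this task, here is a toy executor for a few NeRd-style operations. This is illustrative only; the real NeRd program language and operator set are richer.

```python
def execute(solution, paragraph_tokens):
    """Execute a nested-tuple solution against a tokenized paragraph.
    A solution is e.g. ("count", span1, span2) or ("diff", n1, n2),
    where a span is ("span", s, e) referring to inclusive paragraph
    positions s..e."""
    op = solution[0]
    if op == "span":                 # return the referred text
        _, s, e = solution
        return " ".join(paragraph_tokens[s:e + 1])
    if op == "count":                # count the listed spans
        return len(solution) - 1
    if op == "diff":                 # arithmetic, e.g. "how many years after"
        return solution[1] - solution[2]
    raise ValueError(f"unknown op {op}")

toks = "On March 17 , 1876 , the battle occurred".split()
```

For a numeric answer like "2", many distinct programs (different counts, different subtractions) can execute to the same value, which is the source of spuriousness here.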
We also compared our method with the baseline VAE, which uses a question reconstructor φ to adjust the task-specific model θ by maximizing a variational lower bound of log P(q|d) as the regularization term L_vae(θ, φ). To pre-train the task-specific model for this method, we simply took the best task-specific model trained with HardEM-thres. VAE optimizes the task-specific model on L_vae(θ, φ) with reinforcement learning, where P_φ(q|d, z) is used as the learning signal for the task-specific model. Despite our efforts to stabilize training, the F1 score still dropped to 36.28 after optimizing the overall objective L(θ, φ) for 1,000 steps. By contrast, our method does not use P_φ(q|d, z) to compute learning signals for the task-specific model but rather uses it to select solutions on which to train the task-specific model, which makes better use of the question reconstructor.

Semantic Parsing
Text2SQL is a popular semantic parsing task. Given a question q and a table header d = [h_1, ..., h_L], where h_l is a (possibly multi-token) column name, a parser is required to parse q into a SQL query z and return the execution results. Under the weakly supervised setting, only the final answer is provided while the SQL query is not. Following Min et al. (2019), Z is approximated as the set of non-nested SQL queries with no more than three conditions:

Z = {z = (z_agg, z_sel, (z_cond_k)^3_{k=1}) | f(z) = a, z_sel ∈ {h_1, ..., h_L}, z_cond_k ∈ {none} ∪ C, z_agg ∈ {none, sum, mean, max, min, count}}
where z_agg is an aggregation operator, z_sel is the column it operates on (a span of d), and C = {(h, o, v)} is the set of all possible conditions, where h is a column, o ∈ {=, <, >}, and v is a question span.
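The approximation of Z above can be sketched as a brute-force enumeration over aggregators, selected columns, and condition subsets. This is a simplified sketch; `execute` is a hypothetical callback that runs a candidate query against the table.

```python
from itertools import combinations, product

AGGS = ["none", "sum", "mean", "max", "min", "count"]

def enumerate_solutions(headers, conditions, execute, answer, max_conds=3):
    """Approximate the solution set Z for weakly supervised text2SQL:
    enumerate non-nested queries z = (z_agg, z_sel, conds) with at most
    max_conds conditions, keeping those whose execution result equals
    the answer."""
    Z = []
    for agg, sel in product(AGGS, headers):
        for k in range(max_conds + 1):
            for conds in combinations(conditions, k):
                z = (agg, sel, conds)
                if execute(z) == answer:
                    Z.append(z)
    return Z
```

Because many syntactically different queries execute to the same cell value, |Z| can grow into the hundreds or thousands per question, as reported later in the analysis.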
We experimented on WikiSQL (Zhong et al., 2017) under the weakly supervised setting 2 . We chose SQLova (Hwang et al., 2019), a competitive text2SQL parser on WikiSQL, as the task-specific model. Hyperparameters were kept the same as in (Hwang et al., 2019). We used the solution sets provided by Min et al. (2019). Results: None of the models in Table 5 apply execution-guided decoding during inference. Our method achieves new state-of-the-art results under the weakly supervised setting. Though trained without supervision from ground-truth solutions, our execution accuracy (i.e., accuracy of execution results) on the test set is close to that of the fully supervised SQLova. Notably, GRAPPA focused on representation learning and used a stronger task-specific model, while we focus on the learning method and outperform GRAPPA with a weaker model.

Performance on Test Data with Different Size of Solution Set

Effect of |Z| at Training
The more complex a question is, the larger the set of possible solutions tends to be, and the more likely a model is to suffer from the spurious solution problem. We therefore investigated whether our learning method can deal with extremely noisy solution sets. Specifically, we extracted a hard train set from the original train set of WikiSQL, consisting of the 10K training examples with the largest Z. The average size of Z on the hard train set is 1,554.6, much larger than that of the original train set (315.4). We then compared models trained on the original train set and the hard train set using different learning methods.

Table 5: Evaluation on WikiSQL. Accuracy that is significantly lower than the highest one is marked with † for p-value < 0.1, and ‡ for p-value < 0.05 (t-test).

As shown in Fig 5, models trained with our method consistently outperform baselines in terms of logical form accuracy (i.e., accuracy of predicted solutions) and execution accuracy. When using the hard train set, the logical form accuracy of models trained with HardEM or HardEM-thres drops below 14%. Compared with HardEM, HardEM-thres is better when trained on the original train set but worse when trained on the hard train set. These results indicate that model confidence can be unreliable and is thus insufficient to filter out spurious solutions. By contrast, our method explicitly exploits the semantic correlations between a question and a solution, and is thus much more resistant to spurious solutions.

Effect of the Question Reconstructor
As we used BART base as the question reconstructor, we investigated how our question reconstructor contributes to the performance improvements. We first investigated whether BART base itself is less affected by the spurious solution problem than the task-specific models. Specifically, we viewed text2SQL as a sequence generation task and fine-tuned a BART base on the hard train set of WikiSQL with HardEM. The input of BART shares the same format as that of SQLova, i.e., the concatenation of a question and a table header; the output of BART is a SQL query. Without constraints on decoding, BART might not produce valid SQL queries. We therefore evaluated models on a SQL selection task instead: for each question in the development set of WikiSQL, a model picks out the correct SQL from at most 10 candidates by selecting the one with the highest prediction probability. As shown in Table 6, when trained with HardEM, the BART base parser and SQLova perform similarly, and both underperform our method by a large margin. This indicates that using BART base as the task-specific model cannot avoid the spurious solution problem; it is our mutual information maximization objective that makes the difference. We further investigated the effect of the choice of question reconstructor. We compared BART base with two alternatives: (1) T-scratch, a three-layer Transformer (Vaswani et al., 2017) without pre-training, and (2) T-DAE, a three-layer Transformer pre-trained as a denoising auto-encoder of questions on the train set, using the text infilling pre-training task of BART. As shown in Table 7, our method with any of the three question reconstructors outperforms or is at least competitive with the baselines, which verifies the effectiveness of our mutual information maximization objective.
What's more, using T-DAE is competitive with BART base, indicating that our training objective is compatible with choices of question reconstructor other than BART, and that initializing the question reconstructor as a denoising auto-encoder may help exploit the semantic correlations between a question and its solution.
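The SQL selection evaluation used above can be sketched as follows (our own simplified version; `score` is a hypothetical callback returning a model's probability, or log-probability, of a candidate query):

```python
def sql_selection_accuracy(examples, score):
    """For each question, pick the candidate SQL with the highest model
    score and check it against the gold query. examples is a list of
    (question, candidates, gold) triples."""
    correct = 0
    for question, candidates, gold in examples:
        pred = max(candidates, key=lambda c: score(question, c))
        correct += (pred == gold)
    return correct / len(examples)
```

Ranking a fixed candidate list sidesteps the problem that an unconstrained seq2seq decoder may emit invalid SQL, so models with different decoders can be compared fairly.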

Evaluation of Solution Prediction
As solutions with correct answer can be spurious, we further analyzed the quality of predicted solutions. We randomly sampled 50 test examples from DROP for which our method produced the correct answer, and found that our method also produced the correct solution for 92% of them.
To investigate the effect of different learning methods on models' ability to produce correct solutions, we manually analyzed another 50 test samples for which HardEM, HardEM-thres, and our method produced the correct answer with different solutions. The percentage of samples for which our method produced the correct solution is 58%, much higher than that of HardEM (10%) and HardEM-thres (30%). For experimental details, please refer to the appendix.

Case Study
Fig 6 compares NeRd predictions on four types of questions from DROP when using different learning methods. An observation is that NeRd using our method shows more comprehensive understanding of questions, e.g., in the Arithmetic case, NeRd using our method is aware of the two key elements in the question including the year when missionaries arrived in Ayutthaya and the year when the Seminary of Saint Joseph was built, while NeRd using HardEM-thres misses the first element. What's more, NeRd using our method is more precise in locating relevant information, e.g., in the first Sorting case, NeRd with our method locates the second appearance of 2 whose contextual semantics matches the question, while NeRd using HardEM-thres locates the first appearance of 2 which is irrelevant.

Span(s)
Question: Which team attempted a 2-point conversion? Answer: Rams Paragraph: Hoping to rebound from their road loss to the Patriots, the ①Rams went home for a Week 9 NFC West duel with the Arizona Cardinals … In the second quarter, the Cardinals responded with a vengeance as safety Antrel Rolle returned an interception 40 yards for a touchdown, kicker Neil Rackers got a 36-yard field goal, RB Tim Hightower got a 30-yard TD run, and former ②Rams QB Kurt Warner completed a 56-yard TD pass to WR Jerheme Urban. In the third quarter, Arizona increased its lead as Warner completed a 7-yard TD pass to WR Anquan Boldin. In the fourth quarter, the ③Rams tried to come back as Bulger completed a 3-yard TD pass to WR Torry Holt (with a failed 2-point conversion). However, the Cardinals flew away as Rackers nailed a 30-yard field goal. During the game, the ④Rams inducted former Head Coach Dick Vermeil (who helped the franchise win Super Bowl XXXIV) onto the ⑤ Rams Ring of Honor.

Sorting
Question: How many yards was the shortest touchdown pass? Answer: 2 Paragraph: The Giants played their Week ①2 home opener against the Green Bay Packers … The Giants responded with a 26-yard scoring strike by Eli Manning to Plaxico Burress. The Giants got a Lawrence Tynes field goal and a 10-7 half time lead. In the second half, the Packers drove 51 yards to start the second half. Favre capped off the scoring drive with a ②2-yard pass to Bubba Franks for a 14-10 lead the Packers would not relinquish… Model Prediction: Ours: min{②2} ✓

Figure 6: NeRd predictions on four types of questions from DROP when using different learning methods. Spans in dark gray and green denote semantic correlations between a question and its solution, while spans in orange are spurious information and should not be used in a solution.

Conclusion
To alleviate the spurious solution problem in weakly supervised QA, we propose to explicitly exploit the semantic correlations between a question and its solution via mutual information maximization. During training, we pair a task-specific model with a question reconstructor which guides the task-specific model to predict solutions that are consistent with the questions. Experiments on four QA datasets demonstrate the effectiveness of our learning method. As shown by automatic and manual analyses, models trained with our method are more resistant to spurious solutions during training, and are more precise in locating information that is relevant to the questions during inference, leading to higher accuracy of both answers and solutions.

Acknowledgements
This work was partly supported by the NSFC projects (Key project with No. 61936010 and regular project with No. 61876096). This work was also supported by the Guoqiang Institute of Tsinghua University, with Grant No. 2019GQG1 and 2020GQG0005.

A.1 Baselines

HardEM: We followed Min et al. (2019) to apply annealing to HardEM on reading comprehension tasks: at training step t, the model optimizes the MML objective with probability min(t/τ, 0.8) and optimizes the HardEM objective otherwise. τ was chosen from {10K, 20K, 30K, 40K, 50K} based on model performance on the development set.

HardEM-thres: We set the confidence threshold as γ = 0.5^n, where n was initialized as follows: we first computed the prediction probability of each solution with a task-specific model, and then set n to a value such that the model was trained on no less than half of the training data in the first epoch. We halved γ after each epoch.

VAE (Cheng and Lapata, 2018): A method that views a solution as the latent variable for question generation and adopts the training objective of the Variational Auto-Encoder (VAE) to regularize the task-specific model. The overall training objective is given by:

L(θ, φ) = L_mle(θ) + λ L_vae(θ, φ)

where L_mle(θ) is the total log likelihood of the set of model-predicted solutions (denoted by B) with the correct answer, L_vae(θ, φ) is the evidence lower bound of the log likelihood of questions, and λ is the coefficient of L_vae(θ, φ). The optimization process is divided into three stages: (1) the 1st stage pre-trains a task-specific model θ with HardEM-thres on the solution sets; (2) the 2nd stage pairs the task-specific model with our question reconstructor φ to optimize L(θ, φ) for one epoch, except that L_vae(θ, φ) is used only to pre-train φ and is kept from back-propagating to θ; (3) the 3rd stage optimizes L(θ, φ) while allowing L_vae(θ, φ) to back-propagate to θ. The gradient of L_vae(θ, φ) w.r.t. θ is estimated with REINFORCE:

∇_θ L_vae(θ, φ) ≈ E_{z∼P_θ(z|d,q)}[R(z) ∇_θ log P_θ(z|d, q)]

where R is the reward function.
To stabilize training, we use the average reward of 5 sampled solutions as a baseline b and re-define the reward function as R' = R − b. λ is set to 0.1. In section 4.3, we report performance of the best model in the 3rd stage. At the 2nd stage, as the task-specific model was optimized on correct solutions and spurious solutions equally, the F1 score dropped from 72.35 to 67.93 by the end of this stage, indicating that the correctness of training solutions is vital for generalization. At the 3rd stage, model learning was further regularized with L_vae(θ, φ), which was optimized via reinforcement learning. Despite our efforts to stabilize training, the F1 score still dropped to 36.28 after training for 1,000 steps in the 3rd stage.
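The baseline subtraction used here amounts to centering the sampled rewards around their mean (a minimal sketch; the real rewards come from the question reconstructor, while the values below are placeholders):

```python
def baseline_subtracted_rewards(rewards):
    """Variance reduction for REINFORCE: the average reward of the
    sampled solutions (5 in our runs) serves as the baseline b, and
    each reward R is re-defined as R' = R - b."""
    b = sum(rewards) / len(rewards)
    return [r - b for r in rewards]
```

Centering makes above-average solutions receive positive learning signal and below-average ones negative signal, regardless of the absolute reward scale.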

A.2 Experimental Settings
For all experiments, we used previously proposed task-specific models and optimized them with their original optimizers. We chose the best task-specific model according to its performance on the development set. As for our learning method, we used BART base as the question reconstructor. The AdamW optimizer (Loshchilov and Hutter, 2019) was used to update the question reconstructor with the learning rate set to 5e-5.

A.2.1 Multi-mention Reading Comprehension
We adopted the reading comprehension model, data preprocessing, and training configurations from Min et al. (2019). Task-specific Model: The model is based on the uncased version of BERT base, which takes as input the concatenation of a question and a paragraph, and outputs the probability distributions of the start and end positions of the answer span. To deal with multi-paragraph reading comprehension, it also trains a paragraph selector; during inference, it outputs a span from the paragraph ranked 1st. Data Preprocessing: Documents are split into segments of up to 300 tokens. For Quasar-T, as retrieved sentences are short, we concatenated all sentences into one document in decreasing order of retrieval score (i.e., relevance to the question); for WebQuestions, we concatenated every 5 retrieved paragraphs into one document, resulting in 10 reference documents per question.
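The segment splitting above can be sketched as follows (a minimal sketch of the preprocessing step; the original implementation follows Min et al. (2019), which may additionally use overlapping windows):

```python
def split_into_segments(tokens, max_len=300):
    """Split a tokenized document into consecutive, non-overlapping
    segments of at most max_len tokens each."""
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]
```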
Training: Batch size is 20. BertAdam optimizer was used to update the reading comprehension model with learning rate set to 5e-5. The number of training epochs is 10.

A.2.2 Discrete Reasoning over Paragraphs
We used NeRd for discrete reasoning. The major differences with its original implementation have been discussed in section 4.3. Task-specific Model: NeRd comes with a domain-specific language designed for discrete reasoning on DROP. The definitions of solutions for discrete reasoning introduced in section 4.3 are also expressed in this language, except that we use different symbols (e.g., the minus sign "-" in our definitions has the same meaning as the symbol "DIFF" in the NeRd paper). NeRd is a Seq2Seq model which takes as input the concatenation of a question and a paragraph, and generates the solution as a sequence. The answer is obtained by executing the solution. Data Preprocessing: The input of the task-specific model is truncated to up to 512 words. We used the provided solution sets, which cover 93.2% of examples in the train set.
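To make "executing the solution" concrete, here is a toy executor for arithmetic expressions in a NeRd-style language (an illustrative sketch only: the real DSL also contains span-extraction and counting operations, and `SUM`/`DIFF` here merely mirror the symbols discussed above):

```python
def execute(solution):
    """Recursively evaluate a solution represented as nested tuples of
    (op, *args), where leaves are numbers; returns the final answer."""
    if isinstance(solution, (int, float)):
        return solution
    op, *args = solution
    vals = [execute(a) for a in args]
    if op == "SUM":
        return sum(vals)
    if op == "DIFF":
        return vals[0] - vals[1]
    raise ValueError(f"unknown op: {op}")
```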
Training: Batch size is 32. Adam optimizer (Kingma and Ba, 2015) was used to update NeRd with the learning rate set to 5e-5. The number of training epochs is 20.

A.2.3 Semantic Parsing
Task-specific Model: SQLova encodes the concatenation of a question and a table header with uncased BERT base, and outputs a SQL query via slot filling with an NL2SQL (natural language to SQL) layer. Data Preprocessing: Data preprocessing was kept the same as in Min et al. (2019). We also used the solution sets provided by Min et al. (2019).

B.1 SQL Selection Task
We defined a SQL selection task on the development set of WikiSQL. Specifically, for each question, we randomly sampled min(10, |Z|) solution candidates from the solution set Z without replacement while ensuring the ground-truth solution was one of the candidates. A model was required to pick out the ground-truth solution by selecting the candidate with the highest prediction probability. In section 5.3, we only show model accuracy in the first 10 training epochs because for BART base w/ HardEM, SQLova w/ HardEM, and SQLova w/ Ours, model confidence (computed as the average log likelihood of selected SQLs) showed a downward trend after the 2nd, 4th, and ≥10th epoch, respectively.
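The selection rule can be sketched as follows (an illustrative sketch: `select_solution` and its inputs are made up for exposition; in the actual task the token log-probabilities come from the trained model):

```python
def select_solution(candidates, token_logprobs):
    """Pick the candidate SQL with the highest prediction probability,
    scored here as the average token log-likelihood (the confidence
    measure used in our analysis)."""
    scores = [sum(lp) / len(lp) for lp in token_logprobs]
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best], scores[best]
```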

B.2 Choice of Question Reconstructor
We investigated how the choice of the question reconstructor affects results. One alternative is a Transformer pre-trained as a denoising autoencoder of questions on the train set. This question reconstructor has the same architecture as BART base except that the encoder and the decoder have 3 layers each. We pre-trained the question reconstructor for one epoch to reconstruct original questions from corrupted ones. For 50% of the time, the input question is the original question; otherwise, we followed Lewis et al. (2020) to corrupt the original question by randomly masking a number of text spans with span lengths drawn from a Poisson distribution (λ = 3). Batch size is 4. AdamW optimizer was used with the learning rate set to 5e-5.
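The span-corruption step can be sketched as follows (a minimal sketch, not the original implementation: the mask token, the number of spans, and the Poisson sampler via Knuth's method are illustrative choices; the real noising follows Lewis et al. (2020)):

```python
import math
import random

def corrupt_question(tokens, mask_token="<mask>", n_spans=2, lam=3, rng=None):
    """Mask n_spans random text spans, each replaced by a single mask
    token, with span lengths drawn from a Poisson(lam) distribution."""
    rng = rng or random.Random(0)
    tokens = list(tokens)
    for _ in range(n_spans):
        if not tokens:
            break
        # Draw a Poisson(lam) sample via Knuth's multiplication method
        limit, k, p = math.exp(-lam), 0, 1.0
        while p > limit:
            k += 1
            p *= rng.random()
        span_len = max(1, k - 1)
        start = rng.randrange(len(tokens))
        tokens[start:start + span_len] = [mask_token]
    return tokens
```

Since each span (length ≥ 1) collapses to one mask token, the corrupted question is never longer than the original.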