Contrastive Multi-document Question Generation

Multi-document question generation focuses on generating a question that covers the common aspect of multiple documents. Such a model is useful in generating clarifying options. However, a naive model trained only on the targeted (‘positive’) document set may generate questions that are too generic and cover a larger scope than delineated by the document set. To address this challenge, we introduce a contrastive learning strategy where, given ‘positive’ and ‘negative’ sets of documents, we generate a question that is closely related to the ‘positive’ set but is far away from the ‘negative’ set. This setting allows generated questions to be more specific and related to the target document set. To generate such specific questions, we propose Multi-Source Coordinated Question Generator (MSCQG), a novel framework that includes a supervised learning (SL) stage and a reinforcement learning (RL) stage. In the SL stage, a single-document question generator is trained. In the RL stage, a coordinator model is trained to find optimal attention weights to align multiple single-document generators, by optimizing a reward designed to promote specificity of generated questions. We also develop an effective auxiliary objective, named Set-induced Contrastive Regularization (SCR), that improves the coordinator’s contrastive learning during the RL stage. We show that our model significantly outperforms several strong baselines, as measured by automatic metrics and human evaluation. The source repository is publicly available at ‘www.github.com/woonsangcho/contrast_qgen’.


Introduction
User queries on web search engines can sometimes be vague. Search engines may resolve this ambiguity by suggesting clarification options back to the user in the form of questions (Braslavski et al., 2017; Aliannejadi et al., 2019; Zamani et al., 2020). However, asking the right clarification questions is a challenging information-seeking task, given a plethora of possible questions (Rao and Daumé III, 2018, 2019; Qi et al., 2020). One workaround is to take informational cues from the search engine results given the initial query. The clarification options are then generated from non-ranked and non-overlapping thematic partitions of the search engine results. The whole pipeline is akin to pseudo-relevance feedback (Rocchio, 1971; Cao et al., 2008). This can significantly reduce the search space, and has the potential to generate correct clarification questions within the context (Cho et al., 2019b).
Figure 1: Non-contrastive and contrastive methods for multi-document question generation. Left: non-contrastive modeling, which takes a set of positive documents as input. However, model-generated questions from this method are rather generic and not specific to the input documents. Right: contrastive modeling, which considers both positive and negative document sets, and learns to generate questions that are more grounded on the positive document set.
This particular approach may involve three nontrivial phases: i) retrieval: gather the initial documents returned by the search engine; ii) partition: partition the documents into semantically similar clusters in an unsupervised manner; iii) multi-document question generation: generate a clarification question by finding an "overlap" among documents in each cluster. In principle, the clarification questions should be specific to each cluster rather than generic and bland; otherwise they run counter to the objective of clarification (Radlinski and Craswell, 2017). In this work, we focus on developing a multi-document question generator to generate cluster-specific questions in step iii). Nevertheless, we believe our approach can be readily applied to other multi-document text generation tasks such as summarization (Liu and Lapata, 2019) and response generation (Zhang et al., 2020).
We address this challenge by leveraging contrastive learning. Given a set of positive documents D + and a set of negative documents D − (where D − is nonetheless semantically close to D +), we propose a new strategy to generate a question that is semantically relevant to D + and far away from D −. Ideally, the model would use both D + and D − to identify distinguishing features between the two sets and constrain the generation to be specific to D +. The similarity between D + and D − makes the generation more challenging and forces the model to be as specific as possible in order to distinguish between the two sets. The comparison between contrastive and non-contrastive multi-document question generation is illustrated in Figure 1.
This task is particularly challenging because i) no direct supervised ground-truth multi-document question exists for given positive and negative sets of documents, and ii) the whole procedure involves multiple aspects including language understanding, inter-document information aggregation, coordinative planning and language generation. In theory, the generator can be trained using RL to maximize the chance that the generated question specifically retrieves the given document cluster. However, the space of possible sequences is prohibitively large, which results in large variance in RL (Lewis et al., 2017). To effectively reduce the search space of RL, we employ a hybrid supervised learning (SL) and RL strategy. We also propose a novel reward-shaping auxiliary objective, Set-induced Contrastive Regularization (SCR) (Section 2), which heuristically drives the generation closer towards D +, by minimizing the KL divergence between the hypothesis distribution and the distributions induced by D + and maximizing it against those induced by D −.
Our contributions are summarized below: i) We develop a novel Multi-Source Coordinated Question Generator (MSCQG) model that is trained using a hybrid hierarchical generation scheme. The document-specific generator is fine-tuned from GPT-2 and the inter-document coordinator is trained using reinforcement learning. ii) We introduce Set-induced Contrastive Regularization (SCR), an auxiliary regularizer that pushes MSCQG toward D + relative to D − while limiting the effect of D − in a principled manner. iii) Empirical results show that our model is able to generate more grounded and specific questions, significantly outperforming existing baseline models in automatic measures and human evaluation.

Method
Overview: The overview of our model is illustrated in Figure 2. The model consists of two major components: i) the document-specific generator, which generates a question from a single document and is fine-tuned from OpenAI GPT-2; ii) the inter-document coordinator, which integrates multiple-document information from the document-specific generator instances. The coordinator is trained using reinforcement learning after fixing the document-specific generator.
During the RL training, at each generation time step, each (positive and negative) document independently uses the same generator trained in i) to predict the next token. The coordinator learns to aggregate the resulting probabilities into a consensus probability by maximizing a reward function. The reward function is designed to encourage the generated question to tie to the positive set and to stay away from the negative set. The word newly generated from the consensus probability is appended to every document's input for the next time step.
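The decoding procedure described above can be sketched as a short loop. This is a minimal illustration rather than the authors' implementation: the callables `next_token_dist` (standing in for the fixed generator) and `aggregate` (standing in for the coordinator) are hypothetical, and greedy token selection is an assumption made only to keep the sketch deterministic.

```python
from typing import Callable, Dict, List, Sequence

Dist = Dict[str, float]  # token -> probability


def coordinated_decode(
    pos_docs: Sequence[str],
    neg_docs: Sequence[str],
    next_token_dist: Callable[[str, List[str]], Dist],
    aggregate: Callable[[List[Dist], List[Dist]], Dist],
    eos: str = "<eos>",
    max_len: int = 20,
) -> List[str]:
    """Decode one question while a single shared generator reads every document."""
    question: List[str] = []
    for _ in range(max_len):
        # The same fixed generator is applied to each document independently.
        pos = [next_token_dist(doc, question) for doc in pos_docs]
        neg = [next_token_dist(doc, question) for doc in neg_docs]
        # The coordinator merges the per-document distributions into one.
        consensus = aggregate(pos, neg)
        # Greedy choice (an assumption of this sketch).
        token = max(consensus, key=consensus.get)
        if token == eos:
            break
        # The chosen token extends the prefix shared by all documents.
        question.append(token)
    return question
```

Because the generator is only called through `next_token_dist`, the same loop works whether the per-document distributions come from a language model or from a toy stub used for testing.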

Document-specific Generator:
At the first pre-training stage, we load the publicly available GPT-2 (Radford et al., 2019) model as our underlying document-specific generator. The GPT-2 model leverages massive out-of-domain data and serves as a good initialization for generating grammatical and informative questions. Then, we further fine-tune the language model on MS-MARCO (Nguyen et al., 2016), with the selected document as input and the corresponding question as output.
Figure 2: System overview. The example is an illustration using fictitious tokens for ease of understanding. Our MSCQG model learns to attend with different weights and form a final aggregated distribution at each decoding time. The decision to enforce or penalize the negative set distributions in the aggregated distribution is controlled in a principled manner.
Ranking-based Rewards: Before moving to the RL training, we first describe how the reward signal is calculated from retrieval statistics produced by a BERT-based ranker (Nogueira and Cho, 2019) (Ranker). The Ranker, a state-of-the-art model * in the MS-MARCO document retrieval task (Nguyen et al., 2016), is trained to rank (document, question) pairs. This ranker assigns high scores to true positive document and question pairs. We assume the ranker delivers an accurate reward signal since it achieves good performance on the challenging MARCO retrieval task, which covers a vast range of general topics. Let q̂ be the question generated by the underlying generator block and coordinator with the positive and negative document sets (D + and D −) as the input.
We pair q̂ with each of the documents in the positive and negative sets, and evaluate the question-document pairs through the ranker for answer-relevancy. Using the scores and their memberships in D + or D −, we compute retrieval statistics, such as Precision@10 and mean Average Precision (mAP) (Zhu, 2004), which are candidate non-differentiable rewards R.
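As an illustration of how such retrieval statistics can be derived from ranker scores and set memberships, the sketch below computes Precision@k and average precision for a single generated question. The function names and the list-based inputs are assumptions for this example; mAP is then the mean of the average precision over questions.

```python
from typing import Sequence


def precision_at_k(scores: Sequence[float], is_positive: Sequence[bool], k: int) -> float:
    """Fraction of the k highest-scored documents that belong to the positive set."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sum(1 for i in order if is_positive[i]) / k


def average_precision(scores: Sequence[float], is_positive: Sequence[bool]) -> float:
    """Mean of the precision values measured at the rank of each positive document."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if is_positive[i]:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0
```

Both quantities depend on the scores only through the induced ranking, which is why they are non-differentiable and are used here as rewards rather than as direct training losses.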
* http://www.msmarco.org/leaders.aspx
Training an Inter-generator Coordinator via RL: Next, we train a coordinator system using policy gradient to optimize the reward described above. The separation between the generator and the coordinator aims to ease the RL training by significantly reducing the action space.
Note that the generator model is fixed during this stage. We find that using RL to train the entire generating pipeline yields large variance, since the action space is large and the auto-regressive nature of the generation process further amplifies such variance. Therefore, we fix the underlying generator component and stack our coordinator model on top of multiple instances of the underlying generator; the coordinator is trained using RL in isolation. Instead of training both the token-level GPT-2 and the document-level coordinator over multiple GPT-2 instances using RL, only the coordinator is trained using RL, a structure that dramatically reduces variance.
The coordinator is a Transformer-based (Vaswani et al., 2017) model, chosen for its attention capabilities across input documents. Unlike the Transformer Decoder (Liu et al., 2018), there is no causal mask. Instead, the coordinator model uses the hidden states, updated at every decoding step, from the underlying fine-tuned GPT-2 (Radford et al., 2019) language model generators.
We add a learned cluster embedding c i to the input document hidden states h i, similar to the learned positional embedding (Devlin et al., 2019), to indicate whether the source document i is in D + or D −.
The coordinator model consists of n recurrent transformer blocks (Vaswani et al., 2017), followed by three different feed-forward layers (FF w, FF v, and FF z) that output w, v, and z.
w and v are 10-dimensional attention weights, each summing to 1.0, over the positive documents D + and the negative documents D −, respectively. z parametrizes η, which determines how much the coordinator model penalizes, or sometimes reinforces, the weighted average of decoding distributions from the negative set D −. η is a simple heuristic variation of tanh whose image lies in (−1, 0.5) over all real numbers R. Thus, η is a damped penalization coefficient.
Given w, v, and z, we obtain the final question decoding distribution at decoding time t.
where θ denotes the coordinator's parameters, the subscript + is a ReLU (Nair and Hinton, 2010) operator that keeps only non-negatively weighted tokens, and C is the normalizing factor that converts the result into a distribution. The concatenation of each input document in D + and D −, the EOS token, and the partially decoded question word sequence is used to obtain new hidden states and next decoding distributions. The decoding process is repeated until the generation is complete.
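One way to read the description above is sketched here: the positive distributions are mixed with weights w, the negative mixture is scaled by η(z), negative entries are clipped by the ReLU, and the result is renormalized. Several details are assumptions of this sketch and not confirmed by the text: the particular scaled tanh used for `eta` (chosen only because its range is (−1, 0.5)), the sign convention by which η multiplies the negative mixture, and the function names.

```python
import math
from typing import List, Sequence


def eta(z: float) -> float:
    """Hypothetical damped coefficient: a scaled, shifted tanh with range (-1, 0.5)."""
    return 0.75 * math.tanh(z) - 0.25


def aggregate(
    pos_dists: Sequence[Sequence[float]],
    neg_dists: Sequence[Sequence[float]],
    w: Sequence[float],
    v: Sequence[float],
    z: float,
) -> List[float]:
    """Combine per-document next-token distributions into one distribution."""
    vocab_size = len(pos_dists[0])
    coeff = eta(z)
    mixed = []
    for tok in range(vocab_size):
        pos = sum(wi * dist[tok] for wi, dist in zip(w, pos_dists))
        neg = sum(vj * dist[tok] for vj, dist in zip(v, neg_dists))
        # ReLU: tokens whose combined weight is negative are dropped.
        mixed.append(max(pos + coeff * neg, 0.0))
    norm = sum(mixed)  # plays the role of the normalizing factor C
    if norm == 0.0:
        raise ValueError("all tokens were suppressed; nothing to normalize")
    return [m / norm for m in mixed]
```

With a strongly negative z the coefficient approaches −1 and tokens favoured by the negative set are suppressed; with a positive z it approaches 0.5 and those tokens are mildly reinforced, matching the "penalizes, or sometimes reinforces" behaviour described for η.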
Policy Gradient Loss: The policy gradient loss is defined as follows. With a complete generation q̂, a terminal retrieval-statistics reward is computed from the Ranker scores and score memberships, denoted R(q̂ | D +, D −). This reward weights the sum of log-likelihoods of generating the observed words o t given the generation so far q̂ <t, the underlying generator G, and the two document sets. We use oracle questions as the policy gradient baseline R baseline for variance reduction. Results using a different policy gradient baseline are in Appendix.
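A standard REINFORCE-with-baseline form is consistent with this description, and the sketch below assumes it: the terminal reward minus the baseline reward scales the summed token log-likelihoods. The function name and the scalar inputs are illustrative, and the exact loss used in the paper may differ in details such as normalization.

```python
from typing import Sequence


def policy_gradient_loss(
    token_log_probs: Sequence[float],
    reward: float,
    baseline_reward: float,
) -> float:
    """REINFORCE-style loss for one completed generation.

    token_log_probs: log-probability of each generated token under the
        coordinator's consensus distribution.
    reward: terminal retrieval statistic of the generated question.
    baseline_reward: the same statistic for the baseline (e.g. oracle) question.
    """
    advantage = reward - baseline_reward
    # Minimizing this raises the likelihood of generations that beat the baseline.
    return -advantage * sum(token_log_probs)
```

When the generated question scores below the baseline, the advantage is negative and the loss pushes probability mass away from the sampled tokens.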

Set-induced Contrastive Regularization:
We further propose an auxiliary objective to provide richer signals when optimizing the coordinator model. The intuition is that we would like to encourage the coordinator model to generate questions toward the positive set D + relative to the negative set D −. We name the regularizer Set-induced Contrastive Regularization (SCR) because the decoding distributions from D + and D − guide the coordinator to learn to make contrasts between the two sets. Although the decoding distributions from D + and D − are not gold supervision signals, modifying the distributional distance toward or away from them helps regulate specificity to D +.
The former idea can be formulated as minimizing the KL divergence against the positive set, evaluated at time step t. We minimize both the forward and the reverse KL divergence, since the forward KL does not penalize π θ for placing high mass where π i places little, and likewise for the reverse KL with the roles exchanged. The latter idea can be formulated as maximizing the KL divergence against the negative set, evaluated at time step t. However, we need to cap the negative set penalty rather than naïvely maximizing it, more restrictively if the positive and negative sets are semantically close. The intuition is that if the KL divergence against the negative set is already too large, then we do not penalize further. Our contrastive regularization function is defined accordingly, where T is the length of the completed generation, and ν t is the similarity measure between the positive and negative sets at decoding time t.
Negative Entropy Loss: We add a negative entropy loss L H across the attention weights w and v, averaged over T, to encourage the model to attend to all the documents rather than to a small subset of them, which would risk losing positive and negative set representational information.
We finally optimize the overall loss, in which λ 1,2,3 are the scaling hyper-parameters.
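The ingredients described above can be illustrated for a single decoding step. This is a sketch under explicit assumptions, not the paper's formulation: the constant `cap` is a hypothetical stand-in for the similarity-dependent cap (which in the text depends on ν t), averaging over documents is an assumption, and all function names are illustrative.

```python
import math
from typing import Sequence


def kl(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0.0)


def symmetric_kl(p: Sequence[float], q: Sequence[float]) -> float:
    """Forward plus reverse KL, so excess mass on either side is penalized."""
    return kl(p, q) + kl(q, p)


def scr_step(
    policy: Sequence[float],
    pos_dists: Sequence[Sequence[float]],
    neg_dists: Sequence[Sequence[float]],
    cap: float,
) -> float:
    """Pull the policy toward positive distributions, push it from negative ones.

    `cap` (hypothetical) bounds the negative-set term so that an already large
    divergence earns no further credit.
    """
    pull = sum(symmetric_kl(policy, d) for d in pos_dists) / len(pos_dists)
    push = sum(symmetric_kl(policy, d) for d in neg_dists) / len(neg_dists)
    return pull - min(push, cap)


def negative_entropy(weights: Sequence[float], eps: float = 1e-12) -> float:
    """Negative entropy of attention weights; lowest for uniform attention."""
    return sum(wt * math.log(wt + eps) for wt in weights if wt > 0.0)
```

Minimizing `scr_step` decreases the divergence to the positive set while rewarding divergence from the negative set only up to the cap, and minimizing `negative_entropy` favours attention spread across all documents.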

Experiments
Dataset: We use the MS-MARCO Q&A dataset (Nguyen et al., 2016) where, for a Bing query q, we consider the top-10 retrieved documents as our positive set D +. To get our negative set D −, we use the Conversational Search † dataset, which contains additional annotations for the same MS-MARCO Bing queries, to find a query q′ that is similar to q yet not a paraphrase, and consider the top-10 documents retrieved for q′ as our negative set D −. In total, we gather 100K train/10K dev/10K eval data points. Details of the pre-processing, usage of additional annotations from the secondary dataset, and experimental configuration are in Appendix.
† https://github.com/microsoft/MSMARCO-Conversational-Search
Automatic evaluation: The generated questions are evaluated through standard retrieval-based metrics: MRR and MRR10 (Voorhees, 1999; Radev et al., 2002), nDCG (Järvelin and Kekäläinen, 2002), precision, and mAP. These metrics are computed from the 10 positive and 10 negative documents (=: Out-Sample IR). In addition, as a standardized evaluation routine in the MS-MARCO Retrieval task, for each generated question, we use Lucene ‡ to retrieve the most relevant 100 MARCO documents via BM25 (Robertson and Zaragoza, 2009), use the retrieved document set and a trained model to rank (document, generated question) pairs, and thus compute the retrieval statistics (=: Search-Engine Augmented IR).
‡ https://lucene.apache.org/
Figure 3: Out-Sample IR: mAP among D + and D −.
Figure 4: Search-Engine Augmented IR: mAP.
Figure 3 shows that our model MSCQG PG+SCR+H outperforms the oracle questions by a small margin on the Out-Sample IR. In the larger retrieval evaluation using Lucene, it falls short of the oracle questions, but performs significantly better than all the considered baseline models, as shown in Figure 4.
Human evaluation: We conduct human evaluation through Amazon Mechanical Turk, where we evaluate questions generated by MSCQG, MSQG GPT2, and the oracle question on four criteria: fluency, relevancy, answerability, and overall. First, we randomly select 600 (d, q A, q B) tuples, where d is any document from D + and q A, q B are drawn from the three questions, and collect responses on which question is preferred over the other. Secondly, we evaluate 600 (d +, d −, q) tuples where, given a question, d + and d − are randomly chosen from D + and D −. This determines the questions' specificity to D + relative to D −. Each sample is judged by 3 crowd-sourced workers who passed a rigorous spam-detection screening, totalling 3,600 samples to obtain reliable results. For details, see Appendix.
Top-TFIDF@K: Why do we not simply retrieve the top question implied by the 10 positive documents? To test this, we design a retrieval baseline using the learned TF-IDF (Luhn, 1957; Jones, 1972; Salton and McGill, 1983) weights. This baseline re-evaluates the collection of retrieved questions from the corpus, gathered against each document in D + using TF-IDF, and retrieves the most relevant question. For design details, see Appendix.
Top-Frequent@K: Another retrieval model is to find an intersecting subset among all the 10 top-k question sets. For pseudo-code details, see Appendix.
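A rough sketch of a frequency-based selection in the spirit of Top-Frequent@K is given below. It is an assumption-laden illustration, since the pseudo-code referred to above is not reproduced here: each document contributes its top-k retrieved questions, and the question appearing in the most lists is returned, with ties broken by best rank and then alphabetically. The function name and tie-breaking rule are hypothetical.

```python
from collections import Counter
from typing import Dict, Optional, Sequence


def top_frequent(question_lists: Sequence[Sequence[str]], k: int) -> Optional[str]:
    """Return the question shared by the most per-document top-k lists."""
    counts: Counter = Counter()
    best_rank: Dict[str, int] = {}
    for ranked in question_lists:
        # dict.fromkeys removes duplicates within one list while keeping order.
        for rank, question in enumerate(dict.fromkeys(ranked[:k])):
            counts[question] += 1
            best_rank[question] = min(best_rank.get(question, rank), rank)
    if not counts:
        return None
    # Most lists first, then best rank, then alphabetical for determinism.
    return min(counts, key=lambda q: (-counts[q], best_rank[q], q))
```

A question found in all 10 lists lies in their intersection; when no such question exists, this sketch falls back to the one shared by the largest number of lists.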

Results and Analysis
Model comparison and ablation study: For simplicity, we abuse the term oracle and use oracle question to refer to the ground-truth question that retrieved D + when the dataset was constructed. However, these questions are not gold questions as they might not be the most relevant and specific questions to the given positive and negative sets. Table 1 shows that our proposed model is effective at generating questions given multiple documents. In particular, it shows that policy gradient or set-induced contrastive regularization alone is effective in improving performance. The coordinator performs better when optimized for both policy gradient and regularization objectives.
The retrieval results for the questions that initially clustered D + sets are presented. Note that these are not gold questions because in most cases not all the retrieved documents in D + answer the questions. For clarity of our presentation, we abuse the term and name them as oracle questions. Search-engine augmented IR evaluation shows that our methods are upper-bounded by the oracle MARCO questions.
Entropy regularization improves the search-engine augmented IR scores, in particular MRR. However, it is not crucial, as Table 2 shows. For additional results, see Table 4 in Appendix.
Model performance vs. similarities between D + and D −: cos sim(D +, D −) is approximated using the oracle questions that are available in the dataset. The similarity is computed as the cosine similarity of the two GEN-Encoder (Zhang et al., 2019) representations. Figures 3 and 4 show that our model-generated questions are more grounded on D + than the baseline model generations. The more similar the two sets D + and D −, the more difficult it is for the models, and even humans, to distinguish which document is more relevant, if not answerable, given the generated question. The model outperforms the baseline model uniformly across different similarities between D + and D −.
Role of D − by visualizing w, v, and z: Figures 5 and 6 show that our model MSCQG learns to gradually penalize D − as it sequentially generates words that are more grounded on D +. Notice the roughly uniform weights across D + but increasing penalization weights across D − over decoding time.
η, which is controlled by z, learns to encourage, rather than discourage, certain words during decoding. The displayed D − weights v are multiplied by the dampening factor −η(z) for interpretation purposes, so they do not necessarily sum to 1 (see Equation 9). We observe that words that are not semantically distinguishing between D + and D − are encouraged by the coordinator to maintain readability. For example, the weights for the word 'of' are mostly non-negative, whereas weights for other words are mostly negative. This indicates that the coordinator learns to selectively activate or suppress decoding of certain words by coordinating information from D + and D −.
Human judgments: Table 3 shows that our model significantly outperforms the strong baseline in every aspect. Furthermore, compared against the oracle questions, human judgments lead to a more favorable conclusion for our model-generated questions than the automatic metrics do, which are approximate yet reasonable metrics. The pairwise agreement between judges is 54% ± 1%. The Cohen's Kappa score is 0.19 ± 0.01. Note that this is a reasonable number given the "same" or ambiguous option in pairwise comparisons. The Ranker achieves a relatively high Pearson correlation of 0.6 with respect to human evaluation. For details, see Appendix.
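For readers unfamiliar with the two agreement statistics, the sketch below shows how pairwise agreement and Cohen's Kappa are computed for one pair of judges. The function names and the label encoding are illustrative assumptions; how per-pair values were aggregated into the reported means is not specified here.

```python
from collections import Counter
from typing import Hashable, Sequence


def pairwise_agreement(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Fraction of items on which two judges chose the same label."""
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Agreement between two judges corrected for chance agreement."""
    n = len(a)
    observed = pairwise_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: probability both judges pick the same label independently.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

Because Kappa subtracts chance agreement, a three-way choice that includes a "same" option can yield a modest Kappa even when raw agreement exceeds one half.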
(2016) introduced a new type of ensemble of NMT systems which take multiple sentences in different languages as input and output a translation into a single language. Each NMT system is trained on a mono-lingual source to target language translation dataset. Garmash and Monz (2016) further developed the multi-source encoder-decoder framework for multi-lingual NMT systems, by learning to assign uneven attention weights, called expert combination weights. To handle multi-source input, we take a similar multi-source encoder-decoder approach for our coordinator model. For such multi-lingual translation tasks, the target translation is available. However, in our task of generating multi-document questions, the target does not exist, which makes it more challenging; thus we train via RL rather than supervised learning.
Figure 5 shows that the model learns to push the sequential generation semantics more toward D + by gradually penalizing D −. Figure 6 shows that frequent and semantically less distinguishing words such as 'of' are encouraged even by D −, which empirically aligns with our intuition for TF-IDF.

Question Generation:
Most prior work on question generation has been on a single document, i.e., given a document and an answer phrase in the document, generate a question that is answered by the answer phrase (Heilman, 2011; Rus et al., 2010). For a survey, see Pan et al. (2019). However, in our work, we aim to generate a multi-document question that is answerable by multiple input documents. Fan et al. (2018) propose a visual question generation model to generate natural questions about images using reinforcement learning, where they use naturalness and human-likeness as reward signals. In our work, we use retrieval statistics, similar to Nogueira and Cho (2017), derived from a document-question ranker as the reward for training our coordinator model in isolation, rather than the entire generating pipeline.
2013) is a simplified variation of NCE loss. Recently, contrastive learning has also been employed in learning sentence representations (Clark et al., 2020). To the best of our knowledge, we are the first to leverage contrastive learning and establish set-induced penalization in the context of question generation.

Conclusion
We proposed a novel coordinator model that can generate questions that are more grounded on documents of interest. This coordinator model consists of transformer blocks, and is trained through reinforcement learning and an effective auxiliary: Set-induced Contrastive Regularization.
The rewards are derived from a publicly available state-of-the-art pre-trained ranker (Section 2) to compute retrieval statistics among D + and D − . Our novel contrastive regularization induces generations to be more specific to D + than to D − while limiting the effect of D − in a principled manner by accounting for their semantic similarity.
We evaluate a generated question from each model by assessing how many of the input D + documents among a pool of relevant documents it can retrieve, based on the (document, question) ranker that is trained on the same wide-ranging domain. For a comprehensive automatic evaluation of the models, retrieval statistics are computed from a larger pool of relevant documents gathered via BM25. Experiment results show that our model significantly outperforms previous neural generation as well as strong retrieval baselines in both automatic and human metrics.
Given the promising comprehensive results of the proposed models and training approach, we can extend the framework with appropriate modifications and train via imitation learning algorithms; this is left for future work.
negative or false negative). This label information is used to train the underlying generator block of their MSQG model (Cho et al., 2019b). A single selected MS-MARCO (Nguyen et al., 2016) document is fed into a long short-term memory-based sequence-to-sequence model to output the corresponding question. An example of the input selected document is: The House of Representatives shall be composed of Members chosen every second Year by the People of the several States.... Article I, Section 2, Clause 1, and the corresponding question is: how long is a term for a member of the house of representatives. We chose this dataset since the questions that retrieve the top-10 documents can shed light on the relative performance of our model. To find two 10-document sets D + and D − that are similar, we find a pair of questions that are semantically similar. However, computing pair-wise similarities among roughly 1 million questions is computationally intractable. Therefore, we leverage another dataset, MS-MARCO Conversational Search ¶, an artificially constructed public dataset that simulates user search sequences.

Each data point or session is an artificial sequence of similar questions grounded on true user behavior. Since many similar questions are grouped together, we can reduce the search space for finding pairs of similar questions. We then take pairs of questions with high semantic similarity (≥ 0.7) that are nonetheless not paraphrases (≤ 0.85, following their classification criteria), as measured by GEN-Encoder (Zhang et al., 2019), and whose two associated 10-document sets do not overlap, primarily for prototype evaluation convenience. For deployment models, one may choose to allow overlaps between the two sets for more challenging learning. From the two similar 10-document sets, either one is set to positive D + or negative D −, yielding two data points for our derived dataset.
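The pair-selection rule above can be sketched as a simple filter over the questions of one session. The similarity function is passed in as a hypothetical callable (standing in for a GEN-Encoder cosine similarity), and the function name, argument names, and data layout are assumptions for this example; only the 0.7 and 0.85 thresholds and the no-overlap condition come from the text.

```python
from itertools import combinations
from typing import Callable, Dict, List, Sequence, Set, Tuple


def build_contrastive_pairs(
    session_questions: Sequence[str],
    docs_for: Dict[str, Set[str]],
    similarity: Callable[[str, str], float],
    low: float = 0.7,
    high: float = 0.85,
) -> List[Tuple[str, str]]:
    """Return (positive question, negative question) pairs from one session."""
    examples: List[Tuple[str, str]] = []
    for q_a, q_b in combinations(session_questions, 2):
        sim = similarity(q_a, q_b)
        # Keep pairs that are similar but not similar enough to be paraphrases.
        if not (low <= sim <= high):
            continue
        # Skip pairs whose retrieved document sets share any document.
        if docs_for[q_a] & docs_for[q_b]:
            continue
        # Each qualifying pair yields two data points, one per direction.
        examples.append((q_a, q_b))
        examples.append((q_b, q_a))
    return examples
```

Restricting the comparison to questions within the same session is what keeps the number of similarity computations manageable compared with all pairs over the full question corpus.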
These pre-processing steps yield 346,215 data points, each of which contains a pair of positive and negative questions, and positive and negative 10-document sets. Training MSCQG on the entire dataset requires processing about 7 million MARCO documents. This is computationally intensive and takes about two days on 8 Nvidia Tesla V100 GPU cards for a single epoch. Therefore, for building small research prototypes and benchmarks, we will also release a subset of the data that consists of 100K/10K/10K training, development, and evaluation data points.
¶ https://github.com/microsoft/MSMARCO-Conversational-Search

Appendix B. Data Example
Oracle question for D + : number of saturn's moons
Oracle question for D − : uranus how many moons
Positive Set D + :
1. moons of saturn. there are 62 moons orbiting saturn. the moons of saturn vary not only in size but also in composition and shape. the largest of the moons of saturn is the aptly named titan, more than 5,000 km across and is bigger than mercury. there are 7 major moons of saturn and the rest are grouped based on the mythology from which it is taken.
2. iapetus with a diameter of 1,470 km, it is the 3rd largest moon of saturn. it was discovered by giovanni cassini in 1671. it has a distinct feature of having a bright and dark hemisphere. dione the 4th largest moon of saturn named after a vague character in greek mythology.
3. titan is the largest of saturn's moons and the first to be discovered. titan is the only moon in the solar system known to have a significant atmosphere. nitrogen and methane extend around the moon 10 times as far into space as earth's atmosphere, sometimes falling to the surface in the form of methane rain.
4. saturn has at least 150 moons and moonlets, 53 of which have formal names. titan, the largest, comprises more than 90% of the mass in orbit around saturn, including the rings. saturn's second-largest moon, rhea, may have a tenuous ring system of its own, along with a tenuous atmosphere.
5. their journeys around the ringed planet average from half an earth day to just over four earth years. saturn's moons formed early in the history of the solar system. one of the moons, titan, makes up 96 percent of the mass orbiting the planet. scientists think that the system may have originally housed two such moons, but the second broke up, creating the debris that formed the rings and smaller, inner moons.
6. saturn has a prominent ring system that consists of nine continuous main rings and three discontinuous arcs and that is composed mostly of ice particles with a smaller amount of rocky debris and dust. sixty-two moons are known to orbit saturn, of which fifty-three are officially named.
7. sixteen of the moons are tidally locked, with one face permanently turned toward saturn. the first moon was discovered in 1655. over the next 200 years, the other seven major satellites were spotted. by 1997, astronomers on earth had found 18 moons in orbit around the planet.
8. saturn is the sixth planet from the sun and the second-largest in the solar system, after jupiter. it is a gas giant with an average radius about nine times that of earth. although only one-eighth the average density of earth, with its larger volume saturn is just over 95 times more massive.
9. this temporary name usually consists of the year of discovery and a number indicating the order of discovery in that year. in the case of saturn's moons, these provisory names follow the format s/2005-s1, s/2005-s2 etc. the first s (before the slash) is for saturn. the second s (after the dash) is for satellite.
10. this does not include the hundreds of moonlets comprising the rings. titan, saturn's largest moon, and the second-largest in the solar system, is larger than the planet mercury, although less massive, and is the only moon in the solar system to have a substantial atmosphere.
Negative Set D − :
11. uranus has 27 moons that we know of. five of the moons are large and the rest are much smaller. the five large moons are called miranda, ariel, umbriel, titania, and oberon. titania is the largest moon of uranus and it is covered with small craters, a few large craters, and very rough rocks. ariel is the brightest moon of uranus and has canyons and valleys as well as a lot of craters. umbriel is very dark.
12. uranus can't seem to catch a break these days. besides spinning on its side like the drunkard of the solar system and being the butt of everyone's jokes, new research suggests several of its tiny moons will collide in a million years. uranus can't seem to catch a break these days.
13. the gas giant uranus is the third largest planet in our solar system, has many moons, a ring system, and composed of gases and ices. universe today space and astronomy news login
14. the researchers used cressida's mass and orbit to determine its possible doom. since uranus' 27 moons are tightly packed together, the team posits that in a million years, cressida will likely have a deadly encounter with one of its neighboring moons, called desdemona. previous research and simulations suggest cupid and belinda will also probably smack into each other some time between 1,000 and 10 million years from now.
15. puck, at 162 km, is the largest of the inner moons of uranus and the only one imaged by voyager 2 in any detail while puck and mab are the two outermost inner satellites of uranus. all inner moons are dark objects.
16. uranus, which takes its name from the greek god of the sky, is a gas giant and the seventh planet from our sun. it is also the third largest planet in our solar system, ranking behind jupiter and saturn. like its fellow gas giants, it has many moons, a ring system, and is primarily composed of gases that are believed to surround a solid core.
17. in 1986, the voyager 2 spacecraft hit the jackpot while studying uranus and discovered 10 other moons, including desdemona and cressida. since then, hubble observations have helped bring that number up to 27 for now.
18. at an average distance of 3 billion km from the sun, it takes uranus roughly 84 years (or 30,687 days) to complete a single orbit of the sun. 1 the rotational period of the interior of uranus is 17 hours, 14 minutes. as with all giant planets, its upper atmosphere experiences strong winds in the direction of rotation.
19. uranus' size, mass and orbit: with a mean radius of approximately 25,360 km, a volume of 6.833×10^13 km^3, and a mass of 8.68×10^25 kg, uranus is approximately 4 times the sizes of earth and 63 times its volume.
20. uranus has 27 known satellites, which are divided into the categories of larger moons, inner moons, and irregular moons (similar to other gas giants). the largest moons of uranus are, in order of size, miranda, ariel, umbriel, oberon and titania.

Appendix C. Retrieval Baselines
Top-TFIDF@K and Top-Frequent@K. The retrieval baselines provide a reference point for comparing MSQG (Cho et al., 2019b) with our coordinator model. We use Lucene to retrieve questions, rather than documents, from a corpus composed of the 1,010,916 MS-MARCO questions. The questions retrieved by the Top-TFIDF@K and Top-Frequent@K baselines are evaluated in the same manner as the generated ones.
For the intersection of the 10 retrieved question sets to be non-empty, K should be sufficiently large. However, even for K = 1000, the intersection was empty in almost all cases. We therefore relax the intersection over all 10 retrieved sets to finding the question that occurs most frequently across the 10 top-K retrieved sets. We found K = 100 to be an appropriate value: not so large that remotely relevant questions are retrieved, and not so small that the retrieval sets differ vastly. If multiple questions have the same count, we choose one at random.
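The relaxed selection described above (the most frequent question across the retrieved sets, with ties broken at random) can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `top_frequent` and the `retrieved_sets` argument are hypothetical, and the Lucene retrieval step is assumed to have already produced one list of questions per positive document.

```python
import random
from collections import Counter


def top_frequent(retrieved_sets, rng=None):
    """Return the question that appears in the most retrieved sets.

    retrieved_sets: one iterable of question strings per positive document
    (a hypothetical stand-in for the top-K Lucene results).
    """
    rng = rng or random.Random()
    counts = Counter()
    for questions in retrieved_sets:
        # Count a question at most once per document's retrieved set.
        counts.update(set(questions))
    if not counts:
        return None
    best = max(counts.values())
    # Sort the tied candidates so that a seeded rng gives reproducible picks.
    tied = sorted(q for q, c in counts.items() if c == best)
    return rng.choice(tied)
```

Passing a seeded `random.Random` makes the tie-breaking reproducible, which may be convenient when comparing baseline runs.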

Algorithm 1 Top-TFIDF@K
Input: D+, corpus C
For each d ∈ D+, retrieve the top-K questions in C;
Using all unique retrieved questions Q, compute TF-IDF;
Let Ψ be the TF-IDF transform operator;
$q^* = \arg\max_{q \in Q} \sum_{d \in D^+} \mathrm{cos\_sim}(\Psi_q, \Psi_d)$;
Output: q*
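Algorithm 1 can be sketched in plain Python as below. Several details are assumptions made only for illustration: whitespace tokenization, a smoothed IDF fitted on the candidate questions Q, documents projected onto the question vocabulary, and the helper names `fit_idf`, `transform`, `cosine`, and `top_tfidf`. The retrieval step is again assumed to be done beforehand.

```python
import math
from collections import Counter


def fit_idf(texts):
    """Fit a smoothed IDF table on the candidate questions (assumed variant)."""
    n = len(texts)
    df = Counter(tok for t in texts for tok in set(t.lower().split()))
    return {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}


def transform(text, idf):
    """Sparse TF-IDF vector; tokens outside the fitted vocabulary are dropped."""
    tf = Counter(tok for tok in text.lower().split() if tok in idf)
    return {tok: c * idf[tok] for tok, c in tf.items()}


def cosine(u, v):
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_tfidf(positive_docs, retrieved_sets):
    """Pick the retrieved question with the largest summed cosine similarity."""
    # Q: all unique questions retrieved for any positive document.
    candidates = sorted({q for s in retrieved_sets for q in s})
    if not candidates:
        return None
    idf = fit_idf(candidates)
    doc_vecs = [transform(d, idf) for d in positive_docs]
    scores = [
        sum(cosine(transform(q, idf), dv) for dv in doc_vecs)
        for q in candidates
    ]
    # On ties, the first candidate in sorted order is returned.
    return candidates[scores.index(max(scores))]
```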

Algorithm 2 Top-Frequent@K
Input: D+, corpus C
For each d ∈ D+, retrieve the top-K questions in C;
Let S_d be the retrieved set for each d, and let Q be the set of all unique retrieved questions;
$q^* = \arg\max_{q \in Q} \sum_{d \in D^+} \mathbb{1}[q \in S_d]$;
Output: q*

Appendix D. Experiment Configurations
Document-specific GPT-2 Generator: From each document i, the generator yields its final-layer hidden state h_i ∈ R^H (H = 768 is the hidden dimension) and a document-specific discrete output distribution π_i ∈ R^V (V = 50257 is the vocabulary dimension) from the learned language model head.
Coordinator: The input size is 20, with the dimensionality of the embeddings and hidden states set to 768. The number of recurrent layers is 2, with 4 attention heads in each layer. The epsilon value used in the layer normalization is set to 1e-5. The number of cluster embeddings is 2 (positive or negative). The standard deviation of the truncated normal initializer for weight matrices is 0.02. We set (λ_1, λ_2, λ_3) = (1.0, 100.0, 0.1). We performed a coarse hyper-parameter search over values equidistant in log-scale for each of λ_1, λ_2, and λ_3, and used the best configuration. The maximum generation length is 20 tokens. We use the BERT (Devlin et al., 2019) version of the Adam optimizer (Kingma and Ba, 2014) with a weight decay of 0.01 and a learning rate of 1e-5.
We trained the coordinator model by maximizing Precision@10, with oracle questions as the policy gradient baseline. It is reasonable to weigh the documents unevenly because oftentimes not all of the top-10 documents retrieved from the Bing search engine share the same content. Thus, we leave it to the model to learn the optimal attention weights among the positive and negative sets that produce a more grounded question. Additional experiment results using a different baseline, self-critic (Rennie et al., 2017), are shown in Table 4. They show that our proposed framework is effective with either of the two policy gradient baselines. Conceptually, the coordinator model would generate a question that can better retrieve the documents from the positive set, aided by the negative set.
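The role of the baseline in the reward signal can be illustrated with a short Python sketch. It assumes that Precision@10 is the fraction of the top-10 documents retrieved with a question that belong to the positive set, and that the learning signal is the sampled question's reward minus the baseline question's reward (from an oracle question, or from a greedily decoded one in the self-critic variant). The function names and arguments are hypothetical.

```python
def precision_at_k(ranked_doc_ids, positive_ids, k=10):
    """Fraction of the top-k retrieved documents that are in the positive set."""
    if k <= 0:
        raise ValueError("k must be positive")
    positives = set(positive_ids)
    hits = sum(1 for doc_id in ranked_doc_ids[:k] if doc_id in positives)
    return hits / k


def advantage(sample_reward, baseline_reward):
    """Reward of the sampled question relative to the baseline question.

    A positive value means the sampled question retrieved the positive set
    better than the baseline did, so its log-probability would be increased.
    """
    return sample_reward - baseline_reward
```

With an oracle baseline, the advantage is positive only when the sampled question retrieves the positive set better than the human-written question does; the self-critic variant needs no oracle question at all.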

Appendix E. Human Evaluation Details
We performed two human evaluations. In the first experiment, we showed judges one randomly selected positive document, about 300 words long, followed by a pair of questions drawn from the three sources. Judges were asked which of the two questions they preferred based on four criteria. For each pair among the three sources, every judge evaluated the same 200 random samples (600 samples across the 3 judges), totalling 1,800 samples.
In the second experiment, human annotators evaluated contrastive ability on 1,800 samples, each consisting of one question followed by two documents each from the positive and negative sets. Note that our model is trained to generate questions while accounting for the negative set.
The results were averaged across all samples and judges.
For computing the Pearson correlation between the ranker and human evaluation results, we map Option A preferred → 0, Same → 0.5, and Option B preferred → 1, accounting for the random assignment between A and B. This projection ensures that the two metrics share the same range (between 0 and 1). We then compute the correlation between the two sets of results.

Table 4: Results using the self-critic (Rennie et al., 2017) baseline in the policy gradient, applicable to datasets with no oracle questions. It shows that our framework is also effective using a different baseline. The superscript null-neg denotes models that do not use negative attentions when generating questions. This shows the importance of the negative set in promoting specificity in the generated question. It further corroborates that the non-uniform weighted-sum scheme among D+ improves performance: not all documents in D+ revolve around the same topic, and the model learns to address this property of the dataset by using unequal weights to generate a more representative question.
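The preference-to-score mapping used for the Pearson correlation in Appendix E can be sketched in Python as follows. This is only an illustration under stated assumptions: the label strings, the `swapped` flag (one possible way to undo the random A/B assignment), and the function names are hypothetical, and the correlation is computed from the textbook formula rather than with any particular library.

```python
import math

# Human judgment labels mapped onto [0, 1], as described in the text.
PREFERENCE_SCORE = {"A": 0.0, "same": 0.5, "B": 1.0}


def to_score(choice, swapped=False):
    """Map a judgment to [0, 1]; mirror it if A and B were shown swapped."""
    score = PREFERENCE_SCORE[choice]
    return 1.0 - score if swapped else score


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

Here the human scores would come from `to_score`, and the ranker scores are assumed to already lie in [0, 1] so that both inputs to `pearson` share the same range.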