Open-domain clarification question generation without question examples

An overarching goal of natural language processing is to enable machines to communicate seamlessly with humans. However, natural language can be ambiguous or unclear. In cases of uncertainty, humans engage in an interactive process known as repair: asking questions and seeking clarification until their uncertainty is resolved. We propose a framework for building a visually grounded question-asking model capable of producing polar (yes-no) clarification questions to resolve misunderstandings in dialogue. Our model uses an expected information gain objective to derive informative questions from an off-the-shelf image captioner without requiring any supervised question-answer data. We demonstrate our model’s ability to pose questions that improve communicative success in a goal-oriented 20 questions game with synthetic and human answerers.


Introduction
Human-machine interaction relies on accurate transfer of knowledge from users. However, natural language input can be ambiguous or unclear, giving rise to uncertainty. A fundamental aspect of human communication is collaborative grounding, or seeking and providing incremental evidence of mutual understanding through dialog. Specifically, humans can correct for uncertainty through cooperative repair (Clark, 1996; Purver et al., 2002; Arkel et al., 2020), which involves interactively asking questions and seeking clarification. Making and recovering from mistakes collaboratively through question-asking is a key ingredient in grounding meaning and therefore an important feature in dialog systems (Benotti and Blackburn, 2021). In this work, we focus on the computational challenge of generating clarification questions in visually grounded human-machine interactions.
One popular approach is to train an end-to-end model to map visual and linguistic inputs directly to questions (Yao et al., 2018; Das et al., 2017). This approach is heavily data-driven, requiring large annotated training sets of questions under different goals and contexts. Another approach has drawn from work on active learning and Optimal Experiment Design (OED) in cognitive science to search for questions that are likely to maximize expected information gain from an imagined answerer (Wang and Lake, 2019; Lee et al., 2018; Misra et al., 2018; Rao and Daumé III, 2018; Rothe et al., 2017; Kovashka and Grauman, 2013). Much of this work has relied on large-scale question-answer datasets (Kumar and Black, 2020; de Vries et al., 2017) for training or retrieval to propose candidate questions or evaluate their expected utility. Others, like Yu et al. (2020), derive questions from attribute annotations for domain-specific systems.
In this paper, we address an open-domain setting where one cannot rely on an immediate grounding of the meaning of questions in the target domain (in contrast to end-to-end approaches, which assume examples of questions to train on, or semantic parsing approaches, which assume a logical form for questions). Our key contribution is a lightweight method to ground question semantics in the open image domain without observing question examples. Instead, our framework builds a visually grounded question-asking model from image captioning data, deriving question selection and belief updating without existing semantics. Our model generates candidate polar questions, arguably the most common form of clarification in dialogue (Stivers, 2010), by applying rule-based linguistic transformations to the outputs of a pretrained image captioner. We then use self-supervision to train a response model that predicts the likelihood of different answers. Given these predictions, we estimate the expected information gain of each question and select the question with the highest utility. We demonstrate our method's ability to pose questions that improve communicative success in a question-driven communication game with synthetic and human answerers.
Figure 2: A set of candidate questions is produced by our question generator and then ranked according to expected utility in the question selector module. After posing the highest-ranked question and receiving an answer, the belief distribution over images is updated in the answer handler module. These updated beliefs are then either used to guess the target image or fed back to the question selection module for the process to be repeated.

20 Questions Task
We study interactions between questioners and answerers in a visually grounded 20-questions paradigm (see Fig. 1). Both agents are shown a set of k images as a context (k = 10 in Fig. 1). One of these images is privately indicated to the answerer as the target (e.g., bottom row, center), but remains unknown to the questioner. The questioner's goal is therefore to select questions that allow them to identify this target based on responses from the answerer. After a maximum of 20 questions, the questioner must make a guess (i.e., a k-way classification). This task can be viewed as the most straightforward extension of a signaling game (Lewis, 1969) to allow for interactive clarification and repair. To approximate the setting of natural "clarification questions", we also consider games that begin with a description of the target. Critically, the appropriate question changes depending on the context of objects and previous information provided by the answerer.

Model
Our model (Fig. 2) maintains a belief distribution, $p(y \mid x_t)$, about which image $y$ in the set of images $Y$ is the target. This distribution is conditioned on the history of the interaction, $x_t = (a_1, q_1, \ldots, a_t, q_t)$, which includes all questions, $q$, and answers, $a$, exchanged up to the current step, $t$. Our model is defined in terms of three basic components. At each interaction step, it must generate a set of candidate questions, select one of these candidates based on expected information gain, and finally update its beliefs based on the answer.
Question Generator. To generate questions without question examples, we must derive suitable candidates using an alternative method. Specifically, we suggest using a pre-trained image captioner to produce a list of candidate captions, which can then be programmatically transformed into question form. We begin by producing a list of captions for each image y ∈ Y and decomposing each of these captions into multiple polar questions according to a constituency parse, obtained using the Berkeley Neural Parser (Kitaev et al., 2019). We then transform each noun phrase (NP) subtree in each caption's constituency tree into a polar question ('Are there <NP>?' with indefinite articles and plurality chosen for agreement). Using this procedure, we generate an average of 10 candidate questions from each caption (see Appendix A for examples).
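The decomposition step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the constituency tree is written by hand as nested tuples rather than produced by the Berkeley Neural Parser, the function names are hypothetical, and the agreement rules (head noun taken as the last noun of the first noun run, "an" before a vowel letter, "Is there a ...?" for singular heads) are simple heuristics chosen for the example.

```python
# Toy constituency tree: (label, child, ...) with leaves of the form (POS_tag, word).
# ASSUMPTION: a real system would obtain this tree from a constituency parser.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
PLURAL_TAGS = {"NNS", "NNPS"}


def is_leaf(tree):
    return len(tree) == 2 and isinstance(tree[1], str)


def leaves(tree):
    """Return the (tag, word) leaves of a tree, left to right."""
    if is_leaf(tree):
        return [tree]
    out = []
    for child in tree[1:]:
        out.extend(leaves(child))
    return out


def noun_phrases(tree):
    """Collect every NP subtree, including nested ones, in pre-order."""
    if is_leaf(tree):
        return []
    found = [tree] if tree[0] == "NP" else []
    for child in tree[1:]:
        found.extend(noun_phrases(child))
    return found


def np_to_question(np_tree):
    """Turn one NP subtree into a polar question (heuristic agreement)."""
    tagged = leaves(np_tree)
    if tagged and tagged[0][0] == "DT":  # drop a leading determiner
        tagged = tagged[1:]
    tags = [t for t, _ in tagged]
    words = [w for _, w in tagged]
    start = next((i for i, t in enumerate(tags) if t in NOUN_TAGS), None)
    if start is None:  # no noun found, so no question is produced
        return None
    head = start
    while head + 1 < len(tags) and tags[head + 1] in NOUN_TAGS:
        head += 1  # the last noun of the run is treated as the head
    phrase = " ".join(words)
    if tags[head] in PLURAL_TAGS:
        return f"Are there {phrase}?"
    article = "an" if words[0][0].lower() in "aeiou" else "a"
    return f"Is there {article} {phrase}?"


def questions_from_caption(tree):
    """One candidate question per distinct NP subtree of a caption."""
    questions = []
    for np_tree in noun_phrases(tree):
        question = np_to_question(np_tree)
        if question is not None and question not in questions:
            questions.append(question)
    return questions


# Hand-built parse of the caption "a red square above two blue circles".
CAPTION_TREE = (
    "NP",
    ("NP", ("DT", "a"), ("JJ", "red"), ("NN", "square")),
    ("PP", ("IN", "above"),
     ("NP", ("CD", "two"), ("JJ", "blue"), ("NNS", "circles"))),
)

if __name__ == "__main__":
    for q in questions_from_caption(CAPTION_TREE):
        print(q)
```

Because nested noun phrases are kept alongside the full phrase, a single caption yields questions of varying specificity, which is the property the question selector relies on.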
Question Selector. To determine the most informative question at turn $t + 1$, we estimate expected information gain, $\mathrm{EIG}(y, a; q, x_t)$, for every question in the candidate set $Q$ (after a question is asked, it is removed from the set). EIG is defined as the change in entropy of the distribution over images after observing an answer $a \in A(q)$ to question $q$. Because the initial entropy is the same for every question, maximizing the EIG is equivalent to minimizing the expected conditional entropy of the belief distribution under possible answers. Because different answers are expected given different targets $y$, we marginalize over $a$ inside the entropy (Eq. 1). The distribution $p(a \mid q, y)$ represents predictions about how the answerer will respond to a question when $y$ is the target. We do not have access to a ground-truth answerer model, so we amortize these predictions by training an answer classifier. We introduced a self-supervision objective, pairing target images either with questions derived from them ('yes' answers) or with questions derived from other images ('no' answers). This data-generation method may occasionally yield a false negative when the question sampled for a 'no'-labelled question-image pair coincidentally does apply to the image; however, these samples represent a minority of the training data. We then trained a logistic classifier using cross-entropy loss on concatenated image and caption embeddings obtained from a CNN and RNN encoder, respectively. This classifier predicts yes vs. no answers for any unseen pair $(y, q)$ with 94% accuracy on held-out, manually labelled datapoints.
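The selection objective described above can be written out explicitly. The following is a reconstruction from the definitions in the text, not necessarily the exact typeset form of the original equation. It assumes that the answer depends on the history only through the question and the target, so that $p(a \mid q, y, x_t) = p(a \mid q, y)$.

```latex
\begin{align}
\mathrm{EIG}(y, a; q, x_t)
  &= H(y \mid x_t)
   - \mathbb{E}_{p(a \mid q, x_t)}\big[ H(y \mid x_t, q, a) \big], \\
q^{*}
  &= \arg\max_{q \in Q} \mathrm{EIG}(y, a; q, x_t)
   = \arg\min_{q \in Q}\;
     \mathbb{E}_{p(y \mid x_t)}\,
     \mathbb{E}_{p(a \mid q, y)}
     \big[ -\log p(y \mid x_t, q, a) \big].
\end{align}
```

The second equality drops $H(y \mid x_t)$, which does not depend on $q$, and rewrites the joint distribution over answers and targets as $p(a \mid q, x_t)\, p(y \mid x_t, q, a) = p(y \mid x_t)\, p(a \mid q, y)$.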
Answer Handler. Finally, after obtaining an answer $a$, our model must update its beliefs for the subsequent time step (and anticipate this update for Eq. 1). The belief update is given by Bayes' rule: $p(y \mid x_t, q, a) \propto p(a \mid x_t, q, y)\, p(q \mid x_t, y)\, p(y \mid x_t)$. The first term, $p(a \mid x_t, q, y)$, can be simplified to our amortized answer prediction model described above by assuming that the answer is independent of past interactions. The second term is given by the deterministic question selector model described above. The third term is given by the belief distribution on the previous time step. The initial belief distribution is either uniform, $p(y \mid x_0) \propto 1$, or, when an initial description $u$ is provided, proportional to the utterance likelihood under the captioning model, $p(y \mid x_0) \propto p(u \mid y)$.
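The belief update and the selection rule that anticipates it can be sketched together. This is an illustrative sketch, not the authors' code. It assumes polar answers only, represents the answer model as a hypothetical dictionary mapping each question to a list of "yes" probabilities (one per image), measures entropy in bits, and treats the question-selector term as constant across images, so that it cancels in the normalization.

```python
import math


def entropy(belief):
    """Shannon entropy (in bits) of a belief distribution over images."""
    return -sum(p * math.log2(p) for p in belief if p > 0)


def update_belief(belief, p_yes, answer):
    """Bayes' rule: posterior(y) is proportional to p(answer | q, y) * prior(y)."""
    likelihood = [p if answer == "yes" else 1.0 - p for p in p_yes]
    unnorm = [l * b for l, b in zip(likelihood, belief)]
    z = sum(unnorm)
    if z == 0:  # the answer had zero predicted probability; keep the prior
        return list(belief)
    return [u / z for u in unnorm]


def expected_posterior_entropy(belief, p_yes):
    """Expected entropy of the posterior, averaged over possible answers."""
    total = 0.0
    for answer in ("yes", "no"):
        p_answer = sum(
            b * (p if answer == "yes" else 1.0 - p)
            for b, p in zip(belief, p_yes)
        )
        if p_answer > 0:
            total += p_answer * entropy(update_belief(belief, p_yes, answer))
    return total


def select_question(belief, answer_model):
    """Pick the question with minimal expected posterior entropy (max EIG)."""
    return min(
        answer_model,
        key=lambda q: expected_posterior_entropy(belief, answer_model[q]),
    )


if __name__ == "__main__":
    belief = [0.25, 0.25, 0.25, 0.25]          # uniform prior over 4 images
    answer_model = {                            # hypothetical p(yes | q, y)
        "Is there a dog?": [1.0, 1.0, 0.0, 0.0],
        "Is there a cat?": [1.0, 0.0, 0.0, 0.0],
        "Is there a sky?": [1.0, 1.0, 1.0, 1.0],
    }
    q = select_question(belief, answer_model)
    print(q, update_belief(belief, answer_model[q], "yes"))
```

In this toy context, the question that splits the images evenly is preferred over one that applies to a single image or to all of them, mirroring the intuition behind the expected information gain objective.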

Experiments
We evaluated our question-asking framework in grounded interactions with both synthetic and human answerers.

Simulations on synthetic datasets
Before deploying our model in interactions with human speakers, we examined its performance on synthetic datasets where we could carefully control the answerer. We examined two domains: Shapeworld (Kuhnle and Copestake, 2017), a simple artificial dataset of images of randomly colored shapes paired with captions from a vocabulary of 15 words labeling the possible colors and shapes, and MS COCO (Lin et al., 2015), a more naturalistic dataset containing images of everyday scenes paired with captions elicited from human annotators. Because previous approaches have typically relied on closed-domain question-answer datasets or hand-built question semantics, they are incompatible with our 'open domain' setting. Instead, we compare our full model's performance against several model variants and strong, general-purpose search baselines: a full caption model, which generates candidate questions from full image captions without decomposition, comparable to a linear search checking one image at a time; a random question model, which selects questions randomly instead of using the expected information gain objective; and a binary search algorithm, which serves as an upper-bound "oracle" unfettered by the expressivity of real language, randomly halving the set of potential target images at each step of the interaction rather than posing a natural language question. We evaluate these models on a total of 1,000 games sampled from each dataset using contexts of size k = 10 images.
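The binary search oracle can be illustrated with a short simulation. This is a sketch under assumptions rather than the evaluation code used in the experiments: images are represented by integer indices, the function name is hypothetical, and odd-sized sets are split into a smaller and a larger part.

```python
import random


def binary_search_oracle(k, target, rng):
    """Randomly halve the candidate set until only the target remains.

    Returns (number_of_steps, remaining_candidate).
    """
    candidates = list(range(k))
    steps = 0
    while len(candidates) > 1:
        rng.shuffle(candidates)
        mid = len(candidates) // 2
        half, rest = candidates[:mid], candidates[mid:]
        # The oracle is told which part contains the target; no language needed.
        candidates = half if target in half else rest
        steps += 1
    return steps, candidates[0]


if __name__ == "__main__":
    rng = random.Random(0)
    steps, guess = binary_search_oracle(k=10, target=7, rng=rng)
    print(steps, guess)
```

With k = 10 the oracle always isolates the target within four steps, which is why it serves as an upper bound for questioners restricted to what natural language questions can express.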
For Shapeworld, we paired our questioner with an artificial answerer constructed to provide ground-truth answers to generated questions (Figure 3, left). Our proposed model outperforms the random baseline as well as the full caption model, which produces questions that are too specific to efficiently narrow the space of potential target images, while only slightly under-performing an upper-bound binary search algorithm. These findings demonstrate the utility of having a question set of varying specificity (via decomposing full captions into NPs) as well as of the expected information gain objective, which adapts question selection to the model's current knowledge. For MS COCO, we constructed an artificial answerer that uses a simple heuristic, as ground-truth answers are not readily available for MS COCO. This answerer responds "yes" if a question is generated from the target image, and "no" otherwise (Figure 3, middle). We again see that our model greatly outperforms a random questioner, but it outperforms the full caption baseline to a lesser extent than we observed on Shapeworld. The larger gap between our model and binary search also indicates significant room for improvement. One possible explanation for this gap is the difficulty of finding attributes that appropriately "split" a random set of natural images. To evaluate performance when a clear division of the image set is expressible in natural language, we created an alternative test set in which the 10 images in the context were balanced across two categories in COCO (e.g., five "motorcycles" and five "baseballs"). We found that the model was indeed better able to divide the image set when we guaranteed that some high-level cut between the images existed (Table 1). When models were given an initial description of the target image before asking any questions (Figure 3, right), questions were still useful, improving accuracy by 6% over the caption alone.
Extension to wh-questions. While our main results use polar questions exclusively, our framework has the potential to be extended to more general wh-questions. Using wh-movement rules, we can derive questions from image captions that ask about more abstract properties of objects within images (e.g., given the caption "three men holding surfboards on a beach" we can straightforwardly derive questions like: "How many men are there?", "Where are the men?", or "What are the men holding?"). To illustrate this extension, we provide preliminary results for simple 'what' questions. We generate these questions by identifying instances of noun phrases followed by verb phrases in captions and transforming these into a set of 'what' questions with single-word answers. We extract the noun (NN) and verb (VBG) from their respective phrases, then produce questions of the form 'What is the <NN> <VBG>?'. To accommodate these questions in our model, we simply modified our answer classifier to produce a probability distribution over the entire vocabulary (rather than a binary yes-no). By incorporating 'what' questions into our framework, we see an improvement of almost 3% after 20 questions are asked (Table 2).
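The 'what' question template can be illustrated with a small sketch. It assumes the noun phrase and verb phrase have already been identified and POS-tagged (here written by hand), and it takes the first NN and first VBG token in each phrase. The function name and the example caption are hypothetical, and how the single-word answer is extracted is not covered.

```python
from typing import List, Optional, Tuple

Tagged = List[Tuple[str, str]]  # (word, POS tag) pairs


def what_question(np_tokens: Tagged, vp_tokens: Tagged) -> Optional[str]:
    """Fill the template 'What is the <NN> <VBG>?' from an NP followed by a VP."""
    noun = next((word for word, tag in np_tokens if tag == "NN"), None)
    verb = next((word for word, tag in vp_tokens if tag == "VBG"), None)
    if noun is None or verb is None:
        return None  # the template does not apply to this NP/VP pair
    return f"What is the {noun} {verb}?"


if __name__ == "__main__":
    # Hand-tagged phrases from the hypothetical caption "a man holding a surfboard".
    np_tokens = [("a", "DT"), ("man", "NN")]
    vp_tokens = [("holding", "VBG"), ("a", "DT"), ("surfboard", "NN")]
    print(what_question(np_tokens, vp_tokens))
```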

Interactive human experiments
We ran two experiments to evaluate our question generation model in interactions with real human partners. We recruited a total of 40 participants from Amazon Mechanical Turk to play 10 games each, in which our model asked questions until the entropy of the belief distribution over images fell below 1.0 or until 20 questions were asked. Participants were prompted to give either a "yes", "no", or "N/A" response to each question. In the first human experiment, games were sampled from the same 1,000 MS COCO games used for synthetic evaluation (Table 3). We found that our question-asking model was able to improve target selection accuracy when paired with a human answerer, suggesting that our model's questions are human-interpretable and that human answers are effective for target selection.
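The stopping rule used in these games can be sketched as a small predicate. The text does not state the unit of entropy, so this sketch assumes bits (log base 2). The function and parameter names are hypothetical.

```python
import math


def belief_entropy_bits(belief):
    """Shannon entropy of the belief distribution, in bits (an assumed unit)."""
    return -sum(p * math.log2(p) for p in belief if p > 0)


def should_stop(belief, questions_asked, entropy_threshold=1.0, max_questions=20):
    """Stop when the belief is confident enough or the question budget is spent."""
    if questions_asked >= max_questions:
        return True
    return belief_entropy_bits(belief) < entropy_threshold


if __name__ == "__main__":
    uniform = [0.1] * 10
    peaked = [0.9] + [0.1 / 9] * 9
    print(should_stop(uniform, 0), should_stop(peaked, 3), should_stop(uniform, 20))
```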
Our second human experiment examines the more challenging case of asking "clarification questions" in a referential setting. In this experiment, we used larger contexts of k = 25 images sampled from the MS COCO test set, and human participants were prompted to give a description of the target to initiate the interaction. Our model formed (uncertain) beliefs based on this initial utterance and proceeded to ask clarification questions, which we found improved accuracy by 16.5% over the image description alone (see Table 4).

Conclusions
We introduce a question generation framework capable of producing open-domain clarification questions. Instead of relying on specialized question-answer training data or pre-specified question meanings, our model uses a pretrained image captioner in conjunction with expected information gain to produce informative questions for unseen images. We demonstrate the effectiveness of this method in a question-driven communication game with synthetic and human answerers. We found it important to generate questions varying in specificity by decomposing captioner utterances into component noun phrases. Having generated this set of potential questions, selecting based on estimated information gain yielded useful questions. Without seeing question examples, our framework demonstrates a capacity for generating effective clarification questions.
Future research should aim to generate more diverse question sets, allow for more expressive answers, and address abstract properties of objects within images. One approach, as demonstrated by our preliminary work with 'what'-questions, would be to extend our framework to incorporate additional types of wh-questions. Integrating this clarification capacity more fully into collaborative, goal-directed dialog agents will allow them to engage in cooperative repair.