Looking for Confirmations: An Effective and Human-Like Visual Dialogue Strategy

Generating goal-oriented questions in Visual Dialogue tasks is a challenging and longstanding problem. State-of-the-art systems have been shown to generate questions that, although grammatically correct, often lack an effective strategy and sound unnatural to humans. Inspired by the cognitive literature on information search and cross-situational word learning, we design Confirm-it, a model based on a beam search re-ranking algorithm that guides an effective goal-oriented strategy by asking questions that confirm the model’s conjecture about the referent. We take the GuessWhat?! game as a case-study. We show that dialogues generated by Confirm-it are more natural and effective than beam search decoding without re-ranking.


Introduction
Substantial progress has been made on multimodal conversational systems thanks to the introduction of the Encoder-Decoder framework (Sutskever et al., 2014). The success of these systems can be measured by evaluating them on task-oriented referential games. Despite the high task success achieved and the apparent linguistic well-formedness of the single questions, the quality of the generated dialogues, according to surface-level features, has been shown to be poor; this holds for systems based on both greedy and beam search (e.g. Shekhar et al. (2019); Zarrieß and Schlangen (2018); Murahari et al. (2019)). Testoni and Bernardi (2021a) found that when taking these surface-level features as a proxy of linguistic quality, the latter does not correlate with task success, and the authors point to the importance of studying deeper features of dialogue structures. We aim to develop a multimodal model able to generate dialogues that resemble human dialogue strategies.
Cognitive studies show that humans do not always act as "rational" agents. When referring to objects, they tend to be overspecific and prefer properties irrespective of their utility for identifying the referent (Gatt et al., 2013); when searching for information or when learning a language, they tend to follow confirmation-driven strategies. Modelling such behaviour in language learning, Medina et al. (2011) and Trueswell et al. (2013) propose a procedure in which a single hypothesized word-referent pair is maintained across learning instances, and it is abandoned only if the subsequent instance fails to confirm the pairing. Inspired by these theories, we propose a model, Confirm-it, which generates questions driven by the agent's confirmation bias.
Take the example of a referential guessing game in which an agent has to ask questions to guess an object in a given image. Confirm-it will ask questions that reinforce its beliefs about which object is the target, until proven otherwise. For instance, in Figure 1, after learning that the target is a living entity (turn 1), the agent conjectures the target is the dog on the right of the picture (though in principle, it could have been any of the candidates). Hence, the decoder generates the question that would let it confirm this belief, "is it a dog?". If its expectations are not met (viz., it receives a negative answer to that question, turn 2b), it moves its attention to another candidate object. We do not claim that our choice represents the optimal strategy to play the game, but we believe that it makes the generated dialogue more human-like.
To evaluate this strategy, we take as a test-bed GuessWhat?! (de Vries et al., 2017), a two-player game between a Questioner that has to guess the target, and an Oracle (called "external Oracle" in the following) who is aware of the target. The widely used architecture of the Questioner, GDSE, jointly trains a Question Generator (QGen) and a Guesser (Shekhar et al., 2019). We augment this architecture with a module that simulates an internal Oracle. Being an "internal" Oracle, at test time this agent does not know what the target object is: while at training time it learns to answer questions by receiving the gold standard datapoint (the question, the actual target, and the human answer), at test time it assumes the target is the candidate object to which the Guesser assigns the highest probability. Hence, the three modules of the Questioner straightforwardly cooperate with one another. The internal Oracle guides the QGen to ask questions that reinforce the Guesser's beliefs. Concretely, at training time, through Supervised Learning (SL) the QGen learns to ask human-like questions turn-by-turn, the internal Oracle to answer them, and the Guesser to guess the target object once the dialogue ends. At test time, we implement a beam search re-ranking algorithm that simulates the single-conjecture learning strategy used by humans: among the questions the QGen generates via beam search, the algorithm promotes the question whose answer (obtained via the internal Oracle, which receives the candidate with the highest probability as the target) most increases the model's confidence in its hypothesis about the target.
We run both quantitative and qualitative analyses, and evaluate the effectiveness of the dialogue strategy by asking human annotators to guess the target object given the dialogues generated by Confirm-it. We compare the results obtained with dialogues generated by Confirm-it when using the re-ranking algorithm and when outputting the question proposed by plain beam search. We show that the task accuracy of both the conversational agent and human subjects increases when receiving the dialogues generated by the Confirm-it re-ranking algorithm.

Related Work
For open-ended language generation, Holtzman et al. (2020) claim that decoding strategies that optimize for output with high probability (like beam search) lead to highly deteriorated texts, since the highest scores are often assigned to generic, incoherent, and repetitive sequences. Several works propose re-ranking strategies on the set of hypotheses produced by the beam search following different criteria (Dušek and Jurčíček, 2016; Blain et al., 2017; Agarwal et al., 2018; Borgeaud and Emerson, 2020; Hargreaves et al., 2021) to improve both the performance on a given task and the quality of the output. In this work, we present a cognitively-inspired re-ranking technique for a visual dialogue questioner agent.
In visual dialogue systems, the quality of the output has been improved mainly by aiming at reducing repetitions in the output. This goal has been achieved through Reinforcement Learning by adding auxiliary objective functions (Murahari et al., 2019), intermediate rewards (Zhang et al., 2018), regularized information gain techniques (Shukla et al., 2019), or intermediate probabilities with an attention mechanism (Pang and Wang, 2020). Different from these works, we do not use the Reinforcement Learning paradigm and, instead of focusing on improving surface-level features, we indirectly operate on the dialogue structure.
Ruggeri and Lombrozo (2015) studied the way children and young adults search for information while asking yes-no questions given a set of candidate hypotheses. The authors found that when prior knowledge favours some hypotheses over others, participants asked more hypothesis-scanning questions (i.e., questions that are tentative solutions, with a specific hypothesis that is directly tested). This is in line with the observation in Baron (2000) that humans phrase questions to receive an affirmative answer that supports their theory, and with the broader finding in Wason (1960) that they tend to select the information that is in accord with their prior beliefs. Inspired by these studies, we propose a new dialogue strategy for playing referential guessing games by exploiting the probabilities assigned by the Guesser module to different candidate objects.

Task and Dataset
GuessWhat?! (de Vries et al., 2017) is an asymmetric game involving two human participants who see a real-world image from MSCOCO (Lin et al., 2014). One of the participants (the Oracle) is secretly assigned a target object in the image, while the other participant (the Questioner) has to guess it by asking binary (Yes/No) questions to the Oracle. The GuessWhat?! dataset consists of more than 150k human-human English dialogues containing on average 5.3 questions per dialogue.

Model and Re-ranking Strategy
Our model, Confirm-it, builds on GDSE (Shekhar et al., 2019). In the latter, the hidden state representation produced by a multimodal encoder is used to jointly train the question generator (QGen) and the Guesser module. The image is encoded with a ResNet-152 network (He et al., 2016) and the dialogue history is encoded via an LSTM network. QGen uses greedy search to generate questions. To this multi-tasking setting, Confirm-it adds an internal Oracle trained to answer the question at each turn. Moreover, it relies on beam search and, at inference time, it goes through a re-ranking phase which simulates the single-conjecture learning strategy. The model architecture is provided in the Supplementary Material (SM) and the algorithm is spelled out below.
Algorithm 1 describes the beam search re-ranking algorithm used by Confirm-it to promote the generation of an effective dialogue strategy. Given an image, a set of candidate objects, a target object o_t, and a beam size of B, at each dialogue turn the model predicts a probability distribution over the set of candidate objects given the current dialogue history. The candidate that receives the highest probability is considered the model's hypothesis c_h. The QGen outputs B questions, ordered by their probability. Each of these questions is answered by the model's internal Oracle, which receives c_h as the target object. Among these B questions, Confirm-it selects the question Q that, paired with the answer provided by the internal Oracle, most increases the model's confidence in c_h, measured as the probability assigned by the Guesser. The external Oracle (who is aware of the real target object o_t) answers Q, and this new question-answer pair is appended to the dialogue history. In SM we provide a step-by-step example of how Confirm-it works. None of the features of our case-study are crucial for the method to be applied to other tasks: e.g., it does not need the questions to be polar, it does not need the questions to be visually grounded, and it does not need the dialogue to be asymmetrical.
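The selection step for one turn can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `confirm_it_turn`, the callable interfaces for the Guesser and the internal Oracle, the tie-breaking rule (beam order), and the toy components `toy_guesser` and `toy_oracle` are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

History = List[Tuple[str, str]]  # (question, answer) pairs so far


def confirm_it_turn(
    history: History,
    candidates: Sequence[str],
    beam_questions: Sequence[str],
    guesser: Callable[[History], Dict[str, float]],
    internal_oracle: Callable[[str, str], str],
) -> Tuple[str, str]:
    """Return (selected question Q, hypothesis c_h) for one dialogue turn."""
    # c_h: the candidate the Guesser currently finds most probable.
    probs = guesser(history)
    c_h = max(candidates, key=lambda c: probs[c])

    best_q, best_conf = beam_questions[0], float("-inf")
    for q in beam_questions:  # assumed ordered by QGen probability
        answer = internal_oracle(q, c_h)  # answered as if c_h were the target
        conf = guesser(history + [(q, answer)])[c_h]
        if conf > best_conf:  # strict ">" keeps beam order on ties (assumption)
            best_q, best_conf = q, conf
    return best_q, c_h


# Hypothetical toy components, for illustration only.
PRIOR = {"dog": 0.5, "cat": 0.3, "chair": 0.2}


def toy_guesser(history: History) -> Dict[str, float]:
    # Reweight a candidate each time a question names it, then normalise.
    weights = dict(PRIOR)
    for question, answer in history:
        for name in weights:
            if name in question:
                weights[name] *= 4.0 if answer == "yes" else 0.5
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def toy_oracle(question: str, target: str) -> str:
    # Answers "yes" only when the question names the assumed target.
    return "yes" if target in question else "no"


beam = ["is it an animal?", "is it a cat?", "is it a dog?"]
selected, hypothesis = confirm_it_turn([], list(PRIOR), beam, toy_guesser, toy_oracle)
print(selected, "| hypothesis:", hypothesis)
```

In this toy setting the third beam question is promoted over the first, because it is the one whose simulated answer raises the probability of the current hypothesis the most. In the full game loop, the selected question would then be answered by the external Oracle and the pair appended to the history before the next turn.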
Implementation details. For the multi-task training, we adopt the modulo-n training proposed in Shekhar et al. (2019), i.e. we train the Oracle and Guesser modules every n (=7) epochs of the QGen. At inference time, we use a beam size of 3 and let the model generate dialogues of 5 turns.
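One possible reading of the modulo-n schedule is sketched below: the QGen is updated at every epoch, and the Oracle and Guesser are additionally updated at every n-th epoch. The function name `modulo_n_schedule` and the module labels are hypothetical, and the exact interleaving used in Shekhar et al. (2019) may differ from this assumption.

```python
from typing import List


def modulo_n_schedule(num_epochs: int, n: int = 7) -> List[List[str]]:
    """List, for each epoch, the modules updated under one reading of modulo-n."""
    schedule = []
    for epoch in range(1, num_epochs + 1):
        modules = ["qgen"]  # assumption: QGen is trained at every epoch
        if epoch % n == 0:  # assumption: the others join every n-th epoch
            modules += ["oracle", "guesser"]
        schedule.append(modules)
    return schedule


for epoch, modules in enumerate(modulo_n_schedule(14), start=1):
    print(epoch, modules)
```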

Results
We study to what extent the re-ranking phase lets the model generate more effective and more natural dialogues. To this end, we evaluate the Confirm-it task-accuracy with and without the re-ranking phase and report qualitative analyses of the dialogues.
Task-accuracy (Table 1). When the model undergoes the re-ranking phase, Confirm-it's accuracy increases by +4.35% with respect to what it achieves when it outputs the question selected by the plain beam search, and by +4.8% with respect to greedy search. Note that, instead, randomly re-ranking the set of questions lowers the performance. This result shows that confirmation-driven strategies help generate more effective dialogues. Interestingly, our re-ranking method does not require additional training compared to the SL paradigm.
Table 2: Human Accuracy refers to the task accuracy achieved by human annotators when receiving dialogues generated by the plain Beam Search, Confirm-it re-ranking, or the original dialogues produced by human players from the GuessWhat?! test set. The other columns report relevant statistics of the dialogues: percentage of games with at least one question repeated verbatim, hallucination rate (CHAIR-s), percentage of positive answers in the final turn (% yes Last Turn), and percentage of consecutive questions not seen at training time (lexical overlap, % novel q_{t-1}, q_t per dialogue).
More Effective Dialogues. To verify whether the improvement of Confirm-it is really due to the generation of more effective dialogues to solve the guessing task, we asked human subjects to guess the target given a dialogue. We sampled 500 games from the GuessWhat?! test set containing fewer than 7 candidate objects. Each participant played 150 games equally divided among dialogues generated by the model with the plain beam search, with our re-ranking strategy, and by the GuessWhat?! human players (taken from the original test set). We made sure no participant played the same game more than once. In total, 10 English proficient volunteers within our organization joined the experiment. As we can see from Table 2, human annotators reach an accuracy of 70.8% in identifying the target object when receiving dialogues generated by beam search and 77% with Confirm-it, suggesting that the re-ranking phase lets the model generate more effective dialogues. The accuracy that the annotators achieve when playing the game with dialogues extracted from the original GuessWhat?! test set (and thus generated by human players) is much higher (96%). Figure 2 reports a sample game that illustrates the difference between a dialogue generated by human players, one generated by the plain beam search, and one by our re-ranking algorithm. The dialogue generated by beam search contains a repetition ("is it a glass?"), asks about entities not present in the image ("chair" and "glasses"), and ends with a non-conclusive, negatively answered question. These features contribute to making the dialogues sound unnatural. We check whether the re-ranking phase helps our model to get closer to human dialogues with respect to these features. To this end, we compute the percentage of games with repeated questions and with the last turn containing a positively answered question.
Moreover, we employ CHAIR to measure the percentage of hallucinated entities in a sequence; the metric was originally proposed in Rohrbach et al. (2018) for image captioning and recently applied also to GuessWhat?! (Testoni and Bernardi, 2021b). CHAIR-s is defined as the number of dialogues with at least one hallucinated entity divided by the total number of dialogues. As we can see from Table 2, compared to the beam search setting, Confirm-it produces fewer games with at least one repeated question (-8.14%), fewer games with hallucinated entities (-2.61%, see footnote 3), and more games with the last turn containing a positively answered question (76.68% vs. 71.87%). The reduced number of hallucinations is a direct consequence of the Confirm-it strategy: following up on a single object through the dialogue, the model is less likely to engage in spurious exchanges on irrelevant objects. Though this strategy continuously looks for confirmations, it is worth noting that it does not increase the number of repetitions, which instead are significantly reduced. This is an interesting property emerging from the interplay between the internal Oracle and the re-ranking strategy, which suggests that asking the very same question more than once in a dialogue does not increase the model's confidence in its hypothesis.
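The three dialogue statistics discussed above can be sketched as follows. This is an illustrative sketch, not the evaluation code used for Table 2: the function names and the data layout are hypothetical, answers are assumed to be the strings "yes"/"no", and the sets of entities mentioned in each dialogue and of objects present in each image are assumed to have been extracted beforehand.

```python
from typing import List, Set, Tuple

Dialogue = List[Tuple[str, str]]  # (question, answer) pairs


def pct_games_with_repetition(dialogues: List[Dialogue]) -> float:
    """Percentage of games with at least one question repeated verbatim."""
    hits = sum(1 for d in dialogues if len({q for q, _ in d}) < len(d))
    return 100.0 * hits / len(dialogues)


def pct_yes_last_turn(dialogues: List[Dialogue]) -> float:
    """Percentage of games whose final question is answered positively."""
    hits = sum(1 for d in dialogues if d and d[-1][1] == "yes")
    return 100.0 * hits / len(dialogues)


def chair_s(mentioned: List[Set[str]], present: List[Set[str]]) -> float:
    """Percentage of dialogues with at least one hallucinated entity.

    mentioned[i]: entities named in dialogue i (assumed pre-extracted).
    present[i]:   objects actually in the image of dialogue i.
    """
    hits = sum(1 for m, p in zip(mentioned, present) if m - p)
    return 100.0 * hits / len(mentioned)


# Toy data, invented for illustration.
games = [
    [("is it a glass?", "no"), ("is it a chair?", "no"), ("is it a glass?", "no")],
    [("is it a person?", "yes"), ("is it the one on the left?", "yes")],
]
mentioned = [{"glass", "chair"}, {"person"}]
present = [{"glass", "table"}, {"person"}]

print(pct_games_with_repetition(games))
print(pct_yes_last_turn(games))
print(chair_s(mentioned, present))
```

In the toy data, the first game repeats a question, mentions an entity ("chair") that is not in the image, and ends on a negative answer, so each of the three statistics counts exactly one of the two games.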

More Natural Dialogues
Qualitative Analysis of the Strategy. We also evaluated the strategy followed by Confirm-it by looking at the model's decisions throughout the dialogue. Interestingly, the model does not select only questions for which it expects a positive answer, though they are the majority (67%). See the SM for a game in which the re-ranking promoted a question answered negatively by the internal Oracle. Moreover, though the model looks for confirmations, it properly updates its beliefs when disconfirmed: when the model receives from its interlocutor an answer different from the one it expects (based on its internal Oracle), in 70% of the cases the Guesser changes the probabilities over the candidates accordingly, i.e., it assigns the highest probability to a new candidate object. Finally, the use of a human-like strategy does not imply having learned to simply mimic human dialogues from the training set: the re-ranking shows an absolute increase of +12% in the number of pairs of consecutive questions not seen during training (see Table 2).
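Two of the quantities in this analysis can be sketched as follows, under stated assumptions. The function names and the record layout are hypothetical; averaging the share of novel pairs per dialogue (rather than pooling all pairs) is one reading of the "per dialogue" wording in Table 2, and the real computation may differ.

```python
from typing import List, Tuple


def pct_novel_consecutive_pairs(
    generated: List[List[str]], training: List[List[str]]
) -> float:
    """Mean per-dialogue percentage of (q_{t-1}, q_t) pairs unseen in training."""
    seen = {pair for qs in training for pair in zip(qs, qs[1:])}
    shares = []
    for qs in generated:
        pairs = list(zip(qs, qs[1:]))
        if pairs:  # dialogues with a single question have no pairs
            shares.append(sum(p not in seen for p in pairs) / len(pairs))
    return 100.0 * sum(shares) / len(shares)


# One record per turn: (expected answer from the internal Oracle,
# answer from the external Oracle, hypothesis before, hypothesis after).
TurnRecord = Tuple[str, str, str, str]


def pct_belief_change_when_disconfirmed(turns: List[TurnRecord]) -> float:
    """Among disconfirmed turns, percentage where the top candidate changes."""
    disconfirmed = [t for t in turns if t[0] != t[1]]
    changed = sum(1 for t in disconfirmed if t[2] != t[3])
    return 100.0 * changed / len(disconfirmed)


# Toy data, invented for illustration.
training = [["a", "b", "c"]]
generated = [["a", "b", "d"], ["x", "y"]]
turns = [
    ("yes", "no", "dog", "cat"),
    ("yes", "no", "dog", "dog"),
    ("yes", "yes", "dog", "dog"),
]

print(pct_novel_consecutive_pairs(generated, training))
print(pct_belief_change_when_disconfirmed(turns))
```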

Discussion and Conclusion
In this paper, we propose Confirm-it, a multimodal conversational model based on a decoding strategy inspired by cognitive studies of human behavior. We show that, through the proposed beam search re-ranking algorithm, our model generates dialogues that are more effective (based on task-accuracy) and more natural (based on the dialogue features discussed above). We believe further improvement could be obtained by increasing the performance of every single module. Moreover, the structure of the generated dialogues remains to be analysed, and we agree with van der Lee et al. (2021) that a proper evaluation should involve humans. In future work, our method can be easily extended to other task-oriented dialogue tasks which involve a conversational agent, as long as it has a module that generates questions and a module that performs a classification task. Depending on the task at hand, different ways to take intermediate probabilities into account can be designed, but the core idea of the method would not change.
Footnote 3: [...] total number of objects mentioned. CHAIR-i results: 18.28 (Beam Search), 15.02 (Confirm-it), 4.11 (Human Dialogues).

A Supplementary Material
Section 4 of the paper describes the Confirm-it model and Figure 3 shows its architecture. In Section 5 (Qualitative analysis of the strategy), we highlight that Confirm-it does not select only questions for which it expects a positive answer, as shown in Figure 5. In this case, given the dialogue history H, the model's hypothesis c_h (the candidate that receives the highest probability according to the Guesser module), and a set of questions q_1, q_2, q_3 ordered by their probability according to beam search, the question that helps the most (answered by the internal Oracle taking c_h as the target) is q_2. Figure 6 illustrates a step-by-step example of how Confirm-it works.
Figure 5: Given an image and the dialogue history H, Confirm-it assigns the highest probability to c_h (marked in yellow). Beam search generates three questions for the follow-up turn (ordered by their probability): thanks to its internal Oracle, the model answers each of these questions by taking c_h as the target. Confirm-it selects q_2 (which receives a negative answer according to the internal Oracle) as the question that helps the most in confirming c_h.
Human Annotation Evaluation. Figure 4 shows the annotation schema used by human participants in our study, as described in Section 5. The participants in this study are English proficient volunteers within our organization. Each participant is instructed on the guessing task by playing some trial games. The participant is admitted to the annotation only if he/she shows a clear understanding of the task. Given an image, a dialogue and a set of candidate objects with colour-matching boxes, participants express their guess by typing the number corresponding to the box of the selected candidate. Dialogues generated by human annotators from the GuessWhat?! test set, by Confirm-it and by beam search without re-ranking were presented in random order.

Figure 6: Step-by-step illustration of how Confirm-it works.