Are Current Decoding Strategies Capable of Facing the Challenges of Visual Dialogue?

Decoding strategies play a crucial role in natural language generation systems. They are usually designed and evaluated in open-ended text-only tasks, and it is not clear how different strategies handle the numerous challenges that goal-oriented multimodal systems face (such as grounding and informativeness). To answer this question, we compare a wide variety of different decoding strategies and hyper-parameter configurations in a Visual Dialogue referential game. Although none of them successfully balance lexical richness, accuracy in the task, and visual grounding, our in-depth analysis allows us to highlight the strengths and weaknesses of each decoding strategy. We believe our findings and suggestions may serve as a starting point for designing more effective decoding algorithms that handle the challenges of Visual Dialogue tasks.


Introduction
The last few years have witnessed remarkable progress in developing efficient generative language models. The choice of the decoding strategy plays a crucial role in the quality of the output (see Zarrieß et al. (2021) for an exhaustive overview). It should be noted that decoding strategies are usually designed for and evaluated in text-only settings. The most-used decoding strategies can be grouped into two main classes. On the one hand, decoding strategies that aim to generate text that maximizes likelihood (like greedy and beam search) are shown to generate generic, repetitive, and degenerate output. Zhang et al. (2021) refer to this phenomenon as the likelihood trap, and provide evidence that these strategies lead to sub-optimal sequences. On the other hand, stochastic strategies like pure sampling, top-k sampling, and nucleus sampling (Holtzman et al., 2020) increase the variability of generated texts by taking random samples from the model. However, this comes at the cost of generating words that are not semantically appropriate for the context in which they appear. Recently, Meister et al. (2022) used an information-theoretic framework to propose a new decoding algorithm (typical decoding), which samples tokens with an information content close to their conditional entropy. Typical decoding shows promising results in human evaluation experiments but, given its recent release, it is not clear yet how general this approach is.
Multimodal vision & language systems have recently received a lot of attention from the research community, but a thorough analysis of different decoding strategies in these systems has not been carried out. Thus, the question arises of whether the above-mentioned decoding strategies can handle the challenges of multimodal systems, i.e., generate text that not only takes into account lexical variability, but also grounding in the visual modality. Moreover, in goal-oriented tasks, the informativeness of the generated text plays a crucial role as well. To address these research questions, in this paper we take a referential visual dialogue task, GuessWhat?! (De Vries et al., 2017), where two players (a Questioner and an Oracle) interact so that the Questioner identifies the secret object assigned to the Oracle among the ones appearing in an image (see Figure 1 for an example). Apart from well-known issues, such as repetitions in the output, this task poses specific challenges for evaluating decoding techniques compared to previous work. On the one hand, the generated output has to be coherent with the visual input upon which the conversation takes place. As highlighted by Rohrbach et al. (2018) and Testoni and Bernardi (2021b), multimodal generative models often generate hallucinated entities, i.e., tokens that refer to entities that do not appear in the image upon which the conversation takes place. On the other hand, the questions must be informative, i.e., they must help the Questioner to incrementally identify the target object.
We show that the choice of the decoding strategy and its hyper-parameter configuration heavily affects the quality of the generated output. Our results highlight the specific strengths and weaknesses of decoding strategies that aim at generating sequences with the highest probability vs. strategies that randomly sample words. We find that none of the decoding strategies currently available is able to balance task accuracy and linguistic quality of the output. However, we also show which strategies perform better at important challenges, such as incremental dialogue history, human evaluation, hallucination rate, and lexical diversity. We believe our work may serve as a starting point for designing decoding strategies that take into account all the challenges involved in Visual Dialogue tasks.

Task & Dataset
GuessWhat?! (De Vries et al., 2017) is a simple object identification game in English where two participants see a real-world image from MSCOCO (Lin et al., 2014) containing multiple objects. One player (the Oracle) is secretly assigned one object in the image (the target) and the other player (the Questioner) has to guess it by asking a series of binary yes-no questions to the Oracle. The task is considered to be successful if the Questioner identifies the target. The dataset for this task was collected from human players via Amazon Mechanical Turk. The authors collected 150K dialogues with an average of 5.3 binary questions per dialogue. Figure 1 shows an example of a GuessWhat game from the dataset.

Model and Decoding Strategies
We use the model and pre-trained checkpoints of the Questioner agent made available by Testoni and Bernardi (2021c) for the GuessWhat?! task. This model is based on the GDSE architecture (Shekhar et al., 2019). It uses a ResNet-152 network (He et al., 2016) to encode the images and an LSTM network to encode the dialogue history. A multimodal shared representation is generated and then used to train both the question generator (which generates a follow-up question given the dialogue history) and the Guesser module (which selects the target object among a list of candidates at the end of the dialogue) in a joint multi-task learning fashion.
We analyse the effect of a large number of decoding strategies as well as hyper-parameter configurations for each strategy: as highlighted by Zhang et al. (2021), it is crucial to evaluate different hyper-parameter configurations when comparing multiple decoding strategies. Among the ones that maximize the likelihood of the sequence, we consider plain beam search (with a beam size of 3) and greedy search. We also consider Confirm-it, the cognitively-inspired beam search re-ranking strategy proposed in Testoni and Bernardi (2021c) for promoting the generation of questions that aim at confirming the model's intermediate conjectures about the target. This strategy re-ranks the set of candidate questions from beam search and selects the one that helps the most in confirming the model's hypothesis about the target. As for stochastic strategies, we analyse pure sampling, top-k sampling (with different k values), and nucleus sampling (with different p values), a strategy proposed in Holtzman et al. (2020) which selects the highest-probability tokens whose cumulative probability mass exceeds a given threshold p. We also consider typical decoding (with different τ values), a recently proposed strategy (Meister et al., 2022) based on an information-theoretic framework. We refer to the respective papers for additional details on decoding strategies. We let the model generate 5 questions at test time and average the results over five random seeds.
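As an illustration, the three stochastic strategies above can all be seen as different ways of truncating the next-token distribution before sampling. The sketch below is a simplified pure-Python version (function names and the renormalization step are our own; it is not the paper's implementation) showing only the filtering step of each:

```python
import math

def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero out the rest, renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def nucleus_filter(probs, p):
    """Keep the smallest set of most-probable tokens whose cumulative
    probability mass reaches p (Holtzman et al., 2020), renormalized."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = set(), 0.0
    for i in order:
        keep.add(i)
        mass += probs[i]
        if mass >= p:
            break
    return [probs[i] / mass if i in keep else 0.0 for i in range(len(probs))]

def typical_filter(probs, tau):
    """Keep tokens whose surprisal is closest to the conditional entropy,
    up to cumulative mass tau (Meister et al., 2022), renormalized."""
    entropy = -sum(q * math.log(q) for q in probs if q > 0)
    order = sorted(
        range(len(probs)),
        key=lambda i: abs(-math.log(probs[i]) - entropy)
        if probs[i] > 0 else float("inf"),
    )
    keep, mass = set(), 0.0
    for i in order:
        keep.add(i)
        mass += probs[i]
        if mass >= tau:
            break
    return [probs[i] / mass if i in keep else 0.0 for i in range(len(probs))]
```

Sampling then proceeds from the filtered distribution; greedy search corresponds to the limit k = 1, and pure sampling to leaving the distribution untouched.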

Metrics
We are interested in evaluating different decoding strategies against a set of metrics that reflect the complexity of the different skills required to successfully solve multimodal referential games.
Linguistic Quality: We compute the percentage of games with at least one repeated question, the overall number of unique words used by the model and, in line with the observations in Testoni and Bernardi (2021a), the number of rare words generated by the model, defined as those words that appear fewer than 20 times in the training set.
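These three lexical metrics are straightforward to compute; the following sketch shows one way to do so (the function name and data layout, i.e., games as lists of question strings and a training-set word-frequency map, are our own assumptions, not taken from the paper's code):

```python
def linguistic_quality(dialogues, train_counts, rare_threshold=20):
    """dialogues: list of games, each a list of question strings.
    train_counts: mapping from word to its frequency in the training set."""
    # A game counts as repetitive if any question appears more than once.
    repeated = sum(1 for game in dialogues if len(set(game)) < len(game))
    # Vocabulary: all distinct (lowercased, whitespace-split) tokens used.
    vocab = {w for game in dialogues for q in game for w in q.lower().split()}
    # Rare words: generated words seen fewer than rare_threshold times in training.
    rare = {w for w in vocab if train_counts.get(w, 0) < rare_threshold}
    return {
        "pct_games_with_repetition": 100.0 * repeated / len(dialogues),
        "vocab_size": len(vocab),
        "num_rare_words": len(rare),
    }
```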
Visual Grounding: To quantify the rate of object hallucination in the generated dialogues, we compute the CHAIR metric (Rohrbach et al., 2018; Testoni and Bernardi, 2021b). This metric, originally proposed for image captioning, detects hallucination by checking each object mentioned in a generated image caption against the ground-truth MSCOCO objects for that image. The metric consists of two distinct variants: CHAIR-i, or per-instance variant (number of hallucinated objects divided by the total number of objects mentioned in each dialogue), and CHAIR-s, or per-sentence variant (number of dialogues with at least one hallucination divided by the total number of dialogues).
Informativeness: To study the informativeness of the generated questions, we report the raw accuracy of the model in guessing the target object after each dialogue turn and at the end of the dialogue. A game is considered successful if the model identifies the target object assigned to the Oracle. Similarly, we also report the accuracy of human annotators when guessing the target by reading machine-generated dialogues.
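The two CHAIR variants reduce to simple counting once object mentions have been extracted. The sketch below is a minimal illustration under assumed inputs (lists of mentioned-object strings per dialogue and sets of ground-truth MSCOCO objects per image); the full CHAIR implementation also maps synonyms onto MSCOCO categories, which we omit here:

```python
def chair_scores(dialogues, gt_objects):
    """dialogues: per-dialogue lists of object words mentioned in the text.
    gt_objects: per-image sets of ground-truth MSCOCO object categories."""
    total_mentions = 0
    hallucinated_mentions = 0
    dialogues_with_hallucination = 0
    for mentioned, gt in zip(dialogues, gt_objects):
        bad = [obj for obj in mentioned if obj not in gt]  # hallucinated objects
        total_mentions += len(mentioned)
        hallucinated_mentions += len(bad)
        if bad:
            dialogues_with_hallucination += 1
    # CHAIR-i: hallucinated mentions over all object mentions.
    chair_i = hallucinated_mentions / max(total_mentions, 1)
    # CHAIR-s: dialogues with at least one hallucination over all dialogues.
    chair_s = dialogues_with_hallucination / max(len(dialogues), 1)
    return chair_i, chair_s
```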

Quantitative Results
Table 1 shows the performance of different decoding strategies against accuracy and dialogue quality, as described by the metrics in Section 4. Confirm-it is by far the best decoding strategy in terms of accuracy and hallucination rate. However, it uses a restricted vocabulary compared to other strategies. A similar issue is observed for greedy and beam search. We find nucleus sampling (with a p-value of 0.3, much lower than the one used by the authors in Holtzman et al. (2020)) to effectively increase the lexical variety compared to beam search, without damaging accuracy and hallucination rate. Typical decoding, top-k and pure sampling, instead, clearly decrease repetitions and increase the vocabulary richness by generating tokens that are not related to the source input, as indicated by the high hallucination rate. It thus looks like there exists a trade-off between informativeness / visual grounding and linguistic quality. We study the effect of hyper-parameter configurations in stochastic strategies. Specifically, we try various p-values for nucleus sampling and τ-values for typical decoding. As shown in Figure 2, both typical and nucleus sampling peak in accuracy with the parameter configurations that also lead to the most repetitions and fewest hallucinations. Conversely, both strategies show the lowest accuracy with the highest hallucination rate. These results confirm the detrimental effect of hallucinations on the performance of the model. It is interesting to note the robustness of typical decoding in generating few repetitions regardless of the τ value. In line with the findings in Zhang et al. (2021), this analysis confirms the importance of hyper-parameter configurations and the peculiar trade-off between informativeness, repetitions, and visual grounding: so far it has not been possible to find a single configuration that optimizes all three at the same time.
One crucial ability in GuessWhat?! is asking informative questions that incrementally help in identifying the target: for this reason, we check the accuracy of the model after each new question is asked. Figure 3 shows accuracy per dialogue turn for a set of representative strategies: nucleus sampling (p=0.3), typical decoding (τ=0.7), Confirm-it, and pure sampling. To get a broader picture, we let the model generate 10 questions in this setting. Confirm-it stands out by showing the largest incremental increase of accuracy throughout the dialogue, indicating that it generates more effective follow-up questions. Pure sampling, on the other hand, seems to suffer from the very beginning of the dialogue and its accuracy stabilizes soon. It is worth noting that the accuracy of typical decoding gets closer to that of nucleus sampling towards the end of the dialogue, with the latter leveling off sooner. We conjecture that Confirm-it outperforms other techniques because it takes into account the probability of the Guesser at inference time, so it is guided to generate questions that change these probabilities and thus avoid generic questions.

Human Evaluation
We asked 8 human annotators to guess the target object in a sample of GuessWhat?! games when reading dialogues generated by our model with different decoding strategies. Each participant annotated 100 games (25 per strategy) and the decoding strategy was not revealed during the annotation. As shown in Table 2, humans reach the highest accuracy when reading dialogues generated by Confirm-it, followed by typical decoding and nucleus sampling, while pure sampling falls behind. These results, which do not mirror the accuracy results in Table 1, allow us to disentangle the weaknesses of the Guesser (i.e., the classification module that predicts the target) from the actual informativeness of the dialogues. Compared to the model, human annotators seem to better exploit the lexical richness of typical decoding and nucleus sampling. We refer to the SM for additional information about the annotation procedure, in line with the best-practice guidelines in van der Lee et al. (2021).
Related Work
In the field of multimodal NLG, Zarrieß and Schlangen (2018) propose trainable decoding for referring expression generation. The authors propose a two-stage optimization set-up where a small network processes the RNN's hidden state before passing it to the decoder, using BLEU score as a reward for the decoder. We did not analyse this approach in our paper because we focus only on decoding strategies that do not require any change in the architecture or training of the model. We leave for future work an analysis of trainable decoding approaches. Inspired by the findings in Holtzman et al. (2020), Massarelli et al. (2020) propose a hybrid decoding strategy for open-ended text generation which combines the non-repetitive nature of sampling strategies with the consistency of likelihood-based approaches. The authors show that their approach generates less repetitive and more verifiable text. The design of hybrid decoding strategies for multimodal tasks is out of the scope of this paper, but is an interesting subject to pursue in future work.

Discussion and Conclusion
Decoding algorithms are a key component of natural language generation systems. They are usually designed for and evaluated in text-only tasks. We believe multimodal (vision & language) and goal-oriented tasks pose unique and under-studied challenges to current decoding strategies. In this paper, we ran an in-depth analysis of several decoding strategies (and their hyper-parameter configurations) for a model playing a referential visual dialogue game. We found that decoding algorithms that lead to the highest accuracy in the task and the lowest hallucination rate, at the same time generate highly repetitive text and use a restricted vocabulary. Our analyses reveal the crucial role of hyper-parameter configuration in stochastic strategies, an issue that poses several questions about the trade-off between lexical variety, hallucination rate, and task accuracy. While nucleus sampling partially balances the above-mentioned issues, human annotators seem to better exploit the richness of the dialogues generated by typical decoding. Finally, our results demonstrate that a beam search re-ranking algorithm (Confirm-it) generates more effective follow-up questions throughout the dialogue turns. We believe that taking into account the model's intermediate predictions about the referent, like Confirm-it does, represents a promising direction that should be applied also to stochastic strategies in future work, aiming at preserving their lexical richness while reducing hallucinations.
Our results demonstrate that none of the decoding strategies currently available effectively takes into account both task accuracy and dialogue quality at the same time. We also highlight peculiar features of each strategy that may guide future research with the goal of designing decoding strategies that properly confront the crucial challenges of multimodal goal-oriented dialogues.
Figures 4, 5, and 6 illustrate how hyper-parameter choice affects the accuracy, the hallucinations, and the repetitions. Top-k sampling (Figure 4) shows decreased accuracy and repetitions, and increased hallucinations, as the k-value gets higher. The same general pattern can be observed with the gradual increase of the p-value in nucleus sampling (Figure 6). On the other hand, typical decoding accuracy peaks at τ = 0.7 (Figure 5). This is also the point at which the repetitions are at their highest and the hallucinations are at their lowest. Both very high and very low τ-values cause lower accuracy, fewer repetitions, and an increase of hallucinations.

A.2 Experiments
Table 3 presents our results in detail for all the parameter configurations we considered. We have computed accuracy percentage, CHAIR-i, CHAIR-s, percentage of games with repeated questions, vocabulary size and number of rare words for each decoding method and its respective hyper-parameter configurations. These results are sorted by decreasing accuracy. The 3 best results of each metric are in bold. The annotation was done by 8 human annotators on a sample of GuessWhat?! games. They were recruited within our organization on a voluntary basis.

Figure 1 :
Figure 1: Example of a GuessWhat game from De Vries et al. (2017)

Figure 2 :
Figure 2: Different hyper-parameter values and their effect on the accuracy, hallucinations, and repetitions in typical decoding and nucleus sampling.

Figure 3 :
Figure 3: The accuracy per dialogue turn for four different decoding strategies for dialogues of length 10.

Figure 7 :
Figure 7: Example of the games displayed to the participants for the annotation task. Participants had to select one target object among the list of candidate objects on the right. The machine-generated dialogue is in the red box.

Table 1 :
Comparison between decoding strategies and their best-performing (in terms of accuracy) hyper-parameters. The decoding strategies are sorted by accuracy.

Table 2 :
Human guess accuracy based on dialogues generated by different decoding strategies.