An Empirical Study on the Generalization Power of Neural Representations Learned via Visual Guessing Games

Guessing games are a prototypical instance of the “learning by interacting” paradigm. This work investigates how well an artificial agent can benefit from playing guessing games when later asked to perform on novel NLP downstream tasks such as Visual Question Answering (VQA). We propose two ways to exploit playing guessing games: 1) a supervised learning scenario in which the agent learns to mimic successful guessing games and 2) a novel way for an agent to play by itself, called Self-play via Iterated Experience Learning (SPIEL). We evaluate the ability of both procedures to generalise: an in-domain evaluation shows an increased accuracy (+7.79) compared with competitors on the evaluation suite CompGuessWhat?!; a transfer evaluation shows improved performance for VQA on the TDIUC dataset in terms of harmonic average accuracy (+5.31) thanks to more fine-grained object representations learned via SPIEL.


Background & Related Work
Learning a language requires interacting with both the environment and other agents (Bisk et al., 2020). Language games represent one common example of this (Wittgenstein et al., 1953), as seen by the important role of play in L1 child language acquisition (Hainey et al., 2016) as well as L2 learners (Godwin-Jones, 2014).
Among the language games defined in the literature (Steels, 2015), guessing games represent the first step in a curriculum for language learning. For example, in GuessWhat?! (de Vries et al., 2017), two agents interact with each other: a Questioner generates questions aimed at finding a hidden object in the scene and an Oracle, aware of the target object, answers the questions supporting the Questioner in playing the game. Different from other language games (Das et al., 2017), guessing games have a specific goal which represents a clear incentive for learning. In addition, they require that the Questioner masters both natural language generation and understanding with a focus on object categories and attributes. For humans, concepts learned in this way are generic and generalisable to new tasks and domains where grounded reasoning is important (Hampton, 1979). However, how well can AI agents generalise with concepts acquired from visual guessing games?
The literature has not explored whether representations built from self-play are transferable, focusing instead on large-scale self-supervised learning. For instance, large-scale image captioning datasets have been used to train multi-modal Transformers (Lu et al., 2019; Tan and Bansal, 2019; Chen et al., 2019). Multi-task learning (Lu et al., 2020) has been used to leverage the diversity of training signals provided by combining datasets, but only for discriminative tasks. While some dialogue work (Cogswell et al., 2020) aims to bootstrap a conversing agent from VQA datasets, most work on GuessWhat?! (de Vries et al., 2017; Shekhar et al., 2019; Strub et al., 2017) has designed bespoke models for the task, ignoring the utility of this dataset for other Vision+Language tasks.
We propose self-play as a mechanism for learning general grounded representations. We seed our approach with the GuessWhat?! corpus of questions and objects, and demonstrate how to generalise to other downstream tasks. We propose two different strategies to exploit these data. First, a supervised learning phase is undertaken to learn a Questioner and an Oracle model able to play guessing games. Second, the trained agents can be used to play guessing games on images, requiring only object annotations as supervision. We show that an agent trained on GuessWhat?! dialogues can use self-play to adapt to new and harder tasks. Specifically, we investigate models' generalisation performance and the quality of the learned representations on the CompGuessWhat?! benchmark (Suglia et al., 2020), a more extensive evaluation suite for GuessWhat?!. Furthermore, we study how the learned representations help solve VQA on the TDIUC dataset (Kafle and Kanan, 2017). We show overall comparable performance with state-of-the-art models, and improvements for specific question types that require object attribute information to be answered correctly.

Methodology
Our proposed transfer/fine-tuning procedure requires a training set of guessing games D_g from which we learn a Questioner Q and an Oracle O via supervised learning. Given a set of images I, it is possible to use the trained models Q and O to run the self-play procedure for n epochs, obtaining the model Q_n. Finally, given a downstream task t and an associated dataset D_t based on images from I, we use Q_n's parameters as initialisation for the training procedure on D_t.
To apply this procedure, both the Questioner and the Oracle require a multi-modal encoder Γ able to generate d-dimensional representations for the textual tokens h_t and for the objects h_o, as well as to fuse the visual and textual modalities in a representation of the current context h_c. After the self-play procedure, only the encoder Γ of the model Q_n is used in the fine-tuning process on the downstream task t using the dataset D_t. It is important to underline that the presented self-play procedure does not depend on a specific implementation of the multi-modal encoder Γ. A possible implementation is presented in Section 2.4 and is used in the experimental evaluation of this paper.
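The overall procedure can be summarised in a short sketch. Every class and helper below is a hypothetical stand-in used only to show the control flow (pretrain, self-play, fine-tune), not the paper's actual code:

```python
# Illustrative sketch of the transfer procedure: supervised pretraining on gold
# games, n self-play epochs, then downstream fine-tuning of the shared encoder.
# All classes and helpers here are hypothetical stand-ins.

class Agent:
    def __init__(self, encoder):
        self.encoder = encoder  # the shared multi-modal encoder Γ

def supervised_pretrain(gold_games):
    # stand-in: learn Questioner and Oracle from successful gold dialogues
    return Agent(encoder=0), Agent(encoder=-1)

def self_play_epoch(questioner, oracle, images):
    # stand-in: one self-play epoch refines the Questioner's encoder
    return Agent(encoder=questioner.encoder + 1)

def fine_tune(encoder, downstream_data):
    # stand-in: initialise the downstream model with the trained encoder
    return {"init": encoder, "n_examples": len(downstream_data)}

def transfer(gold_games, images, downstream_data, n_epochs):
    Q, O = supervised_pretrain(gold_games)
    for _ in range(n_epochs):      # self-play needs only object annotations
        Q = self_play_epoch(Q, O, images)
    return fine_tune(Q.encoder, downstream_data)

model = transfer(["game"], ["img"], ["qa1", "qa2"], n_epochs=3)
```

Note that only the encoder survives the hand-off to the downstream task; the heads used for playing games are discarded.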

Oracle design
The Oracle task is cast as a Visual Question Answering (VQA) task conditioned on the image I, the current question q, and the target object ô. We follow common practice in vocabulary-based VQA (Antol et al., 2015) and treat the problem as a multi-class classification task over the classes {Yes, No, N/A}. We use h_c as input to a multi-layer feed-forward neural network to obtain a probability distribution over the label set.
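As a rough sketch (dimensions and random weights are illustrative, not the paper's), the Oracle head maps the context vector h_c to a distribution over the three answers:

```python
import numpy as np

# Toy sketch of the Oracle answer head: a feed-forward network over the
# context representation h_c, ending in a softmax over {Yes, No, N/A}.
# Dimensions and random weights are illustrative only.

rng = np.random.default_rng(0)
d = 8                                         # hypothetical encoder size
W1, b1 = rng.standard_normal((d, d)), np.zeros(d)
W2, b2 = rng.standard_normal((d, 3)), np.zeros(3)

def oracle_head(h_c):
    hidden = np.maximum(h_c @ W1 + b1, 0.0)   # ReLU layer
    logits = hidden @ W2 + b2
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()                    # P(answer | I, q, ô)

p = oracle_head(rng.standard_normal(d))
```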

Questioner design
The Questioner must play two roles: question generation and target object prediction (de Vries et al., 2017). It is beneficial to jointly learn the two tasks because the representations learned by each task are complementary. In addition, they better encode attributes, which favours better generalisation to unseen object categories (Suglia et al., 2020).
To solve the two specific tasks in a multi-task fashion, we design two different heads on top of the shared encoder Γ: 1) the guesser head, which produces a probability distribution over every object o_i by passing the encoded representations h_{o_i} through an MLP; 2) the generator head, a multi-modal decoder, also implemented as an MLP, which predicts a probability distribution over the vocabulary V given the context representation generated by Γ.
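The two heads can be sketched as follows. All sizes are hypothetical, and single linear layers stand in for the MLPs:

```python
import numpy as np

# Sketch of the two heads over the shared encoder Γ (all sizes hypothetical):
# the guesser scores each contextual object representation and normalises
# across objects; the generator projects the context vector to the vocabulary.

rng = np.random.default_rng(1)
d, n_objects, vocab_size = 8, 5, 100
W_guess = rng.standard_normal((d, 1))         # per-object scoring (MLP stand-in)
W_gen = rng.standard_normal((d, vocab_size))  # context-to-vocabulary projection

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def guesser_head(h_objects):                  # h_objects: (n_objects, d)
    return softmax((h_objects @ W_guess).ravel())   # P(o_i | h_{o_i})

def generator_head(h_context):                # h_context: (d,)
    return softmax(h_context @ W_gen)               # P(next token | context)

p_obj = guesser_head(rng.standard_normal((n_objects, d)))
p_tok = generator_head(rng.standard_normal(d))
```

The key design point is that both heads read from the same encoder, so gradients from both tasks shape Γ.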
We include two losses in our model: 1) the negative log-likelihood of the probability associated by the guesser head with the target object ô (Shekhar et al., 2019); 2) a sequence-to-sequence cross-entropy loss (Sutskever et al., 2014) for the generated question tokens. Unlike previous work that trains a separate module to learn to stop (Shekhar et al., 2018), we add a special token [STOP] to the input data so that the model learns when to stop as part of the question generation task.
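The two losses combine as follows; the token ids and distributions below are toy values, and the [STOP] id is an assumption for illustration:

```python
import numpy as np

# Sketch of the two training signals combined in the multi-task loss:
# guesser NLL on the target object, plus token-level cross-entropy over the
# generated question, whose gold sequence ends with a [STOP] token id.

def guesser_nll(p_objects, target_idx):
    return -np.log(p_objects[target_idx])

def generation_xent(p_steps, gold_ids):
    # p_steps: (T, vocab) per-timestep distributions; gold_ids has length T
    return -float(np.mean([np.log(p_steps[t, g]) for t, g in enumerate(gold_ids)]))

p_objects = np.array([0.1, 0.7, 0.2])   # toy guesser output, target is object 1
p_steps = np.full((2, 4), 0.25)         # uniform toy decoder outputs
STOP = 3                                # hypothetical [STOP] token id
loss = guesser_nll(p_objects, 1) + generation_xent(p_steps, [0, STOP])
```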
Training an agent to solve tasks of different complexity and size is challenging. The procedure presented in (Shekhar et al., 2019) alternates between tasks, updating the hardest task more often. For this technique, finding the right schedule is cumbersome and requires fine-tuning. We instead rely on a more systematic training procedure based on random dataset-proportional batch sampling, inspired by (Sanh et al., 2019). This is a hard-parameter-sharing multi-task training procedure that avoids interference between tasks and favours more stable training, which mitigates catastrophic forgetting (French, 1999).
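Dataset-proportional sampling amounts to drawing, at every training step, the task whose batch will be used with probability proportional to its dataset size. A minimal sketch (task names and sizes are illustrative):

```python
import random

# Sketch of dataset-proportional batch sampling: at every training step a task
# is drawn with probability proportional to its dataset size, so no hand-tuned
# alternation schedule is needed. Task names and sizes are illustrative.

def proportional_schedule(dataset_sizes, n_steps, seed=0):
    rng = random.Random(seed)
    tasks = list(dataset_sizes)
    weights = [dataset_sizes[t] for t in tasks]
    return [rng.choices(tasks, weights=weights)[0] for _ in range(n_steps)]

schedule = proportional_schedule({"guesser": 900, "generator": 100}, n_steps=1000)
```

Larger datasets are visited more often automatically, which replaces the manual alternation schedule of earlier work.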

Self-Play via Iterated Experience Learning (SPIEL)

Inspired by iterated learning (Kirby et al., 2014), we design a process by which the Questioner learns from games previously generated by other instances of the Questioner agent. We call our training procedure Self-play via Iterated Experience Learning (SPIEL). In SPIEL, described in Algorithm 1, we assume access to a set of images I and the bounding boxes O_I of the objects therein. In every gameplay there is a Questioner Q and an Oracle O, initialised with agents Q_0 and O, respectively, that were trained with Supervised Learning using gold successful dialogues. We consider every iteration e of the algorithm as a self-play epoch. In a single self-play epoch, we alternate 3 phases.

Figure 1: We use the single-stream VLP model (Unified Encoder-Decoder for Vision-Language Pre-training) as a backbone multi-modal encoder for our task. The visual feature tokens (marked in red) are the FastRCNN features associated with the objects in the image; the history tokens (marked in blue) and the tokens to be generated (marked in yellow) are given as input to the model. A Guesser head uses the learned contextual object representations to generate a probability distribution over the objects P(o_i | h_{o_i}), whereas the Generator head is used to incrementally predict the masked tokens.

Algorithm 1 SPIEL: Self-Play via Iterated Experience Learning
  Initialise the experience buffer E_g
  for e ← 1, n do
    Interactive phase:
      Q ← Q_e (load latest weights)
      G_e ← GENERATE_GAMES(I)
      play G_e and store the dialogues in E_g
    Transmission phase:
      for g ∈ E_g do (priority to the latest games)
        if IS_VALID_GAME(g) then add g to D_g^e
        if LEN(D_g^e) == LEN(D_q) then break
    Learning phase:
      fine-tune Q on D_g^e and D_q

Interactive phase: the agents play guessing games with novel combinations of image and target object. A generated dialogue is successful if the predicted object is equal to the target object. Every played dialogue is stored in an experience buffer E_g.
Transmission phase: in this phase, the datasets for the Questioner's multi-task learning procedure are created. The generator head dataset D_q is fixed in advance, while the dataset for the guesser head D_g^e is created from the experience buffer E_g by selecting the unique and valid dialogues. The Oracle is fixed during this learning procedure.
Learning phase: the same multi-task learning procedure used in the supervised learning phase is used to fine-tune the Questioner parameters using the datasets D_g^e and D_q collected for the current epoch e. This procedure is repeated n times or until a halting condition is reached (e.g., early stopping based on a validation metric).
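The three phases above fit into one short loop. Every helper here is a hypothetical stand-in, and the toy stubs only exercise the control flow:

```python
# Compact sketch of the SPIEL loop: interactive, transmission, and learning
# phases per epoch. Every helper is a hypothetical stand-in.

def spiel(Q, O, images, D_q, n_epochs, play, is_valid, fine_tune):
    buffer = []                                          # experience buffer E_g
    for e in range(n_epochs):
        buffer = play(Q, O, images) + buffer             # interactive phase; newest first
        D_g = [g for g in buffer if is_valid(g)][:len(D_q)]  # transmission phase
        Q = fine_tune(Q, D_g, D_q)                       # learning phase (Oracle fixed)
    return Q

# toy stubs that only exercise the loop
play = lambda Q, O, imgs: [f"game-{Q}-{i}" for i in imgs]
is_valid = lambda g: True
fine_tune = lambda Q, D_g, D_q: Q + 1                    # "updated" Questioner
Q_n = spiel(0, "oracle", [1, 2], D_q=["d1", "d2", "d3"],
            n_epochs=3, play=play, is_valid=is_valid, fine_tune=fine_tune)
```

Capping D_g at the size of D_q keeps the two tasks balanced during the multi-task update.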
See Appendix A.1 for implementation details. At the end of the SPIEL procedure, we obtain the model Q_n, whose parameters can be reused in other tasks. In particular, we use the parameters of Q_n's shared encoder Γ as initialisation for the fine-tuning on the downstream task t using the dataset D_t.

Implementation
We implement the shared multi-modal encoder Γ using VLP (Zhou et al., 2020), a single-stream multi-modal Transformer for captioning depicted in Figure 1. During the GuessWhat?! fine-tuning, we extend VLP by including the dialogue context in the input together with the features associated with the objects in the image. We learn two new segment ids to represent the question/answer exchanges in the dialogue, as described in (Wolf et al., 2019). The question is generated by incrementally replacing [MASK] tokens until the end of sequence is generated. See Appendix A.2 for more details. SPIEL training is run on a set of images I from the GuessWhat?! and TDIUC datasets with corresponding object annotations. We make sure that GuessWhat?! test images are not contained in I. This is not an issue for TDIUC test images because the downstream task annotations (QA pairs) are not used by the model during this phase. Once the model has been trained with SPIEL, we use the parameters of the shared encoder Γ as a backbone for a VQA model that is fine-tuned on the TDIUC dataset.

Experimental Evaluation
To assess the generality of our learned representations, we include two evaluation paradigms: 1) in-domain evaluation and 2) transfer evaluation. We evaluate several variants of our model: 1) VLP-SL: a VLP-based model trained on GuessWhat?! data using multi-task learning; 2) SPIEL-gs: the VLP-SL model fine-tuned with our SPIEL procedure, where the generator head uses only gold successful games (gs); 3) SPIEL-gm: same as 2), but both successful and failed gold games are used by the generator head. In both SPIEL variants, the guesser head is trained using failed and successful generated games, because it is important for the guesser head to be exposed to both types of signal to learn a more robust policy. We investigate the two variants SPIEL-gs and SPIEL-gm to gain more insight into the effect that successful and failed games have on the generator head's ability to produce effective dialogues.

In-domain evaluation
We use the CompGuessWhat?! evaluation suite (Suglia et al., 2020) to assess the ability of the Questioner to play guessing games and learn visually grounded representations in the process. It complements an evaluation based only on gameplay accuracy (de Vries et al., 2017) with 2 auxiliary tasks on the target object: 1) attribute prediction, expressed in terms of abstract attributes (A), situated attributes (SO), abstract+situated attributes (AS), and location attributes (L); 2) zero-shot gameplay, with near-domain accuracy (ND) and out-of-domain accuracy (OD). Table 1 shows the comparison with previous state-of-the-art models on this benchmark, such as de Vries et al. (2017). VLP-SL has a greater advantage in terms of representation power compared to previous models. This is reflected in all the tasks of the CompGuessWhat?! evaluation. In particular, we see better performance even for the zero-shot gameplay (ND: +5.6, OD: +15.2). This is because VLP associates with every object a vector of probabilities representing a distribution over the VisualGenome object classes, which helps VLP cope with unseen objects and generalise. Learning to play is key to gameplay performance, leading to an increase of +4.4 over VLP-SL and +7.9 over GDSE-CL. In this setup, the difference between SPIEL-gs and SPIEL-gm is minimal (0.1). However, when analysed in more detail, we can see that training the Questioner with gold successful data only improves attribute prediction, while using mixed data improves overall generalisation in the zero-shot evaluation.

Transfer evaluation
For the transfer evaluation, we use the VQA dataset TDIUC (Kafle and Kanan, 2017). It provides a finer-grained way to assess the quality of the representations learned by our guessing-game transfer technique in terms of several question types, including object categories and their attributes. Specifically, we were interested in improving on the following question types: 1) Positional reasoning; 2) Counting; 3) Object presence; 4) Utility/Affordances; 5) Attribute; 6) Color; and 7) Object recognition. TDIUC is evaluated using the arithmetic mean accuracy per question type (A-MPT), as well as the harmonic mean (H-MPT), which better captures the skewed question-type distribution. In Table 2, we report a comparison between our variants trained on guessing-games data (VLP-SL and SPIEL-*), the original VLP model, and models specifically designed for VQA; the full set of results is reported in Appendix, Table 4.
Among them, MUREL achieves the best scores across the board, due to a custom iterative reasoning mechanism and a non-linear fusion module. However, all our models have a more balanced overall performance, which results in better harmonic means (H-MPT, +5 points over MUREL). This improvement is driven by an increase in accuracy on the Utility/Affordances question type (+20.7). As shown by the attribute prediction on CompGuessWhat?! and depicted in Figure 2 (c), our models learn better representations than competitors specifically for abstract attributes, among which are object affordances. In particular, we can see how the model is able to understand that certain objects can contain things (e.g., "the one with the soup in it?"), that objects have specific functions (e.g., "are the contents of the plate edible?"), or that they have specific properties (e.g., "a spoon is made of wood"). The effectiveness of the proposed fine-tuning procedure is confirmed by the improved performance across all the question types compared to our baseline VLP+CC. Models such as MUREL and MCB-*, equipped with specific VQA modules, have an advantage on specific question types (e.g., positional reasoning) compared to VLP, which relies only on BERT self-attention layers (Devlin et al., 2019). In addition, when comparing the two SPIEL variants, a trend similar to the in-domain evaluation can be observed. In particular, SPIEL-gm benefits from being exposed to more language data coming from both successful and failed guessing games.

Conclusions
In this work, we verified that representations learned while playing guessing games can be transferred to other downstream tasks such as VQA. We presented two ways of learning from guessing games data, namely multi-task learning and SPIEL. Models using SPIEL performed better both on the in-domain evaluation on CompGuessWhat?! and on the transfer task TDIUC. Our self-play procedure was able to learn useful and finer-grained object representations such as object affordances, thus demonstrating that learning to guess helps learning to ground.

Table 2: Results for the transfer evaluation on TDIUC. The models are divided in two categories: (top) models specifically designed for VQA and (bottom) our VLP-based implementations. We report only the question types that we believe will benefit from the guessing games fine-tuning procedure. For the full set of results please refer to Appendix, Table 4.
The current study showed how we can apply the SPIEL training procedure to a VQA dataset such as TDIUC. We believe that this work can be extended to other datasets, because the SPIEL procedure only requires a set of images and associated object bounding boxes. These could be either gold annotations or the output of a trained object detector, making guessing games a holistic self-training procedure for multi-modal datasets.

A Appendices
A.1 Self-Play via Iterated Experience Learning (SPIEL)

Learning to replicate gold dialogues is not enough to play successfully. High performance in gameplay can be achieved only when the agents start playing the game and are exposed to their own mistakes. Reinforcement Learning (Strub et al., 2017) or Collaborative Learning (Shekhar et al., 2019) are possible approaches to tackle this problem. Inspired by iterated learning (Kirby et al., 2014), we design a process by which "the gameplay arises in one instance of the questioner through induction on the basis of observations of gameplay in other questioner agents who acquired that gameplay capability in the same way". Therefore, we call our procedure Self-play via Iterated Experience Learning (SPIEL).
In this setup, we assume access to a set of images I, and for each image I we have object bounding boxes O_I. The SPIEL training procedure, shown in Algorithm 1, can be described as follows. We assume that there is a Questioner agent Q and an Oracle agent O. At the beginning of the procedure they are initialised with agents Q_0 and O, respectively, trained with Supervised Learning using gold successful dialogues. We consider every iteration e of the algorithm as a self-play epoch. In a single self-play epoch we alternate 3 phases: 1) interactive phase: the agents play guessing games with novel combinations of image and target object; 2) transmission phase: the Questioner creates new datasets from the dialogues generated over the epochs; 3) learning phase: multi-task learning is used to fine-tune the Questioner parameters using the datasets collected for the current epoch.

A.1.1 Interactive phase
We start the interactive phase by first sampling a set of reference games G_e, which consists of pairs (I, ô), where I ∈ I and ô is the target object sampled at random from the object annotations O_I. The agents Q_e and O play the games G_e and accumulate the generated experiences. During this phase, the Questioner agent uses the most recent weights, generated at epoch e − 1. It generates questions by nucleus sampling (Holtzman et al., 2019) from the probability distribution over the vocabulary learned by the generator head. When the [STOP] token is sampled, the guesser head, conditioned on the dialogue generated so far, selects the object õ with the highest probability. A game is successful if the predicted object õ is equal to the target object ô.
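Nucleus (top-p) sampling can be sketched as follows, assuming the generator head's output is a plain probability vector over a toy vocabulary:

```python
import numpy as np

# Sketch of nucleus (top-p) sampling: sample from the smallest set of tokens
# whose cumulative probability mass reaches p, after renormalisation.

def nucleus_sample(probs, p=0.9, rng=None):
    rng = rng or np.random.default_rng()
    order = np.argsort(probs)[::-1]                   # tokens by descending probability
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1  # smallest prefix with mass >= p
    nucleus = order[:cutoff]
    renormed = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=renormed))

probs = np.array([0.5, 0.3, 0.15, 0.05])              # toy vocabulary of 4 tokens
token = nucleus_sample(probs, p=0.8, rng=np.random.default_rng(0))
```

Truncating the tail of the distribution avoids degenerate questions while keeping generation stochastic across self-play epochs.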

A.1.2 Transmission phase
For every epoch e, in the transmission phase we create the datasets D_q and D_g for the questioner and guesser heads, respectively, which are used in the learning phase to update the Questioner parameters.

Questioner experience buffer. To make sure that the Questioner does not experience language drift, we consider a fixed dataset D_q composed of dialogues generated by humans contained in the GuessWhat?! training data. The shared encoder Γ benefits from this data too, because it is still exposed to human-generated language, which guarantees better generalisation.
Guesser experience buffer. The Guesser should learn from its own mistakes; therefore, we use generated dialogues for the model updates (de Vries et al., 2017; Shekhar et al., 2019). Inspired by Prioritised Experience Replay (Schaul et al., 2015), we create the experience buffer for the guesser E_g^e by accumulating all the unique and valid dialogues generated until epoch e. We consider a dialogue unique if D_g^e does not contain another dialogue with the same encoding. In addition, we consider a dialogue valid if it does not contain repeated questions. We cap the number of dialogues in D_g^e so that it matches the number of experiences in D_q. This is done so that during the multi-task training procedure there is an equal number of dialogues for each task from which the agent will learn.
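The uniqueness and validity filters can be sketched as follows. Here a dialogue is a list of (question, answer) pairs and its "encoding" is simply the tuple of questions, a simplification of whatever encoding the actual model uses:

```python
# Sketch of the guesser experience-buffer filters: drop dialogues with
# repeated questions, deduplicate by encoding, cap at the size of D_q.

def is_valid(dialogue):
    questions = [q for q, _ in dialogue]
    return len(questions) == len(set(questions))       # no repeated questions

def build_guesser_buffer(dialogues, cap):
    seen, buffer = set(), []
    for d in dialogues:                                # latest games first
        key = tuple(q for q, _ in d)
        if is_valid(d) and key not in seen:
            seen.add(key)
            buffer.append(d)
        if len(buffer) == cap:                         # match |D_q| for task balance
            break
    return buffer

games = [
    [("is it red?", "yes")],
    [("is it red?", "yes")],                           # duplicate encoding: dropped
    [("is it a cup?", "no"), ("is it a cup?", "no")],  # repeated question: invalid
    [("is it big?", "no")],
]
buffer = build_guesser_buffer(games, cap=10)
```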

A.1.3 Learning phase
In this phase, we use the same multi-task training procedure that was used during the supervised learning phase. We update the Questioner parameters using the dialogues collected in D q and D e g . The updated parameters resulting from this step will be used for the self-play epoch e + 1.

A.2.1 Multi-modal encoder
To implement the agents in our guessing games, we rely on VLP, a single-stream multi-modal model (Zhou et al., 2020) that jointly learns visual and language representations using the Conceptual Captions (CC) dataset (Sharma et al., 2018). The input starts with a classification token ([CLS]), followed by a series of K visual tokens; a separation token ([SEP]) divides the visual tokens from the dialogue sequence and from the sequence of tokens to be generated. In a guessing game, we represent the reference image I as a set of image regions {r_1, r_2, ..., r_K} extracted from an off-the-shelf object detector. Following (Zhou et al., 2020), each region r_i is represented by a linear transformation of a feature vector f ∈ R^{d_n}, region class probabilities c ∈ R^{d_c}, and region geometric information g ∈ R^{d_o}, where d_o = 5 consists of four values for the top-left and bottom-right corner coordinates of the region bounding box (normalised between 0 and 1) and one value for its relative area (i.e., the ratio of the bounding box area to the image area, also between 0 and 1). The Questioner model uses at most 36 predicted bounding boxes from FastRCNN, while the Guesser uses features generated by FastRCNN for gold bounding boxes. We use a specific segment id s_v for every region.
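The d_o = 5 geometric vector attached to each region can be computed as follows (box coordinates and image sizes are toy values):

```python
import numpy as np

# Sketch of the d_o = 5 geometric vector for a region: the four corner
# coordinates normalised by image size, plus the box's relative area.

def region_geometry(box, img_w, img_h):
    x1, y1, x2, y2 = box                  # top-left and bottom-right corners
    area_ratio = ((x2 - x1) * (y2 - y1)) / (img_w * img_h)
    return np.array([x1 / img_w, y1 / img_h,
                     x2 / img_w, y2 / img_h, area_ratio])

g = region_geometry((0, 0, 50, 100), img_w=100, img_h=200)
```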
For the language part, we use WordPiece embeddings (Wu et al., 2016). In particular, we flatten the turns of the dialogue context into a sequence of tokens. However, to allow the model to differentiate between question and answer tokens, following (Wolf et al., 2019), we rely on novel segment ids (s_u, s_a). The VLP hidden state of the [CLS] token is used as the context representation h_c.

A.2.2 Oracle design
The implementation of the Oracle follows the one presented in the original VLP paper to solve the VQA task (Zhou et al., 2020). In particular, the model predicts a probability distribution over the possible answers using a multi-layer feed-forward neural network that receives as input the element-wise product between the hidden state associated with the [CLS] token and the hidden state associated with the target object. The model is optimised by minimising the cross-entropy loss, using as training data the question/answer pairs in the successful GuessWhat?! training dialogues.

A.2.3 Questioner design
We rely on VLP's ability to generate captions for the question generation task. In particular, we provide as input to the model: 1) predicted FastRCNN visual features, following (Zhou et al., 2020); 2) the dialogue generated so far, as a flattened sequence of tokens; 3) the question to be generated. We use another segment id s_q to allow the model to differentiate between the input and the tokens to be generated. Following (Dong et al., 2019), we constrain the attention mask for the tokens of the question to be generated so that the token at timestep t is not allowed to attend to future tokens (seq2seq attention mask). For this specific model, we use the masked language modelling objective (Devlin et al., 2019), casting the task as multi-modal masked language modelling.

A.3 GuessWhat?! evaluation
Oracle evaluation. We report a test accuracy for the Oracle of 82.22%. The baseline model used by all the other approaches achieves 78.5% (de Vries et al., 2017). Table 3 reports the accuracy of the guesser in predicting the target object when gold dialogues are given as input. We compare this model with several baselines reported in (de Vries et al., 2017) (first block), more sophisticated methods such as ParallelAttention (Zhuang et al., 2018) and GDSE-* (Shekhar et al., 2019) (second block), as well as other Transformer-based models such as ViLBERT (Lu et al., 2020).

Table 4: Summary of results for the transfer evaluation on TDIUC. The models are divided in two categories: (1) models which are specifically designed for VQA (top) and (2) models that rely on the VLP encoder to generalise to different downstream tasks (bottom). We underline the question types that we believe will benefit from the guessing games transfer/fine-tuning procedure.