Diversity and Consistency: Exploring Visual Question-Answer Pair Generation

Although generating questions and answers together shows promising value for downstream applications, it remains under-explored. In this paper, we introduce a novel task that targets question-answer pair generation from visual images. It requires not only generating diverse question-answer pairs but also keeping them consistent with the image. We study different generation paradigms for this task and propose three models: a pipeline model, a joint model, and a sequential model. We integrate variational inference into these models to achieve diversity and consistency, and further propose region representation scaling and attention alignment to improve consistency. Finally, we devise an evaluator as a quantitative metric for consistency. We validate our approach on two benchmarks, VQA2.0 and Visual-7w, by evaluating diversity and consistency both automatically and manually. Experimental results show the effectiveness of our models: they can generate diverse or consistent pairs. Moreover, this task can be used to improve visual question generation and visual question answering.


Introduction
Teaching a machine to generate question-answer pairs (QAPs) from images can benefit many downstream applications, such as child education (Wang et al., 2018), visual dialog (Das et al., 2017), generating verification codes for websites, visual question generation (VQG) (Krishna et al., 2019), and visual question answering (VQA) (Wu et al., 2016). For example, training VQA models requires large-scale labelled data, which is usually labour-intensive and expensive to construct. Meanwhile, bias still exists in large QA datasets (Goyal et al., 2017), including domain coverage, question and answer types, and linguistic style. Therefore, as an alternative to constructing datasets manually, QAP generation can further promote VQA.
In this paper, we study a novel task that aims to generate QAPs from visual images, namely VQAPG (Visual Question-Answer Pair Generation). As shown in Figure 1, it poses two challenges: diversity and consistency. On the one hand, even a simple image contains varied content that can be asked about. The focus of questions can range from the objects that appear and their features to global attributes such as the image background and the time the photo was taken. On the other hand, generating consistent QAPs is also critical. When we say a QAP is consistent with the image, we mean two things: the question is answerable given the image, and the answer is correct for the question. Maintaining consistency is therefore difficult because it requires the generation model to simultaneously guarantee the correctness of both questions and answers.
Two tasks are related to VQAPG, but neither serves as suitable prior art. The first is VQG (Zhang et al., 2017; Patro et al., 2018; Krishna et al., 2019), which produces questions given answers or other knowledge. The other is VQA, which aims to answer given questions (Goyal et al., 2017; Su et al., 2020; Lu et al., 2019). They can be viewed as two subtasks of VQAPG. However, simply combining the two is not an ideal substitute for VQAPG, since each is trained conditioned on the other's output.
We study several generation paradigms to perform VQAPG. The first is a pipeline model that generates questions and answers one after another. Next, inspired by non-autoregressive text generation (Ren et al., 2020), we propose a joint model that generates questions and answers in parallel. To reduce the model size, we also propose a sequential model that concatenates the two targets into one sequence. We integrate latent variables into these models through variational inference (Kingma and Welling, 2014) to improve diversity: when fed with different latent variables sampled from the prior distribution, a model can generate various QAPs. Besides, we observe that if we divide an image into a grid of regions, the target question and answer are tightly related to only part of them. Since the latent variable contains information about the target QAP, we use it to make the model concentrate on those related regions by scaling their representations. To improve the joint model's consistency, we align the attentions of the question decoder and the answer decoder so that they focus on similar regions.
A remaining issue of VQAPG is how to measure consistency automatically, in addition to manual inspection. Popular metrics such as BLEU (Papineni et al., 2002) and METEOR (Banerjee and Lavie, 2005) only measure the overlap between generated QAPs and the target ones, and are thus insufficient to indicate consistency. Targeting this issue, we devise a consistency evaluator trained with an adversarial strategy. We treat the samples in the dataset as consistent and construct inconsistent samples by shuffling images, answers, questions, or all of them. The evaluator gives a high score if the generated QAP is consistent with the image and a low score if not.
We conduct experiments on two datasets, VQA2.0 (Goyal et al., 2017) and Visual-7w (Zhu et al., 2016), in both of which each image has multiple human-created QAPs. The quantitative results indicate that each model has its strengths. The pipeline model achieves the best diversity, while the sequential model achieves the best consistency; however, neither improves both together. In contrast, the joint model improves both diversity and consistency markedly through variation. Further ablation studies illustrate the effectiveness of our proposed methods: both the region representation scaling mechanism and the attention alignment mechanism improve consistency significantly. We also evaluate generated QAPs manually and present case studies. Moreover, to demonstrate the usefulness of VQAPG for downstream applications, we use it to generate pre-training samples for VQG and VQA. Results indicate that VQAPG can enhance the performance of both VQG and VQA models.
In short, our contribution includes four parts: i) We propose a novel task, VQAPG, that targets generating diverse and consistent question-answer pairs from images. ii) To perform VQAPG, we study multiple generation paradigms and propose three models. We also design a consistency evaluator. iii) We incorporate variational inference into the models and propose a series of techniques, including region representation scaling and attention alignment, to improve diversity and consistency. iv) We conduct comprehensive experiments on two large-scale datasets. The results show that our approach is effective at generating diverse and consistent QAPs and benefits other applications.

Related Works
Visual Question Generation is a task that has emerged in recent years. Question generation was first studied on text (Heilman and Smith, 2010; Labutov et al., 2015; Du et al., 2017; Zhou et al., 2018; Sun et al., 2018; Ma et al., 2020; Kim et al., 2019), while the image counterpart has received less attention. Existing methods in this field are typically based on learning algorithms (Mostafazadeh et al., 2016; Zhang et al., 2017) and are often incorporated with a variational process (Jain et al., 2017; Krishna et al., 2019). Visual question generation has also been conducted together with visual question answering (Li et al., 2018; Sun et al., 2020).
Visual Question Answering has received more interest thanks to public datasets such as VQA2.0 (Goyal et al., 2017) and VisualGenome (Krishna et al., 2017). Through fine-tuning large pre-trained models (Su et al., 2020; Lu et al., 2019), its performance has been improved considerably. However, it still requires large-scale labeled datasets, which are too costly to annotate manually. Therefore, a successful VQAPG system would be beneficial in reducing such costs.
Question Answer Pair Generation has mainly been studied on text (Su et al., 2021). As answers cannot be extracted directly from images and there are no candidate answers, such methods cannot be simply applied to images.

[Figure 2: Overall architecture of our three models for VQAPG. They share the same image encoder architecture. The variation module is optional and produces latent variables. "X" and "Y" represent question and answer, or vice versa. "Target" represents the sequence used to estimate the posterior distribution, such as the question, the answer, or both. "<sep>" is a reserved token separating question and answer.]

VQAPG Task
Given an image denoted as I, VQAPG aims to produce diverse QAPs, each of which contains a question Q that is answerable under I and its correct answer A. Both the question $Q = (q_1, q_2, \cdots, q_m)$ and the answer $A = (a_1, a_2, \cdots, a_n)$ are token sequences, where $q_i$ and $a_i$ denote tokens. Figure 4 compares VQAPG with the typical VQG and VQA tasks. Our final goal is to obtain a model that approximates the true data distribution $P(Q, A \mid I)$ so that we can sample questions and answers from it. Because VQAPG is a one-to-many task, we also expect this model to support sampling diverse and consistent QAPs.

Approach
We propose three models to perform VQAPG: the pipeline model, the joint model, and the sequential model. As shown in Figure 2, all of them adopt the encoder-decoder architecture.

Preliminary
All three models rely on the same image encoder to embed images into vector space. It consists of a convolutional neural network (CNN) and a context-aware sequence model. The former transforms the image into a feature map of size $C \times H \times W$, which can be viewed as embeddings of $R = H \times W$ regions with dimension $C$. The latter further encodes the feature map. We use $h_I = (h_1, \dots, h_R)$ to denote the final image representation, where $h_i$ is the i-th region representation.
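As a concrete illustration (not the exact released implementation), a minimal sketch of such an encoder, assuming a ResNet-50 backbone and a two-layer bidirectional LSTM as the context-aware sequence model (consistent with the implementation details below), could look like:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class ImageEncoder(nn.Module):
    """Grid-region image encoder: CNN feature map + context-aware sequence model."""
    def __init__(self, hidden_dim=512):
        super().__init__()
        resnet = models.resnet50(pretrained=True)
        # Keep everything up to the last convolutional block (drop avgpool/fc).
        self.cnn = nn.Sequential(*list(resnet.children())[:-2])   # -> (B, 2048, H, W)
        self.proj = nn.Linear(2048, hidden_dim)
        self.rnn = nn.LSTM(hidden_dim, hidden_dim // 2, num_layers=2,
                           batch_first=True, bidirectional=True)

    def forward(self, images):                                 # images: (B, 3, 224, 224)
        fmap = self.cnn(images)                                # (B, 2048, H, W)
        B, C, H, W = fmap.shape
        regions = fmap.view(B, C, H * W).transpose(1, 2)       # (B, R, C), R = H * W
        regions = self.proj(regions)                           # (B, R, hidden_dim)
        h_I, _ = self.rnn(regions)                             # contextualised region reps
        return h_I                                             # (B, R, hidden_dim)
```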

Model
Each model has two versions: a baseline version and a variational version. The goal of the baseline is to maximize the conditional likelihood, estimated by $P_\Omega$, where $\Omega$ denotes the parameters of the model. Table 1 summarizes the objectives (log-likelihood and ELBO) of all models.
However, maximizing the likelihood alone is insufficient to generate diverse QAPs, so we incorporate variational inference (Kingma and Welling, 2014), as shown in Figure 3. The key is to estimate the prior and posterior distributions of the latent variable, denoted Z. We define both distributions as isotropic Gaussians and design a prior encoder and a posterior encoder for estimation. We use the symbol Φ for the parameters of the prior encoder and Ψ for those of the posterior encoder. Both encoders use a multi-layer perceptron to produce the mean and variance of their distributions. We exploit the reparameterization trick (Kingma and Welling, 2014) to allow backpropagation of gradients. With latent variables, the objective of the variational models is to maximize the evidence lower bound (ELBO), as shown in Table 1 (the derivation of the ELBO is given in the Appendix).
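For reference, the generic form of the bound, written here for a single target sequence Y and latent variable Z (the per-model instantiations are summarized in Table 1), is the standard conditional-VAE ELBO:

$$\log P(Y \mid I) \;\ge\; \mathbb{E}_{P_\Psi(Z \mid I, Y)}\big[\log P_\Omega(Y \mid I, Z)\big] \;-\; \mathrm{KL}\big(P_\Psi(Z \mid I, Y)\,\|\,P_\Phi(Z \mid I)\big).$$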
Since an image contains multiple regions and the target QAP relates to only a few of them, we argue that different weights should be allocated to these regions and encoded into the latent variable. We therefore exploit a text-image attention module in both the prior and the posterior encoder. Given the image encoding $h_I$ and a sequence encoding $\hat{h}$, the attention result is computed by Equation 1, where M represents a multi-layer perceptron.
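The display of Equation 1 is not reproduced above; a plausible form, assuming standard MLP-scored attention over regions, is:

$$e_i = M\big([h_i; \hat{h}]\big), \qquad \alpha_i = \frac{\exp(e_i)}{\sum_{j=1}^{R}\exp(e_j)}, \qquad c = \sum_{i=1}^{R} \alpha_i h_i,$$

where $[\cdot;\cdot]$ denotes concatenation and $c$ is the attended image summary passed to the prior or posterior encoder; the exact scoring function is an assumption here.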

Pipeline Model
The pipeline model contains two sub-models that conduct question generation and answer generation separately. The generation order of question and answer is interchangeable. Therefore, we use X to denote the first generated sequence and Y the second, and use X-model and Y-model to denote the two corresponding sub-models, respectively. The Y-model takes the gold X as input during training and the predicted X produced by the X-model during inference.
In the baseline version, the two sub-models simply aim to maximize the likelihood, as shown in Table 1. In the variational version, we add two latent variables, $Z_X$ and $Z_Y$: the former controls the diversity of X and the latter that of Y. As shown in Figure 3, the prior and posterior distributions of $Z_X$ are estimated as in Equation 2, where $h_X$ is the representation of X obtained from a sequence encoder and "∘" denotes composition. For $Z_Y$, the prior and posterior are estimated analogously, where $h_Y$ is the representation of Y and "⊕" indicates concatenation.
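The original displays are not reproduced above; read from the surrounding text, the four distributions are

$$P_{\Phi_X}(Z_X \mid I), \quad P_{\Psi_X}(Z_X \mid I, X), \qquad P_{\Phi_Y}(Z_Y \mid I, X), \quad P_{\Psi_Y}(Z_Y \mid I, X, Y),$$

each an isotropic Gaussian whose mean and variance are produced by an MLP over the (attended) image representation, composed with $h_X$ via "∘" for the posterior of $Z_X$ and concatenated with $h_X$ (and $h_Y$) via "⊕" for $Z_Y$. The precise input combinations are a reconstruction rather than the original equations.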

Joint Model
Inspired by non-autoregressive text generation (Ren et al., 2020), we propose a joint model that generates questions and answers in parallel with two decoders, as shown in Figure 2. The joint model assumes that Q and A are independent conditioned on I and optimizes the two decoders separately, without either considering the other. As an image can map to multiple QAPs, the consistency of a generated QAP cannot be guaranteed in the baseline joint model. On the other hand, the latent variable Z contains the information of the target QAP, so consistency can be improved by introducing Z. As shown in Figure 3, the prior of Z is estimated as $P_\Phi(Z \mid I)$, analogous to $P_{\Phi_X}(Z_X \mid I)$ in Equation 2, and the posterior is estimated from the image together with the target question and answer. To improve consistency further, we argue that the two decoders in the joint model should focus on similar regions. Therefore, we align the attentions over regions used by the two decoders. Specifically, we add an attention alignment term $\mathcal{L}_{attn}$ to the objective of the joint model, where $\beta_i$ is the attention of the question decoder at time step i and $\gamma_i$ is that of the answer decoder; both are computed as α in Equation 1, and λ is a scalar weight.
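The alignment term itself is not reproduced above; one plausible instantiation, assuming the KL divergence between the two decoders' averaged attention distributions over regions, is:

$$\bar{\beta} = \frac{1}{m}\sum_{i=1}^{m}\beta_i, \qquad \bar{\gamma} = \frac{1}{n}\sum_{j=1}^{n}\gamma_j, \qquad \mathcal{L}_{attn} = \mathrm{KL}\big(\bar{\beta} \,\|\, \bar{\gamma}\big),$$

which is added to the training objective with weight λ; the exact form used in the paper may differ (e.g., a per-step or symmetrised divergence).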

Sequential Model
Both the pipeline model and the joint model are redundant in the sense that they require separate modules to generate questions and answers, and errors introduced by the pipeline or by separate training can result in inconsistency. Therefore, we propose a sequential model with just one decoder, as shown in Figure 2. The sequential model concatenates the question and answer into a single sentence with a reserved token "<sep>" inserted between them. This sentence is denoted as T. Similar to the pipeline model, the order of question and answer is interchangeable. The baseline sequential model directly maximizes the log-likelihood $P_\Omega(T \mid I)$. In the variational version, the prior $P_\Phi(Z \mid I)$ is analogous to $P_{\Phi_X}(Z_X \mid I)$ in Equation 2, and the posterior is estimated from the image and the concatenated sequence T.

Region Representation Scaling
We initially use the latent variable Z only to transform the state of the decoder. To make fuller use of Z, we propose a novel strategy that scales the region representations. The core idea is that, since the target QAP is tightly related to only a few regions in the image and its information is included in the latent variable, we can scale the region representations to highlight the related regions and weaken the unrelated ones before decoding. Specifically, we assign a weight to the representation of each region and use the scaled representations for subsequent decoding. Note that this scaling mechanism can be used in all variational models.
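A minimal sketch of the scaling, under the assumption that the weight is a sigmoid gate over the latent variable and each region representation:

$$w_i = \sigma\big(\mathrm{MLP}([h_i; Z])\big), \qquad \tilde{h}_i = w_i \, h_i, \qquad i = 1, \dots, R,$$

where σ is the sigmoid function and $\tilde{h}_i$ replaces $h_i$ in decoding; the precise parameterisation of $w_i$ is an assumption here.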

Consistency Evaluator
Consistency evaluation is critical in our work, yet no automatic metric exists for it. Inspired by work on semantic evaluation (Wieting et al., 2019), we devise an evaluation model that measures the consistency of a generated QAP with the given image. It returns a scalar score s in [0, 1]; the score is high if the QAP is consistent with the image. Here, $h_Q$ and $h_A$ are the representations of the question and the answer, respectively. Training this evaluator requires both positive and negative samples. We take all original samples in the dataset as consistent, i.e., positive. We build the negative samples dynamically for each mini-batch of images, questions, and answers. Specifically, negative samples are generated by randomly selecting one of the following four actions and applying it to the positive samples in the mini-batch: 1) shuffling images, 2) shuffling questions, 3) shuffling answers, or 4) shuffling all of them. We then feed the model with both the positive samples and the generated negatives. The evaluator is trained with a mean squared error loss.
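A minimal sketch of the dynamic negative-sample construction (with hypothetical helper names; the released code may implement this differently):

```python
import random
import torch

def build_negatives(images, questions, answers):
    """Turn a consistent (positive) mini-batch into inconsistent (negative) samples
    by shuffling one of: images, questions, answers, or all of them.
    All three arguments are assumed to be batch-first tensors of size B along dim 0."""
    B = images.size(0)
    action = random.choice(["images", "questions", "answers", "all"])
    neg_img, neg_q, neg_a = images, questions, answers
    if action in ("images", "all"):
        neg_img = images[torch.randperm(B)]
    if action in ("questions", "all"):
        neg_q = questions[torch.randperm(B)]
    if action in ("answers", "all"):
        neg_a = answers[torch.randperm(B)]
    return neg_img, neg_q, neg_a

# Training step sketch: positives are regressed towards 1 and negatives towards 0
# with mean squared error on the evaluator's scalar score s in [0, 1], e.g.
#   loss = mse(evaluator(img, q, a), ones) + mse(evaluator(*build_negatives(img, q, a)), zeros)
```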

Experiment Setup
In this section, we describe the datasets and evaluation metrics. Other settings, including implementation details, are in the Appendix.

Dataset
We conduct experiments on two visual question answering datasets, VQA2.0 (Goyal et al., 2017) and Visual-7w (Zhu et al., 2016). In both datasets, each image can map to multiple target QAPs. Because the official test set of VQA2.0 is not public, we use the official development set as our test set and randomly select ten thousand samples from the training set as our development set. For Visual-7w, we take no extra operations. Table 2 shows the statistics of the two datasets.

Evaluation Metrics
In the following experiments, we evaluate our models with both automatic metrics and manual inspection, focusing on the diversity and consistency of the generated results. On the diversity side, we use Distinct (Li et al., 2016), a common diversity metric that measures the ratio of unique n-grams in the text. We adopt Distinct-4 (denoted D) and compute it on the concatenation of each question and answer. We also report the number of unique QAPs in the generation results (denoted N). On the consistency side, we use our consistency evaluation model (Section 4.6) to indicate whether a QAP is consistent with the image: if the output score is greater than 0.5, we consider the result consistent, otherwise inconsistent. We report two consistency metrics: the percentage of consistent QAPs (denoted P) and the average score over the generation results (denoted S).
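For reference, a minimal sketch of the Distinct-n computation on the concatenated question-answer strings (an illustrative implementation, not necessarily the exact one used in our experiments):

```python
def distinct_n(texts, n=4):
    """Ratio of unique n-grams to total n-grams over a list of generated strings."""
    total, unique = 0, set()
    for text in texts:
        tokens = text.split()
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total > 0 else 0.0

# Usage: score the concatenation of each generated question and answer.
# qaps = [q + " " + a for q, a in generated_pairs]
# d4 = distinct_n(qaps, n=4)
```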
We also manually evaluate diversity and consistency on the VQA2.0 dataset. Specifically, four human annotators evaluate diversity and consistency on randomly selected images, each of which has three to ten generated QAPs. We ask every annotator to rate the QAPs in terms of the two metrics above. The evaluation result is transformed into a score in [0, 1] (a higher score means better performance). Detailed rating guidelines are provided to the human judges (see Appendix).

Implementation Details
In all experiments, we use a pre-trained ResNet-50 as the CNN encoder. Both the sequence encoder and decoder are two-layer long short-term memory networks. We do not tune the hyper-parameters elaborately towards each dataset; all models on the two datasets share the same settings. All representations and latent variables are 512-dimensional vectors. The question and answer share the same dictionary, which keeps all tokens. Word embeddings are initialized with GloVe. We apply dropout to all models with a ratio of 0.1. To avoid posterior collapse, we apply free bits of 5 to the KL term in the ELBO; we also apply free bits of 0.03 to the KL term in the attention alignment loss to allow some divergence between the two attentions. We set the weight λ of $\mathcal{L}_{attn}$ to 0.5. We train all models for 40 epochs with batch size 256 on two Nvidia Titan RTX GPUs. Parameters are updated with the Adam optimizer, with an initial learning rate of 1e-3. The learning rate decays with a ratio of 0.5 if the model has not improved for five consecutive epochs on the development set.
The VQA model and the VQG model in this paper share the same encoder-decoder architecture. More specifically, they share modules with the Y-model of the baseline pipeline model. They take the image and the answer (for VQG) or the question (for VQA) as input for generation. The hyper-parameters for the baseline VQA, the baseline VQG, and pre-training remain consistent with those mentioned above. Except for the initial learning rate and the number of training epochs, all other hyper-parameters are kept unchanged during fine-tuning.
Finally, all models are implemented using the Fairseq framework, and the source code is available at https://github.com/LtECoD/vqapg.

Performance of Consistency Evaluator
We present the performance of our consistency evaluator in Table 3. The evaluator distinguishes consistent and inconsistent samples well: its accuracy already exceeds 85% on the training and development sets, and even on the test set, which contains only consistent samples, the accuracy approaches 90%. This indicates that the evaluator is competent to measure consistency.

Quantitative Analysis
The performance of our three models on the two datasets is shown in Table 4, where the pipeline and sequential models generate answers first. On both datasets, the variational pipeline model obtains the best diversity; referring to Table 2, about 75% (on VQA2.0) and 50% (on Visual-7w) of the generated QAPs are unique. The baseline sequential model achieves the best consistency, even reaching 98.95. Compared with these two models, the joint model is much inferior in both diversity and consistency. On the other hand, comparing the baseline and the variational models, it is obvious that the benefit of variation to diversity is significant for the pipeline and joint models. For example, the variational pipeline's Distinct is almost ten times that of the baseline on VQA2.0. However, the diversity gain is not evident for the sequential model. As for consistency, it decreases for both the pipeline and sequential models, especially for the pipeline model. For the joint model, however, the consistency even rises from 23.61% to 79.34% on Visual-7w, a considerable improvement that even surpasses the pipeline model.
These results suggest that: 1) although diversity is largely improved for the pipeline model, noise is also introduced through the latent variables, thereby damaging consistency; 2) the latent variable contains information of the target QAP and reduces the target space, so that consistency is largely improved for the joint model; 3) the sequential model's diversity is only slightly improved by variation, indicating that it is not as sensitive to the latent variable as the other two models.

Ablation Study
To verify the effectiveness of our proposed methods, including region representation scaling and attention alignment, we study their impact on the three models. Table 5 shows the results. The scaling mechanism benefits the consistency of all models; in particular, for the joint model on VQA2.0, the consistent percentage rises from 55.39 to 64.02, an improvement of 8.63. However, it can also reduce diversity, as the diversity metrics drop slightly in most cases, indicating a trade-off between diversity and consistency. On the other hand, removing the attention alignment loss $\mathcal{L}_{attn}$ leads to significantly lower consistency for the joint model, and its diversity is also reduced slightly. This indicates that $\mathcal{L}_{attn}$ contributes to improving consistency without sacrificing diversity.
We also investigate the impact of generation order on diversity and consistency for the pipeline model and the sequential model by comparing models with different generation orders on the two datasets. Figure 5 shows the results. The impact of generation order on the pipeline model is evident: the pipeline model that generates answers first tends to obtain better diversity, while the one that generates questions first obtains better consistency. Compared to the pipeline model, the impact of generation order on the sequential model is smaller, as the variation caused by the generation order is small. However, it shows the opposite trend: the sequential model that generates the question first obtains better diversity, and the one that generates the answer first obtains better consistency. This evidence again indicates the trade-off between diversity and consistency.

Human Evaluation
To better evaluate the quality of the generated QAPs, we conduct diversity and consistency evaluation manually on the three variational models. We randomly select 50 shared images from the test set of VQA2.0 and give their generated QAPs to annotators; for each image, we remove repeated QAPs. The results show that the pipeline model obtains the highest diversity score and the sequential model the lowest, matching the quantitative evaluation. For the consistency score, however, the joint model shows an obvious advantage over the pipeline model. Moreover, the sequential model only achieves a consistency score of 0.73, which does not match the result in Table 4 indicating that more than 97% of its generated QAPs are consistent. We argue that the pipeline and sequential models are good at capturing the linguistic features of QAPs (such as co-occurrence) because there is information flow between the question and the answer, whereas the joint model is blind to such information and can capture it only through the latent variable. Consequently, the joint model obtains lower automatic consistency metrics but a high human evaluation score. This also indicates the benefit of the latent variable for improving the consistency of the joint model.

Case Study
We present several QAPs generated for the second image in Figure 1; the examples are shown in Table 7. As observed, all three models can detect the objects in the image. The sequential model even asks about a nonexistent object, "printer", but it also gives the correct answer "nowhere". However, the examples also reveal some problems. Although they can generate reasonable questions, the joint model and the pipeline model may give wrong answers. The sequential model shows no apparent errors, but it still produces QAPs whose correctness humans cannot judge. This indicates that the models may depend heavily on linguistic features and ignore the association with the image.

Effect to VQG and VQA
To illustrate the effectiveness of VQAPG for downstream applications, we use the three variational models to generate five times the amount of training QAPs to pre-train a VQG model and a VQA model. The implementation of the two models is described in the Appendix. Specifically, we pre-train the two models on the generated corpus and fine-tune them on the original training set. For fine-tuning, the number of epochs is 20 and the initial learning rate is 3e-4, while other hyper-parameters remain unchanged (see Appendix). We use BLEU-4, METEOR, and ROUGE-L as evaluation metrics. Table 8 shows the performance of the baseline VQG and VQA models and their three fine-tuned versions. We observe that pre-training on the corpora generated by the variational pipeline model and the sequential model improves VQG and VQA more significantly than the joint model. Overall, this evidence indicates the benefit of the VQAPG task to VQG and VQA. Another observation is that the variational pipeline model and the variational sequential model contribute differently to VQG and VQA. Referring to the earlier analysis of diversity and consistency, we conclude that VQG benefits more from diversity, while VQA benefits more from consistency.

Conclusion
In this paper, we propose a novel task, VQAPG, which generates question-answer pairs from images, and we propose three models to perform it. Targeting diversity and consistency, we integrate variational inference into these models and propose a series of techniques, including region representation scaling and attention alignment. To evaluate consistency automatically, we devise an evaluator. We evaluate our models on two datasets, VQA2.0 and Visual-7w. The results show that each model has its own merits; overall, they perform well at generating diverse and consistent QAPs. On the other hand, there are still limitations in our work. For example, there is a trade-off between diversity and consistency; the generated questions are typically one-hop, requiring no extra reasoning; the latent variable is uncontrollable and can introduce unexpected linguistic features to the decoder, producing inconsistent QAPs; and the consistency evaluator needs a more robust training strategy. In future work, we will explore generating consistent, deeper question-answer pairs.

A Autoregressive Generation
We adopt autoregressive generation for all sequences in this paper; therefore, every likelihood above can be further expanded. For example:
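The example expansion is not reproduced above; for a question of length m, the standard left-to-right factorisation reads

$$P_\Omega(Q \mid I, Z) = \prod_{i=1}^{m} P_\Omega\big(q_i \mid q_{<i}, I, Z\big),$$

and the other likelihoods expand analogously.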

B Derivation of ELBO B.1 The Pipeline Model
The pipeline model contains two sub-models, so there are two ELBOs. First, for the X-model, we start from the Kullback-Leibler (KL) divergence between the estimated posterior distribution and the true posterior distribution of $Z_X$, expand it, and rearrange terms. Because we use $P_{\Phi_X}(Z_X \mid I)$ to estimate the true prior $P(Z_X \mid I)$ and $P_{\Omega_X}(X \mid I, Z_X)$ to estimate the likelihood $P(X \mid I, Z_X)$, we obtain the ELBO of the X-model. Similarly, for the Y-model, we start from the KL divergence between the estimated posterior and the true posterior of $Z_Y$; following the same derivation as for the X-model, we obtain the ELBO of the Y-model.
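The displayed derivation is not reproduced above; written in the notation of the paper, the resulting bounds (a reconstruction from the surrounding text) are

$$\log P(X \mid I) \;\ge\; \mathbb{E}_{P_{\Psi_X}(Z_X \mid I, X)}\big[\log P_{\Omega_X}(X \mid I, Z_X)\big] - \mathrm{KL}\big(P_{\Psi_X}(Z_X \mid I, X)\,\|\,P_{\Phi_X}(Z_X \mid I)\big),$$

$$\log P(Y \mid I, X) \;\ge\; \mathbb{E}_{P_{\Psi_Y}(Z_Y \mid I, X, Y)}\big[\log P_{\Omega_Y}(Y \mid I, X, Z_Y)\big] - \mathrm{KL}\big(P_{\Psi_Y}(Z_Y \mid I, X, Y)\,\|\,P_{\Phi_Y}(Z_Y \mid I, X)\big).$$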

B.2 The Joint Model
We again start from the KL divergence between the estimated posterior and the true posterior of Z and expand it. We estimate the true prior $P(Z \mid I)$ with $P_\Phi(Z \mid I)$. Since the joint model assumes $P(Q, A \mid I) = P(Q \mid I)P(A \mid I)$, the likelihood $P(Q, A \mid I)$ is estimated as $P_{\Omega_Q}(Q \mid I)\,P_{\Omega_A}(A \mid I)$. This yields the ELBO for the joint model.
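The resulting bound (again a reconstruction from the surrounding text) is

$$\log P(Q, A \mid I) \;\ge\; \mathbb{E}_{P_\Psi(Z \mid I, Q, A)}\big[\log P_{\Omega_Q}(Q \mid I, Z) + \log P_{\Omega_A}(A \mid I, Z)\big] - \mathrm{KL}\big(P_\Psi(Z \mid I, Q, A)\,\|\,P_\Phi(Z \mid I)\big).$$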

B.3 The Sequential Model
As the sequential model concatenates the question and answer into an integral sequence, the derivation of its ELBO is the same as that of the X-model in the pipeline model.

C.1 Diversity
We ask human annotators to inspect the diversity of a group of QAPs generated from a common image and to score them from two aspects: the question types and the objects that appear. The annotator computes the percentage of unique question types and of unique objects, and we average the two as the diversity score for the QAP group. Take the results of the joint model in Table 7 of the paper as an example: this group of QAPs includes four question types ("what kind", "what is", "what brand", and "what color"), and the objects that appear are "computer" and "screen". Therefore, the diversity score for this group is (4/4 + 2/4)/2 = 0.75. The overall score of all samples is the average over all groups. We also take the number of QAPs into account: we normalize each model's final diversity score by a ratio computed by dividing the count of the model's generated QAPs by the maximum count over all models. The final diversity score is the average over all annotators.

C.2 Consistency
Human annotators evaluate the consistency of each QAP from two aspects as well: whether the question is answerable for the given image, and whether the answer is correct. The scoring criteria are: 0, the question is not answerable; 1, the question is answerable but the answer is incorrect; 2, the question is answerable and the answer is correct. Answers whose correctness the annotators cannot judge are taken as consistent. Take the output of the joint model in Table 7 of the paper as an example: the score of "What is on the screen? Windows" and of "What brand is the computer? Apple" is 2, while the score of "What color is the computer? Unknown" is 1. This score is then divided by 2 to obtain the final consistency score. As with the diversity evaluation, the final consistency score is the average over all annotators.

D.1 Automatic Evaluation Results
In addition to the diversity and consistency metrics introduced in the paper, we also measure n-gram based metrics, including BLEU-4, METEOR, and ROUGE-L. Table 9 shows the results. As we can see, all the baseline models achieve higher n-gram scores than the variational models; among the three models, the pipeline model drops the most from the baseline to the variational version. On the Visual-7w dataset, the baseline sequential model wins on all metrics, whereas on the VQA2.0 dataset, the baseline joint model wins on BLEU and ROUGE. Comparing with the results in the paper, we find that better BLEU, METEOR, or ROUGE does not mean better diversity and consistency. Therefore, such n-gram based metrics are insufficient to measure diversity and consistency, although they reflect the overlap between generated results and gold references.

D.2 Region Representation Scaling
To inspect the effectiveness of our proposed region representation scaling mechanism, we randomly select an image and visualize the region weights w of every model. The results are shown in Figure 6. The pipeline model generates the answer first. As we can see, the answer-model allocates larger weights to a few regions and generates the answer "tennis", indicating that it successfully extracts the answer information from the latent variable, while the question-model allocates weights more broadly and produces the question "What sport is this?", indicating that question generation needs to focus on more regions of the image. Both the joint model and the sequential model behave similarly to the question-model of the pipeline, i.e., they attend to more regions for generation. Although the joint model generates a consistent QAP, the visualized weights indicate that the answer "man" refers to the man behind. The sequential model is an even clearer example: by analyzing its generated QAP, we can easily find that the answer "Blue" refers to the man behind, instead of the man in front. All this evidence implies that the three models can mine the QAP information from the latent variables.

D.3 Attention Alignment
To inspect our proposed attention alignment mechanism for the joint model, we also visualize the average attentions of the question decoder and the answer decoder on a randomly selected example, as shown in Figure 7 and Figure 8. We sample two QAPs generated by the joint model: one is "Where was this photo taken? In the city.", and the other is "How many people are there? One.". Obviously, the first is consistent, while the second is not, because the right answer should be "two". Figure 7 shows the two average attentions for the consistent QAP. As we can see, the two attentions are very close, indicating that the attention alignment mechanism works successfully for this QAP.
On the other hand, Figure 8 shows the two attentions for the inconsistent QAP. We can easily see the divergence between them. More interestingly, the answer is "one" and its attention covers only the person on the left, whereas the correct answer is "two" and the question attention correspondingly covers both people in the image.

E Model Size
We report the parameter counts of all models in Table 10. As observed, the model size decreases from the pipeline model to the joint model, and the variational pipeline model is 50M parameters larger than the variational sequential model. Since the capacity of smaller models is limited, the model size provides a possible explanation for the sequential model's disadvantage in diversity.
