QaDialMoE: Question-answering Dialogue based Fact Verification with Mixture of Experts



Introduction
Fact verification, aiming to validate the factuality of claims against a corpus of documents, is an important NLP area (Cohen et al., 2011) and has been explored for various applications such as detecting fake news, rumors, and deceptive opinions (Rashkin et al., 2017; Thorne et al., 2018; Goodrich et al., 2019; Vaibhav et al., 2019; Kryscinski et al., 2020). The majority of existing research focuses on media such as news, tables and Wikipedia passages (Guo et al., 2021; Bekoulis et al., 2021), while rarely considering fact verification in question-answering dialogue. In the dialogue safety domain, related studies either focus on enabling dialogue agents to resist adversarial attacks (Dinan et al., 2019a) or on forestalling aggressive or biased responses from dialogue agents (Henderson et al., 2018; Sap et al., 2019; Xu et al., 2020).

Figure 1: Two examples of question-answering dialogue based fact verification. The first response is retrieved from a real-world question about COVID-19. The second response is derived from a QA corpus using an ambiguous information-seeking question.
However, misinformation online can spread quickly and cause public health crises through the abuse of dialogue agents, especially for questions about the COVID-19 pandemic (Naeem et al., 2021). The first example in Figure 1 shows a popular question about COVID-19 asked by information seekers online. Question-answering dialogue may be more vulnerable to manipulation, since Internet users can answer a question with multiple facts or with speculative and vague expressions (Sarrouti et al., 2021) that deliberately spread misinformation. To improve the robustness of fact verification systems, they must also be able to verify the responses in question-answering dialogues.
Most previous work on fact verification focuses on reasoning against pieces of evidence from Wikipedia passages, while rarely considering the questions posed by Internet users. However, the questions also contain rich information that can support fact verification. Figure 1 shows two examples of question-answering dialogue based fact verification, where the questions were posed by real users. We can see that the questions contain critical parts (e.g., "animals", "people", "who"), which indicate the confusion of information seekers and the vulnerable parts of the responses. Taking the above into consideration, we explore fact verification in question-answering dialogue and investigate how to exploit questions in the verification process.
In this paper, we present QaDialMoE, a neural network approach for Question-answering Dialogue based Fact Verification with a Mixture of Experts. Inspired by the application of mixture of experts in both dialogue systems (Le et al., 2016a) and fact verification (Zhou et al., 2022), we implement each expert with the same neural architecture but let it focus on a different part of the input (e.g., the relationship between the response and the question). Specifically, to make our approach more generalizable, we propose a prompt module to generate questions in case the original data only has responses. Each expert then takes the same feature, the output of the feature extractor module, as input and learns to deal with the meaning of questions, responses and evidence. We design a management module to guide the training of the experts by assigning a unique attention score to each expert, and to combine their verification results efficiently. However, previous models tend to incorrectly predict a response as SUPPORTED when there is a significant word overlap between the response and the evidence; similarly, they incorrectly predict SUPPORTED or REFUTED for an NEI response because of word overlap. Note that NEI is short for NOTENOUGHINFO. To alleviate this problem, we introduce an attention guidance module that generates a prior assumption and guides the manager to pay more attention to the input part with little word overlap.
We conduct experiments on three benchmark datasets: HEALTHVER (Sarrouti et al., 2021), FAVIQ (Park et al., 2022) and COLLOQUIAL (Kim et al., 2021). Experimental results demonstrate that our model outperforms previous systems by a large margin and achieves new state-of-the-art results on all of them. The main contributions of this paper can be summarized as follows: • We explore fact verification in question-answering dialogue. To the best of our knowledge, this is the first study to investigate question-answering dialogue based fact verification and thereby improve the applicability of fact verification systems.
• We introduce a mixture of experts and a manager with an attention guidance module for question-answering dialogue based fact verification, aiming to exploit questions and evidence efficiently in the verification process.
• We propose a prompt module to make our approach more generalizable, which can generate questions from given responses.
• Our approach achieves new state-of-the-art results on all experimental benchmarks, outperforming previous approaches by a large margin.

Task Background
In this paper, we study the task of question-answering dialogue based fact verification. Given a question Q, a response R and evidence E from Wikipedia passages, the goal is to verify the factuality of the response against the question and the evidence with the label SUPPORTED or REFUTED. Beyond SUPPORTED and REFUTED, the classification task can have one more label, NEUTRAL or NEI, meaning there is not enough information to make a decision; the task then becomes a 3-way classification task.
Prior works have used question-answering dialogue data to create fact verification benchmarks (Demszky et al., 2018; Pan et al., 2021; Sarrouti et al., 2021; Park et al., 2022). Most fact verification processes only use evidence and rarely consider questions. However, we believe that the additional question, which contains rich information, helps support the final prediction. In this study, we employ questions and evidence to validate the responses, which we formulate as the question-answering dialogue based fact verification task.

Methodology
In this section, we present our proposed framework QaDialMoE, which leverages a set of experts to simultaneously consider the meaning of questions, responses and evidence from Wikipedia passages. The overall model structure is illustrated in Figure 2. Our method consists of three components: the feature extractor (§3.1) with a prompt module and a transformer encoder backbone, the mixture of experts module (§3.2) for dealing with different parts of the input, and the management module (§3.3) for guiding the training of the experts and combining their verification abilities effectively.

Feature Extractor
In this section, a prompt module is proposed to generate questions from given responses. Subsequently, a transformer-based encoder parses the response-question (original or synthetic)-evidence pair and learns their joint semantics representations.

Prompt Module
Since question-answering dialogue based fact verification is still underexplored, few benchmark datasets use question-answering dialogue to retrieve or create responses (Sarrouti et al., 2021; Park et al., 2022). To make our approach more generalizable and to explore the effectiveness of question-answering dialogue in the fact verification task, we propose a prompt module to generate questions. Specifically, we use only the responses in the original data as input, which can be easily generalized to more datasets. Then we leverage a question-generation model to synthesize questions from the given input. The synthetic questions are further passed to the transformer encoder layers to learn the response-question-evidence joint semantics (§3.1.2), and to an attention guidance module for generating prior assumptions (§3.3.1). In this paper, we implement it with T5 (Raffel et al., 2019), a transformer-based pre-trained model.

Joint Representation Learning
Given the response-question-evidence pair (§3.1.1), we construct a transformer-based encoder (Vaswani et al., 2017) to capture the joint semantics representation. Specifically, we tokenize the response-question-evidence pair r, q, e into three token sequences R, Q and E. Then the joint token sequence L_{r,q,e} = [⟨s⟩, R, ⟨/s⟩, Q, E, ⟨/s⟩], where ⟨s⟩ and ⟨/s⟩ are the separators that indicate the beginning and the end of each token sequence. Then we feed the joint token sequence into a transformer-based encoder to learn the contextualised representation embedding:

H = f_LM(L_{r,q,e})  (1)

where H ∈ R^{n×d} denotes the learned joint semantics representation. Here n is the maximum length of the input and d is the representation vector dimension. f_LM refers to the joint representation learning process of the transformer encoder. Finally, the joint semantics representation vectors are delivered to the experts (§3.2) and the manager (§3.3.2) for reasoning and management, respectively.
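As a concrete illustration, the construction of the joint token sequence L_{r,q,e} can be sketched as follows; a whitespace split stands in for the real subword tokenizer, and the separators are written as RoBERTa-style `<s>`/`</s>` strings (both assumptions of this sketch):

```python
def build_joint_sequence(response: str, question: str, evidence: str) -> list:
    """Build L_{r,q,e} = [<s>, R, </s>, Q, E, </s>] from raw strings.

    str.split is a stand-in for the transformer's subword tokenizer.
    """
    R, Q, E = response.split(), question.split(), evidence.split()
    return ["<s>"] + R + ["</s>"] + Q + E + ["</s>"]

seq = build_joint_sequence(
    "COVID-19 cannot infect animals",
    "can animals get COVID-19",
    "several animal species have tested positive for SARS-CoV-2",
)
```

In the full model this sequence is then fed to the transformer encoder, whose output H is shared by all experts and the manager.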

Mixture of Experts Module
In this part, a mixture of experts (MoE) module verifies the responses separately based on the joint semantics representation H extracted by (1). We adopt three experts to focus on different parts of the joint semantics representation, since the questions and the evidence can support the final prediction by interacting with the responses jointly or separately. Specifically, a question expert focuses on the interaction between responses and questions, an evidence expert works on the interaction between responses and evidence, and a global expert takes responses, questions and evidence all into consideration.
However, different structures specially designed for the interactions among responses, questions and evidence would limit the generalization of the proposed framework to other datasets. Inspired by Zhou et al. (2022), we implement each expert with the same general neural architecture but with different parameter learning strategies. Specifically, each expert is designed as a stack of transformer encoding layers to obtain the final representation h. Then we feed h into a classifier to predict the probability of each label. The process above is formulated as follows:

h_i = f_Enc_i(H)  (2)
p_i = softmax(W_2^i tanh(W_1^i h_i))  (3)

Figure 2: We introduce an attention guidance module to generate a prior assumption (R, Q and E mean response, question and evidence, respectively) and guide the manager's assignment; the manager then summarizes the full output of the experts as the final prediction.
Here, f_Enc_i, i = 1, ..., n_e are the expert encoder networks and h_i ∈ R^d refers to the final representation vector encoded by the i-th expert, which implies a different understanding of the relationship among responses, questions and evidence. The probability vector p_i is the prediction result of the i-th expert; W_1^i and W_2^i are the trainable matrices of the i-th expert's classifier, which project h_i to the probabilities p_i. tanh and softmax are activation functions.
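A minimal NumPy sketch of the per-expert classifier p_i = softmax(W_2^i tanh(W_1^i h_i)) described above; the dimensions and the random stand-in for each expert's encoder output are illustrative, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_labels, n_experts = 16, 3, 3  # illustrative sizes, not the paper's

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Each expert owns its own classifier parameters W_1^i and W_2^i.
experts = [
    {"W1": rng.normal(size=(d, d)), "W2": rng.normal(size=(n_labels, d))}
    for _ in range(n_experts)
]

def expert_predict(expert, h):
    """p_i = softmax(W_2 tanh(W_1 h_i)) for one expert's representation h_i."""
    return softmax(expert["W2"] @ np.tanh(expert["W1"] @ h))

h = rng.normal(size=d)  # stand-in for an expert's encoder output h_i
probs = [expert_predict(e, h) for e in experts]
```

In the full model each expert encodes H with its own transformer stack, so the representation h differs per expert; here a single random vector is reused for brevity.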

Management Module
An attention guidance module is proposed to generate prior assumptions based on the response-question-evidence pair and guide the manager. The manager is designed to guide the experts' training and ensemble the results from all experts; it is implemented based on the transformer (Vaswani et al., 2017) model.

Attention Guidance
Previous evidence-based fact verification models often make incorrect predictions due to the word overlap issue. Since question-answering dialogue has both a question part and an evidence part to verify the response against, an attention guidance module generates a prior assumption that can represent the interactions among responses, questions and evidence, and guides the manager (§3.3.2) to focus more on questions or evidence based on their degree of overlap with the response.
Specifically, the attention guidance module generates the prior assumption a_G based on the response-question-evidence pair (§3.1.1). In this paper, we consider three interactions: the response-question pair, the response-evidence pair and the response-question-evidence pair. We calculate the prior assumption a_G as follows:

1. Initialize a prior assumption z_0 ∈ R^3, which is empirically set as z_0 = ((z_0)_0, (z_0)_1, (z_0)_2)^T = (0.2, 0.2, 0.6)^T. The component (z_0)_2 represents the questions and the evidence interacting with the responses jointly. It is always set higher than the other values since we anticipate that this interaction can combine all the information efficiently.
2. Initialize a zero bias vector δ ∈ R^3 and calculate the response-question pair and response-evidence pair similarity scores s by TF-IDF. Then the similarity scores are accumulated into the bias vector δ:

δ_i = δ_i + a_i · s_i  (4)

where δ_i is the i-th dimension of the bias vector and s_i is the similarity score accumulated to δ_i (e.g., the TF-IDF similarity score of the response and the question). a_i ∈ (0, 0.4) is an incremental rate (set empirically) for the i-th dimension of the bias vector δ.
3. Add the bias vector δ to the initialized assumption z_0 and normalize to obtain the prior assumption: a_G = softmax(z_0 + δ).
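The three steps above can be sketched as follows. A plain token-overlap score stands in for the TF-IDF similarity, and, following the note that the manager should attend more to the low-overlap part, the accumulated score is taken here as the inverse similarity 1 - s; both are assumptions of this sketch, not the paper's exact formulation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def similarity(a: str, b: str) -> float:
    """Token-overlap stand-in for the TF-IDF similarity score s."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def prior_assumption(response, question, evidence, rate=0.3):
    # Step 1: initial assumption; the joint interaction (index 2) is highest.
    z0 = np.array([0.2, 0.2, 0.6])
    # Step 2: accumulate (inverse) similarity scores into the bias vector,
    # scaled by the empirically set incremental rate a_i in (0, 0.4).
    delta = np.zeros(3)
    delta[0] += rate * (1.0 - similarity(response, question))
    delta[1] += rate * (1.0 - similarity(response, evidence))
    # Step 3: add the bias and normalize: a_G = softmax(z0 + delta).
    return softmax(z0 + delta)

a_G = prior_assumption(
    "covid-19 cannot infect animals",
    "can animals get covid-19",
    "several animal species have tested positive",
)
```

With rate below 0.4, the joint-interaction component stays the largest, matching the intent that (z_0)_2 dominates the prior.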
The prior assumption a_G is used to teach the manager to assign scores reasonably against the attention scores a_M introduced in §3.3.2. Meanwhile, it can alleviate the "imbalanced experts" phenomenon reported in previous studies (Eigen et al., 2013; Zhou et al., 2022), in which the manager keeps assigning a close-to-1 attention score to one well-trained expert and close-to-0 scores to the other experts, which are then not trained efficiently.

Manager
We present a manager module to guide the training of the experts. The manager encodes the joint semantics representation H and generates attention scores:

a_M = softmax(W_2^M tanh(W_1^M Enc_M(H)))  (5)

where Enc_M is the encoder of the manager module, and W_1^M and W_2^M are trainable parameters. The manager has the same network architecture as the experts, differing only in the number of encoder layers and the dimension of the output.
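A sketch of the manager's scoring and its summarizing of the experts' predictions; the shapes, the random stand-in for Enc_M(H), and the attention-weighted sum used for the final prediction are illustrative assumptions of this sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_experts, n_labels = 16, 3, 3  # illustrative sizes

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

W1_M = rng.normal(size=(d, d))
W2_M = rng.normal(size=(n_experts, d))

def manager_scores(h_M):
    """a_M = softmax(W_2^M tanh(W_1^M h_M)): one attention score per expert."""
    return softmax(W2_M @ np.tanh(W1_M @ h_M))

h_M = rng.normal(size=d)               # stand-in for Enc_M(H)
a_M = manager_scores(h_M)
expert_probs = np.array([[0.7, 0.2, 0.1],  # stand-in expert predictions p_i
                         [0.5, 0.3, 0.2],
                         [0.6, 0.3, 0.1]])
final = a_M @ expert_probs             # attention-weighted summary of experts
label = int(final.argmax())
```

Because the attention scores and each expert's probabilities both sum to one, the weighted summary is itself a valid distribution over the labels.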
The attention scores a_M and the prior assumption a_G are used to guide the experts' training and to teach the manager to assign scores reasonably; both are implemented via the specially designed losses introduced in §3.4.

Loss
In this part, we develop two loss functions, i.e., the verification loss L_V and the guidance loss L_G. The former is a weighted sum of the classification loss from each expert, weighted by the attention scores assigned to the experts by the manager, as in (6). The latter measures the difference between the prior assumption and the attention scores, and guides the manager to assign reasonable attention scores to the experts, as given in (7). We jointly optimize our model by minimizing a weighted sum of these two terms:

L = L_V + λ L_G

where λ is a hyperparameter that controls the ratio of L_G. The details of these two loss functions are provided in the subsequent paragraphs.

Verification Loss
We calculate each expert's cross-entropy loss independently and weight it by the attention scores a_M:

L_V = Σ_{i=1}^{n_e} (a_M)_i · H_CE(p_i, l)  (6)

where n_e is the number of experts and (a_M)_i is the attention score assigned by the manager to the i-th expert. The probability vector p_i is the prediction of the i-th expert, l is the ground-truth label of the response and H_CE(·, ·) refers to the cross-entropy loss function.
Guidance Loss To alleviate the "imbalanced experts" phenomenon mentioned in §3.3.1, we develop another loss function L_G, which calculates the logarithmic difference between the prior assumption a_G and the attention scores a_M:

L_G = D_KL(a_G || a_M)  (7)

where D_KL(·||·) stands for the Kullback-Leibler divergence. By minimizing L_G, the manager learns to assign reasonable attention scores to the experts, i.e., it scores each expert based on the interactions presented in §3.3.1.
Besides, the training of the experts becomes more balanced due to the loss function L_G.
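The two losses and their weighted sum can be sketched as below; the KL direction D_KL(a_G || a_M), the λ value and the example numbers are illustrative assumptions of this sketch:

```python
import numpy as np

def cross_entropy(p, label):
    """H_CE(p_i, l): negative log-probability of the gold label."""
    return -float(np.log(p[label]))

def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

def total_loss(expert_probs, label, a_M, a_G, lam=0.5):
    # L_V: attention-weighted sum of per-expert cross-entropy losses.
    L_V = sum(a * cross_entropy(p, label) for a, p in zip(a_M, expert_probs))
    # L_G: divergence between the prior assumption and the manager's scores.
    L_G = kl_divergence(a_G, a_M)
    # L = L_V + lambda * L_G
    return L_V + lam * L_G

expert_probs = [np.array([0.7, 0.2, 0.1]),
                np.array([0.5, 0.3, 0.2]),
                np.array([0.6, 0.3, 0.1])]
a_M = np.array([0.3, 0.3, 0.4])
a_G = np.array([0.25, 0.25, 0.5])
loss = total_loss(expert_probs, label=0, a_M=a_M, a_G=a_G)
```

Note that L_G vanishes exactly when the manager's scores match the prior assumption, which is how the guidance counteracts the "imbalanced experts" phenomenon.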
Experiments

Datasets
We conduct experiments on three benchmark datasets: HEALTHVER (Sarrouti et al., 2021), FAVIQ (Park et al., 2022) and COLLOQUIAL (Kim et al., 2021). These datasets are introduced as follows; their statistics are shown in Appendix A.
HEALTHVER HEALTHVER (Sarrouti et al., 2021) contains 14,330 real-world responses retrieved by a search engine for 80 popular questions about COVID-19. Each instance in HEALTHVER consists of a question, evidence from a scientific article and a response manually annotated as SUPPORT, REFUTE or NEUTRAL. Macro precision, macro recall, macro F1-score, and accuracy are used to evaluate the effectiveness of our model on HEALTHVER.
FAVIQ FAVIQ (Park et al., 2022) is a large-scale fact verification dataset constructed from information-seeking questions (Kwiatkowski et al., 2019) and their ambiguities (Min et al., 2020). The data consists of two sets (A and R): the FAVIQ A set is obtained from ambiguous question-answering pairs, while the FAVIQ R set uses the reference answer and an incorrect prediction to generate responses. Most instances in FAVIQ include a question, evidence from Wikipedia passages and a response annotated as SUPPORT or REFUTE. However, the questions for the A test set are hidden since the A set is made from AmbigQA (Min et al., 2020), and there is a leaderboard for the competition. For the A set, we only use the A dev set and do not generate questions for the A test set. We use accuracy as our evaluation metric.
COLLOQUIAL COLLOQUIAL (Kim et al., 2021) is constructed by transferring the style of claims from FEVER (Thorne et al., 2018) into colloquialism. The data is challenging due to the characteristics of colloquial claims. Since most question-answering dialogue responses are colloquial in style, our model can be easily generalized to this dataset: the prompt module can generate questions for the claims. Finally, each instance in COLLOQUIAL consists of a synthetic question, evidence from Wikipedia passages and a response labeled SUPPORTED, REFUTED or NEI. We use label accuracy as our evaluation metric.

Implementation Details
We download pre-trained models from huggingface. For the prompt module, we utilize 't5-small-squad2-question-generation', which is built on the SQuAD 2.0 dataset (Rajpurkar et al., 2018). We use 12 transformer encoder layers for the feature extractor and the experts, and 2 for the manager, and we leverage RoBERTa-Large to initialize the parameters of the feature extractor and the experts. For the MoE module, we set n_e = 3, meaning three experts in our implementation of QaDialMoE.

Main Results
FAVIQ For evidence retrieval on FAVIQ, we use three ways to obtain evidence E: (1) DPR (Qu et al., 2021), (2) the evidentiality-guided generator (EG) (Asai et al., 2021), and (3) the positive evidence (PE) in the original dataset. First, we use k passages as evidence (k = 3), retrieved by DPR, a dual-encoder based model. Following Park et al. (2022), this baseline is jointly trained on the A set and the R set. Second, the generator uses a leave-one-out approach (Asai et al., 2021) to evaluate which evidence provides sufficient information. Third, the positive evidence is the top passage containing the answer to the original question, retrieved by TF-IDF (Park et al., 2022). Table 2 presents the performance of various models on FAVIQ. For the A set, we evaluate the performance of our approach with the evidence from the above three ways. QaDialMoE achieves a new state-of-the-art with an accuracy of 78.7% using the positive evidence. Note that QaDialMoE outperforms prior methods based on the same evidence, achieving significant improvements of 3.9% (70.8% vs. 66.9%) and 5.3% (74.9% vs. 69.6%) for DPR and EG, respectively. For the R set, we evaluate our approach with DPR and the positive evidence. Since the baseline with DPR is jointly trained on the A set and the R set, where the R set mainly provides a source for data augmentation, QaDialMoE achieves comparable results on the R set but improves considerably on the A set. However, QaDialMoE with the positive evidence reaches remarkable performances of 86.1% on the dev set and 86.0% on the test set. In short, QaDialMoE achieves a new state-of-the-art result on the large-scale challenging fact verification benchmark FAVIQ.

COLLOQUIAL Table 3 shows the performance of various models on the test set of COLLOQUIAL, where QaDialMoE again obtains a new state-of-the-art label accuracy of 89.5%.
This improvement is also surprising to us, since COLLOQUIAL does not have original questions and we generate synthetic questions with our prompt module. Colloquial claims tend to include filler words (e.g., "yeah", "you know"), comments, or personal opinions which do not require verification. However, our synthetic questions may help the model focus more on what requires verification and ignore the above-mentioned distractions.

Ablation Study
We further investigate the impact of question quality with an ablation study on the FAVIQ A dev set. Specifically, the prompt module generates questions for both the FAVIQ A train set and dev set. The results are shown in Table 4:

Model                       Accuracy
QaDialMoE + EG              74.9
  w/ synthetic questions    69.7 (-5.2%)
QaDialMoE + PE              78.7
  w/ synthetic questions    75.9 (-2.8%)

It is clear that our QaDialMoE model drops significantly, by 5.2% with the evidentiality-guided generator and 2.8% with the positive evidence. The prompt module can provide high-quality questions, but they are far less effective than the original questions. Note that the effectiveness with the evidentiality-guided generator drops more noticeably than with the positive evidence, which verifies that high-quality questions and evidence play an equally important role.
Effects of the guidance of the manager. We have two ablative groups, as shown in Table 5. w/o L_G: We conduct an ablation study on the HEALTHVER dataset without the guidance loss L_G. As presented in Table 5, the QaDialMoE model drops by 1.1% (83.29% vs. 82.01%) and 1.2% (84.26% vs. 83.04%) in macro F1-score and accuracy when trained without the guidance loss. Meanwhile, the "imbalanced experts" phenomenon (§3.3.1) is discussed in more detail in §4.4.3.
w/ fixed a_G: We initialize the prior assumption a_G with the same weight for each dimension and do not use the inverse TF-IDF similarity for correction, meaning that each expert is treated as equally important. The QaDialMoE model also drops significantly with this setting of the prior assumption. Besides, we find that this model is even less effective than the model without the guidance loss, which means that a negative setting of the prior assumption may lead to inferior results. The Full model, with the inverse TF-IDF similarity correcting the prior assumption, is a positive setting: as aforementioned, because of the word overlap issue we guide the manager to pay more attention to the input part with little word overlap, which proves to perform well.

Analyzing Experts Differentiation
To further understand the effectiveness of the proposed framework, we investigate the differentiation of the experts, i.e., whether the model achieves balanced training across experts based on the prior assumption from the attention guidance. As shown in Figures 3a and 3b, each expert is well trained and the "imbalanced experts" phenomenon does not occur. Note that no single expert always outperforms the others, which illustrates that the experts behave independently due to the various interactions among responses, questions and evidence. However, as shown in Figure 3c, once training is performed without the guidance loss, there is only one well-trained expert, and the performance of the other two experts stays around 50% as training proceeds. The model essentially degenerates to a single working RoBERTa-Large, a simpler model that is far less effective than the Full model.

Related Work
Fact Verification To mitigate the spread of false information online, the fact verification task has gained widespread attention recently. Previous studies on fact verification are mainly based on pieces of evidence from Wikipedia articles (Thorne et al., 2018; Hanselowski et al., 2018; Yoneda et al., 2018; Thorne et al., 2019; Nie et al., 2019). Since the proposal of TABFACT, a large dataset for table-based fact verification, studies against semi-structured evidence have attracted much attention (Shi et al., 2020; Eisenschlos et al., 2020; Shi et al., 2021). However, fact verification in question-answering dialogue is still an underexplored area. Gupta et al. (2021) explored fact verification for the dialogue context, curated by converting grounding dialogues from the Wizard-of-Wikipedia (Dinan et al., 2019b) dataset. Meanwhile, several works have used question-answering dialogue data to construct fact verification benchmarks (Demszky et al., 2018; Pan et al., 2021; Sarrouti et al., 2021; Park et al., 2022). Different from previous works, we formulate the question-answering dialogue based fact verification task, which focuses on the various interactions among responses, questions and evidence that support the verification process.

Mixture of Experts
Mixture of experts is an ensemble learning method first introduced by Jacobs et al. (1991), which is used to divide the problem space into homogeneous regions (Baldacchino et al., 2016). Specifically, it first decomposes a task into sub-tasks and then trains an expert model on each sub-task; a gating model is applied to learn which expert is competent and to combine the predictions. Mixture of experts has been applied in various fields, such as dialogue systems (Le et al., 2016b), content recommendation (Ma et al., 2018) and image classification (Riquelme et al., 2021). Zhou et al. (2022) develop a mixture-of-experts framework for table-based fact verification, where each expert is used to deal with a different type of reasoning. In this paper, we leverage a mixture-of-experts module to recognize and execute various interactions among responses, questions and evidence for the question-answering dialogue based fact verification task.

Conclusion
In this paper, we present QaDialMoE, a new method for fact verification in question-answering dialogue that exploits a mixture of experts to focus on various interactions among responses, questions and evidence. We also generate synthetic questions with a prompt module to make our approach more generalizable. A manager with an attention guidance module is applied to guide the training of the experts and assign a reasonable attention score to each expert. Experimental results on three datasets, HEALTHVER, FAVIQ and COLLOQUIAL, demonstrate that QaDialMoE outperforms previous approaches by a large margin and achieves new state-of-the-art results on all of them. The ablation studies and analysis further indicate that questions and evidence play an equally important role in our proposed framework. We hope our work can facilitate fact verification in the question-answering dialogue domain, and open the way to efficiently exploiting questions and evidence in the verification process.

Limitations
The first limitation of our approach is the quality of the synthetic questions. As mentioned above, we employ the pre-trained model T5 to generate questions. It works well and can provide high-quality questions, but they are far less effective than the original questions. In practice, when the quality of both questions and evidence is low, the performance of the model drops significantly. The second limitation is that the task of verifying responses in common dialogue cannot benefit from our proposed framework. We tried to apply QaDialMoE to DialFact (Gupta et al., 2021), a benchmark for fact verification in dialogue, where the questions in the input are replaced by the dialogue context. However, QaDialMoE does not show a significant advantage over the best method. We attribute this to two factors: first, common dialogue based fact verification requires more sophisticated interactions among responses, dialogue contexts and evidence, since common dialogues are more informal than question-answering dialogues; second, the benchmark consists of multi-turn dialogues in which the final response needs to be validated. Rather than simply replacing questions with dialogue contexts as part of the input, we may need to model the relationships between dialogue contexts.

A Datasets Statistics
Table 6 shows the statistics of HEALTHVER (Sarrouti et al., 2021), a dataset for fact verification of health-related real-world responses. The whole dataset is split into three subsets for training, validation and testing by claims, and is thus balanced class-wise. Table 7 shows the statistics of FAVIQ (Park et al., 2022), a large-scale challenging fact verification dataset which consists of 188k claims. FAVIQ-A is created from ambiguous questions, while FAVIQ-R includes claims from regular question-answering dialogues. Table 8 shows the statistics of COLLOQUIAL (Kim et al., 2021), which transfers the claims from FEVER (Thorne et al., 2018) into a colloquial style.
Each claim in COLLOQUIAL has three more words on average than in FEVER. We generate questions for the whole test set of COLLOQUIAL, and for parts of the train and validation sets.

Example 1: • FEVER (REFUTES): Barack Obama will forgo a presidential library in favor of a presidential science museum.
• Colloquial claim: Oh yeah. Obama decided to forgo building a presidential library in favor of building a presidential science museum.
• Synthetic question: What did Obama decide to forgo?
Example 2: • FEVER (REFUTES): Transformers: Revenge of the Fallen grossed a total of 836.4 million dollars worldwide.
• Colloquial claim: Yes they were, Transformers Revenge of the Fallen grossed 836.4 million dollars worldwide.
• Synthetic question: How many dollars did Transformers Revenge of the Fallen gross?
Example 3: • Colloquial claim: I remember the song "Yung Rich Nation". It was produced by Zaytoven.
• Synthetic question: What song was produced by Zaytoven?
Example 4: • FEVER (SUPPORTS): Henry VIII of England had a war with the Holy Roman Emperor Charles V.
• Colloquial claim: Yep! Henry VIII, starting the war with Charles V.
• Synthetic question: Who started the war with Charles V?
Example 5: • FEVER (SUPPORTS): An Emmy award was won by Mad Men.
• Colloquial claim: Yes, it is! Mad Men actually won an Emmy award.
• Synthetic question: What award did Mad Men win?
FAVIQ We generate questions for the FAVIQ A dev set to investigate the impact of question quality with an ablation study. Table 9 reports automatic evaluation results of the question quality for the FAVIQ A dev set by BLEU 1-4 (Papineni et al., 2002) and ROUGE-L (Lin, 2004).

Example 1: • Original question: who's the starting quarterback for the buffalo bills in 2016?
• Synthetic question: what quarterback was the starting quarterback for the buffalo bills?
Example 2: • Response (REFUTES): the upper school of the minnehana academy is located at 4200 west river parkway, minneapolis, minnesota, 55406 in minneapolis.
• Original question: where is the upper school of the minnehana academy in minneapolis?
• Synthetic question: what is the upper school of the minnehana academy located at?
Example 3: • Response (REFUTES): the new independence day came out in 1996 throughout the united states on june 24, 2016.
• Original question: when does the new independence day come out in 1996 throughout the united states?
• Synthetic question: when did the new independence day come out?
Example 4: • Response (SUPPORTS): 11 players in one team can play on the field for american football.
• Original question: how many players in one team can play on the field for american football?
• Synthetic question: how many players in one team can play on the field for american football?
Example 5: • Response (SUPPORTS): melinda o. fee played jill abbott on the young and restless on 1984.
• Original question: who played jill abbott on the young and restless on 1984?
• Synthetic question: who played jill abbott on 1984?