It is AI’s Turn to Ask Humans a Question: Question-Answer Pair Generation for Children’s Story Books

Existing question answering (QA) techniques are created mainly to answer questions asked by humans. But in educational applications, teachers often need to decide what questions they should ask, in order to help students to improve their narrative understanding capabilities. We design an automated question-answer generation (QAG) system for this education scenario: given a story book at the kindergarten to eighth-grade level as input, our system can automatically generate QA pairs that are capable of testing a variety of dimensions of a student’s comprehension skills. Our proposed QAG model architecture is demonstrated using a new expert-annotated FairytaleQA dataset, which has 278 child-friendly storybooks with 10,580 QA pairs. Automatic and human evaluations show that our model outperforms state-of-the-art QAG baseline systems. On top of our QAG system, we also start to build an interactive story-telling application for the future real-world deployment in this educational scenario.


Introduction
There has been substantial progress in the development of state-of-the-art (SOTA) question-answering (QA) models in the natural language processing community in recent years (Xiong et al., 2019; Karpukhin et al., 2020; Cheng et al., 2020; Mou et al., 2021). However, the opposite of the QA task, namely the question-answer generation (QAG) task that generates questions based on input text, remains underexplored. We argue that being able to ask a reasonable question is also an important indicator of whether the reader comprehends the document, and that QAG thus belongs to the reading comprehension (RC) task family. QAG also contributes to important real-world applications, such as building automated systems that help teachers efficiently construct assessment questions (and their correct answers) for students at scale (Xu et al., 2021; Snyder et al., 2005).

* † Equal contributions from the first authors: yaob@rpi.edu, dakuo.wang@ibm.com. Work was done while Mo was at IBM Research. * Corresponding authors.

Table 1: A sample FairytaleQA story section as input and the QA pairs generated by human education experts, the 2-Step baseline model, the PAQ baseline, and our QAG system.

FairytaleQA Dataset Source (Section)
Maie sighed. she knew well that her husband was right, but she could not give up the idea of a cow. the buttermilk no longer tasted as good as usual in the coffee; ... ... they were students, on a boating excursion, and wanted to get something to eat. 'bring us a junket, good mother,' cried they to Maie. 'ah! if only i had such a thing!' sighed Maie.

Ground-Truth
• Q: What did the three young men ask for?
• A: A junket.

2-Step Baseline (Shakeri et al., 2020)
• Q: Why no more buttermilk for her husband to make?
• A: She could not give up the idea of a cow.

PAQ Baseline (Lewis et al., 2021)
• Q: What did maie think of when she thought of buttermilk?
• A: Sweet cream and fresh butter.

Our System
• Q: Why did the three young men want a junket?
• A: They wanted to get something to eat.

Similar to training QA models, QAG model training requires high-quality and large-scale RC datasets (e.g., NarrativeQA (Kočiskỳ et al., 2018)). However, many of the existing datasets are either collected via crowd-sourcing (Rajpurkar et al., 2016; Kočiskỳ et al., 2018; Reddy et al., 2019) or built with automated retrievers (Nguyen et al., 2016; Joshi et al., 2017; Dunn et al., 2017; Kwiatkowski et al., 2019), which puts the quality and validity of the labeled QA-pairs at risk. This risk becomes especially problematic when building applications in the education domain: while existing QA models may perform well in the general domain, they fall short in identifying the most useful QA pairs to generate for educational purposes. Specifically, because RC is a complex skill vital for children's achievement (Snyder et al., 2005), the datasets should contain questions that focus on a well-defined construct (e.g., narrative comprehension) and measure full coverage of the sub-skills within this construct (e.g., reasoning about causal relationships and understanding emotion within narrative comprehension) using items of varying difficulty levels (e.g., inference making and information retrieval) (Paris and Paris, 2003).
In this work, we aim to develop a QAG system that generates high-quality QA-pairs, emulating the questions a teacher or parent would ask children when reading stories to them (Xu et al., 2021). Our system is built on FairytaleQA, a novel dataset that was recently released. This dataset focuses on narrative comprehension for elementary to middle school students and contains 10,580 QA-pairs from 278 narrative text passages of classic fairytales. As reported with its release, FairytaleQA was annotated by education experts and includes well-defined and validated narrative elements laid out in education research (Paris and Paris, 2003), making it particularly appealing for RC research in the education domain.
Our QAG system is designed as a three-step pipeline: (1) extract candidate answers from the given storybook passages through carefully designed heuristics based on a pedagogical framework; (2) generate an appropriate question for each extracted answer using a state-of-the-art (SOTA) language model; and (3) rank the generated QA-pairs and keep the top ones, up to a specified maximum number of QA-pairs per section.
We compare our QAG system with two existing SOTA QAG systems: a 2-step baseline system (Shakeri et al., 2020) fine-tuned on FairytaleQA, and an end-to-end generation system trained on a large-scale automatically generated RC dataset (PAQ) (Lewis et al., 2021). We evaluate the generated QA-pairs in terms of similarity to the ground truth, measured by the Rouge-L precision score under different thresholds on the number of candidate QA-pairs, and in terms of semantic and syntactic correctness, measured by human evaluation. We demonstrate that our QAG system performs better in both automated and human evaluation. Table 1 shows a sample FairytaleQA story section as input and the QA pairs generated by human education experts, the 2-step baseline model, the PAQ baseline, and our QAG system. We conclude the paper by demonstrating an interactive story-telling application built upon our QAG system to exemplify the applicability of our system in a real-world educational setting.

General QA Datasets
A large number of datasets are available for narrative comprehension tasks. These datasets were built upon different knowledge resources and went through various QA-pair creation approaches. For instance, some focus on informational texts such as Wikipedia and website articles (Rajpurkar et al., 2016; Nguyen et al., 2016; Dunn et al., 2017; Kwiatkowski et al., 2019; Reddy et al., 2019). Prevalent QA-pair creation approaches include crowd-sourcing (Rajpurkar et al., 2016; Kočiskỳ et al., 2018; Reddy et al., 2019) and automated QA-pair retrieval (Nguyen et al., 2016; Joshi et al., 2017; Dunn et al., 2017; Kwiatkowski et al., 2019), among others. Datasets created by these approaches risk inconsistent quality and validity of QA pairs because they lack well-defined annotation protocols designed for the target audience and scenarios. Although many of these datasets contain large numbers of QA pairs, recent research (Kočiskỳ et al., 2018) found that the QA pairs in many RC datasets do not require models to understand the underlying narrative aspects. Instead, models that rely on shallow pattern matching or salience can already perform very well.
NarrativeQA (Kočiskỳ et al., 2018), for instance, is a large dataset with more than 46,000 human-generated QA-pairs based on abstractive summaries. Unlike most other RC datasets, which can be answered with shallow heuristics, the NarrativeQA dataset requires readers to integrate information about events and relations expressed throughout the story content. Indeed, NarrativeQA includes a significant number of questions that focus on narrative events and the relationships among events (Mou et al., 2021). One may expect that NarrativeQA could also be used for QAG tasks.
In fact, a couple of recent works use this dataset and train a network by combining a QG module and a QA module with a reinforcement learning approach (Tang et al., 2017). For example, Wang et al. (2017) use the QA result to reward the QG module then jointly train the two sub-systems. In addition, Nema and Khapra (2018) also explore better evaluation metrics for the QG system. However, the NarrativeQA dataset is in a different domain than the educational context of our focus. Thus the domain adaptation difficulty is unknown.

The FairytaleQA Dataset
As previously mentioned, general-purpose QA datasets (e.g., SQuAD (Rajpurkar et al., 2016), MS MARCO (Nguyen et al., 2016)) are unsuitable for the children's education context, as they impose little structure on which comprehension skills are tested and rely heavily on crowd workers who typically have limited education domain knowledge. FairytaleQA is a newly released RC dataset that aims precisely to solve those issues and to fill the lack of a high-quality dataset resource for the education domain. This dataset contains over 10,000 high-quality QA-pairs from almost 300 children's storybooks, targeting students from kindergarten to eighth grade.
As discussed with its release, FairytaleQA has two unique advantages that make it particularly useful for our project. First, FairytaleQA was developed based on an evidence-based reading comprehension framework (Paris and Paris, 2003), which comprehensively covers seven narrative elements/relations contributing to reading comprehension: character, setting, feeling, action, causal relationship, outcome resolution, and prediction (a detailed definition and an example of each aspect are given in Appendix A). Second, the development of FairytaleQA followed a rigorous protocol and was carried out by trained annotators with educational research backgrounds. This process ensured that the annotation guideline was followed, that the style of questions generated by coders was consistent, and that the answers to the questions were factually correct. FairytaleQA was reported to have high validity and reliability through a validation study involving actual students.

QAG Task
A few years back, rule-based QAG systems (Heilman and Smith, 2009; Mostow and Chen, 2009; Yao and Zhang, 2010; Lindberg et al., 2013; Labutov et al., 2015) were prevalent, but the generated QA pairs suffered from a lack of variety. Neural models for question generation tasks (Du et al., 2017; Dong et al., 2019; Scialom et al., 2019; Zhao et al., 2022) have been an emerging research theme in recent years. However, they focus on general-domain QAG and were trained only on the available general QA datasets, so it is unknown how these models perform in an education context.
In this paper, we use the recent work of Shakeri et al. (2020) as our baseline. They proposed a two-step, two-pass QAG method that first generates questions (QG), then concatenates the questions to the passage and generates the answers in a second pass (QA). In addition, we include the recently published Probably-Asked Questions (PAQ) work (Lewis et al., 2021) as a second baseline. The PAQ system is an end-to-end QAG system trained on the PAQ dataset, a very large-scale QA dataset containing 65M automatically generated QA-pairs from Wikipedia. The primary issue with deep-learning-based models in the targeted children's education application is that existing datasets and models do not consider the specific audience's language preference and the educational purposes (Hill et al., 2015; Yao et al., 2012).
Because both rule-based and neural-network-based approaches have inherent limitations, in our work we combine the two approaches to balance control over what types of QA pairs are generated, so that they better serve the educational purpose, against the diversity of the generated QA sequences.

Pre-processing FairytaleQA Dataset
The released FairytaleQA contains 10,580 QA-pairs from 278 books, and each question comes with a label indicating the narrative element(s)/relation(s) the question aims to assess.
We split the dataset into train/validation/test splits with 232/23/23 books and 8,548/1,025/1,007 QA pairs. The split is random, but the statistical distributions in each split are consistent. Table 2 shows core statistics of the FairytaleQA dataset in each split, and Figure 1 shows the distribution of the seven types of annotations for the QA pairs across the three splits.
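A book-level random split with the sizes reported above could be sketched as follows. This is only an illustration: the function name `split_books`, the seed, and the placeholder book ids are assumptions, and the sketch does not reproduce whatever check was used to keep the per-split distributions consistent.

```python
import random

def split_books(book_ids, n_valid=23, n_test=23, seed=0):
    """Randomly partition book ids into train/validation/test lists.

    Splitting by book (not by QA pair) keeps all questions about one
    story inside a single split.
    """
    ids = sorted(book_ids)       # fixed starting order for reproducibility
    rng = random.Random(seed)    # local RNG, does not touch global state
    rng.shuffle(ids)
    test = ids[:n_test]
    valid = ids[n_test:n_test + n_valid]
    train = ids[n_test + n_valid:]
    return train, valid, test

# Placeholder ids standing in for the 278 storybooks.
books = [f"book_{i:03d}" for i in range(278)]
train, valid, test = split_books(books)
print(len(train), len(valid), len(test))  # 232 23 23
```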

Question Answer Generation System Architecture
There are three sub-modules in our QA generation (QAG) pipeline: a heuristics-based answer generation (AG) module, followed by a BART-based (Lewis et al., 2019) question generation (QG) module fine-tuned on the FairytaleQA dataset, and a DistilBERT-based (Sanh et al., 2019) ranking module fine-tuned on the FairytaleQA dataset to rank and select the top N QA-pairs for each input section. The complete QAG pipeline of our system is shown in Figure 2.
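The control flow of the three modules can be sketched as below. The three callables are hypothetical stand-ins for the AG heuristics, the fine-tuned BART model, and the fine-tuned DistilBERT ranker; their names and signatures are assumptions made for illustration, not the system's actual interfaces.

```python
from typing import Callable, List, Tuple

QAPair = Tuple[str, str]  # (question, answer)

def generate_qa_pairs(
    section: str,
    extract_answers: Callable[[str], List[str]],
    generate_question: Callable[[str, str], str],
    score_pair: Callable[[str, str, str], float],
    top_n: int = 3,
) -> List[QAPair]:
    """Run AG -> QG -> ranking on one story section and keep top_n pairs."""
    candidates: List[QAPair] = []
    for answer in extract_answers(section):            # step 1: AG
        question = generate_question(section, answer)  # step 2: QG
        candidates.append((question, answer))
    # step 3: rank by the scorer's output, highest first
    ranked = sorted(
        candidates,
        key=lambda qa: score_pair(section, qa[0], qa[1]),
        reverse=True,
    )
    return ranked[:top_n]

# Toy stand-ins so the sketch runs without any trained model.
section = "Maie sighed. She wanted a cow."
toy_extract = lambda s: ["Maie", "a cow"]
toy_qg = lambda s, a: f"What or who is '{a}' in the story?"
toy_score = lambda s, q, a: float(len(a))

pairs = generate_qa_pairs(section, toy_extract, toy_qg, toy_score, top_n=1)
print(pairs)  # [("What or who is 'a cow' in the story?", 'a cow')]
```

Passing the modules in as callables keeps the sketch independent of any particular model library.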

Heuristics-based AG Module
Based on our observation of the FairytaleQA dataset, educational domain experts seem to have uniform preferences for certain types of question and answer pairs (Figure 1). This may be because these experts take young children's learning objectives into consideration: children's learning should be oriented toward specific types of answers to maximize their learning outcomes. That is why educational experts rarely ask yes/no questions when developing or assessing children's reading comprehension. An automated QAG system can mimic this behavior either by defining heuristic rules for the answer extraction module or by leaving the filtering step to the end, after the QA pairs are generated. However, the latter approach carries the inherent risk that the training data influences the types of answers generated. We therefore decided to develop heuristic rules and apply them in the answer extraction module. We observed that some narrative elements, such as characters, setting, and feelings, are mostly made up of named entities and noun chunks, for instance, the name of a character in a story, a particular place where the story takes place, or a specific emotional feeling. We thus leverage the spaCy English model for part-of-speech tagging on the input content to extract named entities and noun chunks as candidate answers that cover these three types of narrative elements.
We further observed that the QA pairs created by education experts in the action, causal relationship, prediction, and outcome resolution categories are all related to a particular action event in the story. Thus, the answers to these four types of questions are generally descriptions of the action event. We find that the PropBank semantic role labeling toolkit (Palmer et al., 2005) is useful for extracting the action itself and the event description related to the action. We leverage this toolkit to extract the trigger verb as well as other dependency nodes in the text content that can be put together as a combination of subject, verb, and object, and use these combinations as candidate answers for the latter four categories.
Our answer extraction module can generate candidate answers that cover all 7 narrative elements with the carefully designed heuristics.
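The subject-verb-object assembly for action-related answers might look like the sketch below. It assumes the semantic role labeler's output has already been converted to one dictionary per predicate, keyed by PropBank-style role labels (`ARG0`, `V`, `ARG1`); that input format and the function name `frames_to_candidates` are hypothetical, and the actual heuristics are likely more elaborate.

```python
from typing import Dict, List

def frames_to_candidates(frames: List[Dict[str, str]]) -> List[str]:
    """Join subject (ARG0), verb (V), and object (ARG1) spans into answers."""
    candidates: List[str] = []
    for frame in frames:
        if "V" not in frame:
            continue  # no trigger verb, so there is no action event
        parts = [frame.get(role, "") for role in ("ARG0", "V", "ARG1")]
        text = " ".join(p for p in parts if p)
        if text not in candidates:   # drop exact duplicates, keep order
            candidates.append(text)
    return candidates

# Hand-written frames in the assumed format.
frames = [
    {"ARG0": "the fairy", "V": "stopped", "ARG1": "her ears"},
    {"V": "sighed", "ARG0": "Maie"},
    {"ARG1": "a cow"},  # skipped: no verb
]
action_candidates = frames_to_candidates(frames)
print(action_candidates)  # ['the fairy stopped her ears', 'Maie sighed']
```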

BART-based QG Module
Following the answer extraction module, which yields candidate answers, we design a QG module that takes a story passage and an answer as input and generates the corresponding question as output.
The QG task is essentially a reversed QA task. Such a QG model could be either transfer-learned from another large QA dataset or fine-tuned on our FairytaleQA dataset. Mainstream QA datasets do cover various types of questions in order to comprehensively evaluate a QA model's reading comprehension ability; for instance, NarrativeQA (Kočiskỳ et al., 2018) is a large-scale QA corpus with questions that examine high-level abstractions to test the model's narrative understanding. We choose the NarrativeQA dataset as an alternative option for fine-tuning our QG model because it requires human annotators to provide a diverse set of questions about characters, events, etc., which is similar to the types of questions that education experts created for our FairytaleQA dataset. In addition, we leverage BART (Lewis et al., 2019) as the backbone model because of its superior performance on NarrativeQA according to the study by Mou et al. (2021).
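To make the input/output framing concrete, one QG training example could be built as in the sketch below. The separator string, the answer-before-passage order, and the function name `build_qg_example` are assumptions for illustration; the text above does not specify how the answer and passage are joined.

```python
from typing import Tuple

def build_qg_example(section: str, question: str, answer: str,
                     sep: str = " </s> ") -> Tuple[str, str]:
    """Return (source, target) strings for a sequence-to-sequence QG model.

    source: the answer joined to the story section (assumed order)
    target: the question the model should learn to generate
    """
    source = answer.strip() + sep + section.strip()
    target = question.strip()
    return source, target

src, tgt = build_qg_example(
    section="they were students, on a boating excursion, "
            "and wanted to get something to eat.",
    question="Why did the three young men want a junket?",
    answer="They wanted to get something to eat.",
)
print(src)
print(tgt)
```

Reversing the roles (passage and question as source, answer as target) gives the corresponding QA example, which is the sense in which QG is a reversed QA task.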
We perform a QG task comparison to examine the quality of questions generated for the FairytaleQA dataset by one model fine-tuned on NarrativeQA, one fine-tuned on FairytaleQA, and one fine-tuned on both NarrativeQA and FairytaleQA. We fine-tune each model with different parameters and keep the one with the best performance on the validation and test splits of the FairytaleQA dataset. Results are shown in Table 3. We notice that the model fine-tuned on FairytaleQA alone outperforms the other settings. We attribute this to the domain and distribution differences between the two datasets, which may explain why the model fine-tuned on both NarrativeQA and FairytaleQA is degraded by the NarrativeQA training. The best-performing model is selected for the QG module in our QAG pipeline.

DistilBERT-based Ranking Module
The first two modules of our QAG system generate hundreds of candidate QA-pairs. However, we do not yet know the quality of these generated QA-pairs, and it is unrealistic to send all the candidate QA-pairs back to users in a real-world scenario. Consequently, we add a ranking module to rank and select the top candidate QA-pairs, where the user can determine the upper limit on the number of generated QA-pairs for each input text. Here, the ranking task can be viewed as a classification task between the ground-truth QA-pairs created by education experts and the QA-pairs generated by our system.
We put together the QA-pairs generated by the first two modules of our QAG system and the ground-truth QA-pairs from the train/validation/test splits of the FairytaleQA dataset to form new splits for the ranking model, and fine-tune a pre-trained DistilBERT model on them. We test different input settings for the ranking module, including the concatenation of text content and answer only, as well as the concatenation of text content, question, and answer in various orders. Both kinds of input settings achieve over 80% accuracy on the test split, while the concatenation of text content, question, and answer achieves F1 = 86.7%, leading the other settings by more than 5%. We therefore use this best-performing ranking model in our QAG system and allow users to determine the number N of top generated QA-pairs to be output.
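The construction of the ranking data can be sketched as a binary labeling step. The sketch assumes the best-performing input setting (text content, question, answer), a plain `[SEP]` string as the joiner, and the label convention 1 = expert-written, 0 = system-generated; these details and the function name `build_ranking_examples` are illustrative assumptions.

```python
from typing import List, Tuple

QAPair = Tuple[str, str]  # (question, answer)

def build_ranking_examples(
    section: str,
    gold_pairs: List[QAPair],
    generated_pairs: List[QAPair],
    sep: str = " [SEP] ",
) -> List[Tuple[str, int]]:
    """Return (text, label) pairs for training a binary ranker."""
    examples: List[Tuple[str, int]] = []
    for label, pairs in ((1, gold_pairs), (0, generated_pairs)):
        for question, answer in pairs:
            # One flat string: section, then question, then answer.
            text = sep.join([section, question, answer])
            examples.append((text, label))
    return examples

gold = [("What did the three young men ask for?", "A junket.")]
generated = [("Why did the three young men want a junket?",
              "They wanted to get something to eat.")]
ranking_examples = build_ranking_examples("(story section)", gold, generated)
print([label for _, label in ranking_examples])  # [1, 0]
```

At inference time, a classifier trained on such examples could score each candidate by its predicted probability of the expert-written class, and the top N candidates would be returned.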

Evaluation
We conduct both automated and human evaluation for the QAG task. The input of the QAG task is a section of the story (which may contain multiple paragraphs), and the outputs are the generated QA pairs. Unlike QA or QG tasks, in which each input corresponds to a single generated output regardless of the model used, the QAG task does not have a fixed number of QA-pairs to generate for each section. Moreover, different QAG systems generate different numbers of QA-pairs for the same input content. Therefore, we carefully define an evaluation metric that can examine the quality of the generated QA-pairs across different numbers of candidate QA-pairs. The comparison is conducted on the validation and test splits of FairytaleQA.

Baseline QAG Systems
We select a SOTA QAG system that uses a two-step generation approach (Shakeri et al., 2020) as one baseline system (referred to as the 2-Step Baseline).
In the first step, it feeds the story content to a QG model to generate questions; then, in the second pass, it concatenates each question to the content passage and generates a corresponding answer through a QA model. The quality of the generated questions depends on the quality of the training data for the QG and QA models, and the outputs are not guaranteed to be semantically or syntactically correct because of the nature of neural models. We replicate this work by fine-tuning a QG model and a QA model on the FairytaleQA dataset with the same procedures that we use to select the best model for our QG module. We use pre-trained BART as the backbone model, as in our system, to ensure that different model architectures do not influence the evaluation results. Unlike our QG module, which takes both an answer and text content as input, their QG model takes only the text content as input. Thus, we are not able to evaluate the QG model on its own for this baseline. We reuse the fine-tuning parameters of our QG module to fine-tune the baseline QG model. To select the QA model used in the 2-Step Baseline, similar to the QG experiments presented in Table 3, we fine-tune a pre-trained BART in each of three settings: NarrativeQA only, FairytaleQA only, and both datasets. According to Table 4, the model fine-tuned on both the NarrativeQA and FairytaleQA datasets performs much better than the other settings and outperforms the model fine-tuned on FairytaleQA only by at least 6%. We use the best-performing QA model for the 2-Step Baseline system.
In addition, we include the recently published Probably-Asked Questions (PAQ) work as a second baseline system (Lewis et al., 2021). The PAQ dataset is a semi-structured, very large-scale knowledge base of 65M QA-pairs. The PAQ system is an end-to-end QA-pair generation system made up of four modules: Passage Scoring, Answer Extraction, Question Generation, and Filtering Generated QA-pairs. The PAQ system is trained on the PAQ dataset. It is worth pointing out that during the end-to-end generation process, its filtering module requires loading the complete PAQ corpus into memory for passage retrieval, which led to an out-of-memory issue for us even with more than 50G of RAM. In comparison, our QAG system requires less than half as much RAM in the fine-tuning process. In Table 1, we show a sample FairytaleQA story section as input and the QA pairs generated by human education experts, the 2-Step Baseline model, the PAQ baseline, and our QAG system. A few more examples are provided in Appendix C.

Evaluation Metrics
Since the goal of QAG is to generate QA-pairs that are most similar to the ground-truth QA-pairs given the same text content, we concatenate the question and answer to calculate the Rouge-L precision score for every single QA-pair evaluation. However, different systems generate different numbers of QA-pairs, so it is unfair and inappropriate to directly compare all the generated QA-pairs from different systems. Moreover, we would like to see how QAG systems perform with different thresholds on the number of candidate QA-pairs. In other words, we are looking for a ranking metric that, given an upper bound N on the number of QA-pairs that can be generated per section, measures how similar the generated QA-pairs are to the ground-truth QA-pairs. Three ranking metrics are commonly used: Mean Reciprocal Rank (MRR), Mean Average Precision (MAP), and Normalized Discounted Cumulative Gain (NDCG). Because MRR only evaluates the single best item in the candidate list and NDCG requires complete rank ratings for each item, neither metric is appropriate in our case. As a result, we decide to use MAP@N, where N ∈ {1, 3, 5, 10}, as our evaluation metric for the QAG task. Furthermore, since the average number of ground-truth answers is close to 3 per section in the FairytaleQA dataset (Table 2), we expect MAP@3 to be the closest to the actual use case, and we report all four values of N to describe the comparison results and trends for QAG systems on FairytaleQA.
The detailed evaluation process for MAP@N is as follows: for each ground-truth QA-pair, we find the highest Rouge-L precision score on the concatenation of the generated question and answer among the top N generated QA-pairs from the same story section. Then we average over all ground-truth QA-pairs to get the MAP@N score. This metric evaluates a QAG system's performance at different candidate levels and can be computed even if there is no ranking module in the system. For our QAG system, we just need to take the top N QA-pairs from our ranking module; for the 2-Step Baseline and the PAQ baseline system, we simply adjust a topN parameter in the configuration.
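The MAP@N procedure described above can be sketched in a few lines. The sketch assumes lowercased whitespace tokenization and computes Rouge-L precision as the longest-common-subsequence length divided by the length of the generated text; the exact Rouge implementation and tokenization used for the reported numbers are not specified here, so scores from this sketch may differ.

```python
from typing import List, Tuple

QAPair = Tuple[str, str]  # (question, answer)

def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_precision(candidate: str, reference: str) -> float:
    """LCS length divided by the number of candidate tokens."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    return lcs_length(cand, ref) / len(cand) if cand else 0.0

def map_at_n(sections: List[Tuple[List[QAPair], List[QAPair]]], n: int) -> float:
    """sections: (ground-truth pairs, ranked generated pairs) per story section."""
    best_scores = []
    for gold_pairs, ranked_generated in sections:
        top = ranked_generated[:n]
        for gq, ga in gold_pairs:
            reference = f"{gq} {ga}"
            # Best match for this ground-truth pair among the top-n candidates.
            best = max((rouge_l_precision(f"{q} {a}", reference) for q, a in top),
                       default=0.0)
            best_scores.append(best)
    return sum(best_scores) / len(best_scores) if best_scores else 0.0

gold = [("who wandered away", "a scholar")]
generated = [("what happened", "knights came"), ("who wandered away", "a scholar")]
print(map_at_n([(gold, generated)], n=1))  # 0.0
print(map_at_n([(gold, generated)], n=2))  # 1.0
```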

Evaluation Results
Table 5 presents the evaluation results of our system and the two SOTA baseline systems in terms of MAP@N, N ∈ {1, 3, 5, 10}. We observe that our system outperforms both the 2-Step Baseline system and the PAQ system in all settings, with significantly better Rouge-L precision on both the validation and test splits of the FairytaleQA dataset. According to the evaluation results, the 2-Step Baseline system suffers from the inherent lack of quality control of neural models over both the generated answers and the generated questions. We notice that the ranking module is an essential component of our QAG system for locating the best candidate QA-pairs across different limits on the number of candidate QA-pairs. The more candidate QA-pairs are allowed to be selected for each section, the better our system performs compared to the two baseline systems. Still, the Rouge-L score cannot evaluate the syntactic and semantic quality of the generated QA-pairs. As a result, we further conduct a human evaluation to provide qualitative interpretations.

Human Evaluation of QA Generation
We recruited five human participants (N = 5) to further evaluate the quality of the QA pairs generated by our model against the ground truth and the baseline (only the PAQ system, as it outperforms the 2-Step Baseline).
In each trial, participants read a storybook section and multiple candidate QA pairs for the same section: three generated by the baseline PAQ system, three generated by our system (top-3), and the others taken from the ground truth. Participants did not know which source each QA pair came from. The participants were asked to rate the QA pairs along three dimensions using a five-point Likert scale.
• Readability: The generated QA pair is in readable English grammar and words.
• Question Relevancy: The generated question is relevant to the storybook section.
• Answer Relevancy: The generated answer is relevant to the question.
We first randomly selected 7 books and then randomly selected 10 sections out of these 7 books (70 QA pairs). Each participant was asked to rate these same 70 QA pairs to establish coding consistency. The intercoder reliability scores (Krippendorff's alpha (Krippendorff, 2011)) among the five participants along the three dimensions are between 0.73 and 0.79, which indicates an acceptable level of consistency.
Then, we randomly selected 10 books (5 from test and 5 from validation splits), and for each book, we randomly selected 4 sections. Each section, on average, has 9 QA-pairs (3 from each model). We assigned each section randomly to two coders. In sum, each coder coded 4 books (i.e. 16 sections and roughly 140 QA-pairs), and in total 722 QA-pairs were rated.
We conducted t-tests to compare each model's performance. All results show that our model has above-average (>3) ratings, which suggests that it reaches an acceptable level of user satisfaction along all three dimensions.

Question Answer Generation in an Interactive Storytelling Application
To exemplify the real-world application of our QAG system, we developed an interactive storytelling application built upon it. This application is designed to facilitate the language and cognitive development of pre-school children via interactive QA activities during a storybook reading session. For example, as children move on to a new storybook page, the back-end QAG system generates questions for the current section. Furthermore, to optimize child engagement in the QA session, the QAG system also generates follow-up questions for each answered question, as shown in Figure 3. A conversational chatbot interacts with children, reads the story, and facilitates questioning and answering via speech. The system can also keep track of the child's performance for the parents.
A preliminary user study with 12 pairs of parents and children between the ages of 3 and 8 suggests that this application, powered by our QAG system, can successfully maintain engaging conversations with children about the story content. In addition, both parents and children found the system useful, enjoyable, and easy to use. Further evaluation and deployment details of this interactive storytelling system are reported separately.

Conclusion and Future Work
In this work, we explore the question-answer pair generation task (QAG) in an education context for young children. Leveraging a newly-constructed expert-annotated QA dataset built upon child-oriented fairytale storybooks (FairytaleQA), we implemented a QA-pair generation pipeline which, as observed in human and automated evaluation, effectively supports our objective of automatically generating high-quality questions and answers at scale. To examine the model's applicability in the real world, we further built an interactive conversational storybook reading system that can surface the QAG results to children via speech-based interaction.
Our work lays a solid foundation for the promising future of using AI to automate educational question answering tasks. In the future, we plan to recruit educational experts to evaluate the educational efficacy of the QA-pairs as an additional evaluation dimension. Another future direction is to develop a context-aware multi-turn QAG system grounded in the story narratives (similar to Li et al. (2021)), where the generation of a new turn of QA is conditioned on previous generations as well as the book, so that it can enable new automated dialogue systems in the education setting.

Table 7 shows the detailed definition and an example for each of the 7 narrative elements in the FairytaleQA dataset.
Table 9 shows two more examples of FairytaleQA story sections as input and the QA pairs generated by human education experts, the 2-Step Baseline model, the PAQ baseline, and our QAG system.
Figure 4 is a screenshot of the interface of the interactive storytelling system StoryBuddy, the downstream application of our QAG system in a real-world use scenario. Children can listen to the automatic story reading and try to answer the plot-relevant questions generated by the QAG system. They can answer a question via a microphone, and the system will judge the correctness of their answer. After answering a 'parent' question, children can go on to answer a follow-up question or try out other 'parent' questions.

E Fine-tuning Parameters
For fine-tuning the QA model for the 2-Step Baseline, we select the best-performing model with the following hyper-parameters: learning rate = 5e-6; batch size = 1; epoch = 1. For fine-tuning the QG model for our QAG system, we select the best-performing model with the following hyper-parameters: learning rate = 5e-6; batch size = 1; epoch = 3.
Then they passed through the dark cavern of horrors, when she'd have heard the most horrible yells, only that the fairy stopped her ears with wax. she saw frightful things, with blue vapours round them, and felt the sharp rocks and the slimy backs of frogs and snakes. when they got out of the cavern, they were at the mountain of glass; and then the fairy made her slippers so sticky with a tap of her rod that she followed the young corpse quite easily to the top. there was the deep sea a quarter of a mile under them, and so the corpse said to her, "go home to my mother, and tell her how far you came to do her bidding. farewell!" he sprung head-foremost down into the sea, and after him she plunged, without stopping a moment to think about it.

Ground-Truth
• Q: What did the fairy do to the youngest on the mountain of glass?
• A: Made her slippers so sticky with a tap of her rod.
• A: Go home to my mother, and tell her how far you came.
PAQ Baseline (Lewis et al., 2021)
• Q: What did the fairy stop her ears with?
• A: Wax.
Our System • Q: What did the youngest princess see when she entered into the dark cavern of horrors?
• A: She saw frightful things , with blue vapours round them.
Once upon a time there was a scholar, who wandered away from his home and went to emmet village. there stood a house which was said to be haunted. yet it was beautifully situated and surrounded by a lovely garden. so the scholar hired it. one evening he was sitting over his books, when several hundred knights suddenly came galloping into the room. they were quite tiny, and their horses were about the size of flies. they had hunting falcons and dogs about as large as gnats and fleas. they came to his bed in the corner of the room, and there they held a great hunt, with bows and arrows: one could see it all quite plainly. they caught a tremendous quantity of birds and game, and all this game was no larger than little grains of rice.

Ground-Truth
• Q: Who wandered away from his home and went to emmet village ?
• A: A scholar.

2-Step Baseline (Shakeri et al., 2020)
• Q: What happened one evening?
• A: Several hundred knights suddenly came galloping into the room .
PAQ Baseline (Lewis et al., 2021) • Q: Where did the scholar go when he wandered away from home?
Our System • Q: Who wandered away from his home and went to emmet village?
• A: A scholar.