Question Generation for Adaptive Education

Intelligent and adaptive online education systems aim to make high-quality education available for a diverse range of students. However, existing systems usually depend on a pool of hand-made questions, limiting how fine-grained and open-ended they can be in adapting to individual students. We explore targeted question generation as a controllable sequence generation task. We first show how to fine-tune pre-trained language models for deep knowledge tracing (LM-KT). This model accurately predicts the probability of a student answering a question correctly, and generalizes to questions not seen in training. We then use LM-KT to specify the objective and data for training a model to generate questions conditioned on the student and target difficulty. Our results show we succeed at generating novel, well-calibrated language translation questions for second language learners from a real online education platform.


Introduction
Online education platforms can increase the accessibility of educational resources around the world. However, achieving equitable outcomes across diverse learning needs benefits from systems that are adaptive and individualized to each student (Doroudi and Brunskill, 2019). Traditionally, adaptive education methods involve planning over a pool of pre-made questions (Atkinson, 1972;Hunziker et al., 2018). These are naturally limited by the diversity and coverage of the pool, as well as the scaling capacity of curriculum planning algorithms. Recent approaches, such as procedural generation for personalized programming games (Valls-Vargas et al., 2017), are limited to well-specified small domains. We address these limitations by leveraging recent success in deep generative models, in particular language models (LMs).
Many educational activities involve sequential data, such as language translation, reading comprehension, algebra, and deductive logic. Meanwhile, pre-trained LMs can effectively handle sequences from a wide range of modalities (Madani et al., 2020; Polu and Sutskever, 2020). In this work, we focus on natural language sequences, where recent progress in language modeling has shown great success at capturing abstract properties of language (Hewitt and Manning, 2019; Liu et al., 2019). Specifically, we show how pre-trained LMs can be easily leveraged to adaptively generate questions for a given student and target difficulty in a reverse translation task, using difficulty at answering questions as a proxy for more complex future learning objectives.

Figure 1: Example input and outputs for our LM-based knowledge tracing model (middle) and question generation model (bottom) for an online reverse language translation task (top). A question in this task consists of a target phrase for the student, in this case a Spanish learner, to translate (e.g. "the woman").
We introduce an LM-based knowledge tracing model (LM-KT) to predict students' difficulty on novel questions (e.g. target phrases to translate). We show that LM-KT is well-calibrated, allowing us to pose the learning problem for the question generator: given a student state, generate a question that will achieve a target difficulty, according to LM-KT. We evaluate both LM-KT and question generation models on real users and responses from Duolingo, a popular online second-language learning platform.

Background & Related Work
There exists a rich body of work on precisely modeling student "ability" and learning. For example, Item Response Theory (IRT) seeks to model individual student ability based on their responses to different questions, creating a strong factorization between students and test items (Lord, 1980; Hambleton and Jodoin, 2003). Meanwhile, Computer Adaptive Testing (CAT) techniques are used to determine a fixed student ability as quickly as possible by selecting test items based on information utility (Weiss and Kingsbury, 1984; Thissen and Mislevy, 2000; Settles et al., 2020). However, these methods, which have been used to develop efficient standardized tests, do not necessarily optimize a student's learning experience (Mu et al., 2018). We instead focus on tracking each student's evolving knowledge, choosing questions to target difficulty.
Knowledge Tracing (KT) seeks to model a student's knowledge state from their answer history in order to help individualize exercise sequences (Corbett and Anderson, 1995). This draws inspiration from traditional education curriculum practices, such as distributed spacing of vocabulary (Bloom and Shuell, 1981) and mixed review in mathematics (Rohrer, 2009). To address simplifying assumptions in earlier KT approaches, such as discrete knowledge representations, Piech et al. (2015) introduced Deep Knowledge Tracing (DKT), which uses RNNs to enable more complex knowledge representations for students. Recently, SAINT+ (Shin et al., 2020) showed state-of-the-art performance on the popular EdNet KT task using a Transformer model to capture temporal information across activities, motivating our use of Transformer LMs.
Controllable Text Generation aims to steer LMs towards desired attributes. Examples include using reinforcement learning to control quality metrics (Ranzato et al., 2016), adjusting sampling weights to control for poetry style (Ghazvininejad et al., 2017), and learning to condition on valence or domain-specific codes (Keskar et al., 2019; Peng et al., 2018). To the best of our knowledge, we are the first to apply controllable text generation to an adaptive education task.

Method
Given any autoregressive language model (e.g. GPT-2 (Radford et al., 2019)), we can fine-tune an LM-KT model (p_θKT) to predict whether an individual student will correctly answer the next question. If this model has well-calibrated uncertainty, we can use its predicted probability of a correct answer as a proxy for the difficulty of a question for a student. We then train a question generation model (p_θQG) to generate a new question conditioned on a student and a desired target difficulty.
Question Representation Unlike standard DKT, which treats questions as IDs or simple handcrafted features, we represent questions fully in text (e.g. "she eats" in Figure 1). This is a key contribution of our work, required by our eventual goal of generating questions in text, and it allows the model to leverage similarity across linguistic features. We thus represent a question q as a sequence of words, with prefix and suffix tokens:

q = <QS> w_1 ... w_n <QE>

We represent a student as a temporally-evolving sequence of questions and their responses. As in much previous KT work, we represent the student response as simply correct/incorrect, with special tokens <Y> and <N>. A student's current state is thus represented as a sequence of all past question and response pairs:

s = q_{j_1} a_{j_1} ... q_{j_m} a_{j_m},   a_i ∈ {<Y>, <N>}

LM-KT Given the sequential nature of student learning over time, we can easily frame knowledge tracing as an autoregressive language modeling task. Given a dataset D of students s_1, s_2, ..., s_|D|, we employ the standard training objective of finding the parameters θ_KT that minimize

L_KT = - Σ_j Σ_i log p_θKT(x_i^(j) | x_1^(j), ..., x_{i-1}^(j)),

where x^(j) = (x_1^(j), ..., x_|x|^(j)) is the entire sequence of tokens corresponding to student s_j, consisting of all their past questions and answers. Using the softmax output of the LM-KT model p_θKT, we estimate a student's (inverse) difficulty in answering a specific question as d_qs = p_θKT(<Y> | s, q). We find that p_θKT is well-calibrated (Section 4.2), yielding a good proxy for the true question difficulty.
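As a concrete illustration, the question and student-state encodings described above can be sketched in plain Python. The special tokens <QS>, <QE>, <Y>, and <N> follow the paper; the helper function names are our own:

```python
# Special tokens delimiting questions and encoding responses (as in the paper).
QS, QE, YES, NO = "<QS>", "<QE>", "<Y>", "<N>"

def encode_question(text):
    """Wrap a question's words in prefix/suffix tokens: <QS> w_1 ... w_n <QE>."""
    return [QS] + text.split() + [QE]

def encode_student(history):
    """Flatten a student's (question text, answered correctly?) history."""
    tokens = []
    for text, correct in history:
        tokens += encode_question(text) + [YES if correct else NO]
    return tokens

def lm_kt_input(history, next_question):
    """Sequence fed to LM-KT, which then predicts <Y> vs. <N> as the next token."""
    return encode_student(history) + encode_question(next_question)

history = [("the woman", True), ("she eats", False)]
print(lm_kt_input(history, "i read"))
```

In practice each word is further split by the GPT-2 subword tokenizer; the sketch stops at whole words for clarity.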

Question Generation
We frame question generation as fine-tuning a new autoregressive LM. Given random samples of students and questions from a held-out set not used to train LM-KT, we construct a new dataset D' consisting of sequences s_i d_i <G> q_i, where <G> is a special generation token and d_i = p_θKT(<Y> | s_i, q_i) is the continuous difficulty value assigned by LM-KT. We learn a linear layer that maps the continuous input difficulty to a difficulty control vector c_d whose dimension matches the LM word embeddings, and append c_d to the token embeddings. Unlike LM-KT, we train our question generation model p_θQG to minimize the loss only on the question text, which appears after the <G> token. If t_g is the token index of <G>, then our modified loss is

L_QG = - Σ_j Σ_{i > t_g} log p_θQG(x_i^(j) | x_1^(j), ..., x_{i-1}^(j)),

where the sequence x^(j) contains the full s_j d_j <G> q_j sequence. At test time, we generate tokens w_1 ... w_n conditioned on the prefix s_j d_j <G>.
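The masked objective can be illustrated with toy numbers: only positions after the <G> token contribute to the loss. A minimal sketch with our own function name, using made-up per-token log-probabilities rather than real model outputs:

```python
def question_generation_loss(log_probs, tokens, gen_token="<G>"):
    """Negative log-likelihood summed only over tokens after <G>.

    log_probs[i] is the model's log-probability of tokens[i] given the prefix;
    everything up to and including <G> (student state and difficulty) is masked out.
    """
    t_g = tokens.index(gen_token)
    return -sum(lp for i, lp in enumerate(log_probs) if i > t_g)

tokens = ["<QS>", "she", "eats", "<QE>", "<Y>", "<G>", "the", "woman"]
log_probs = [-0.5, -1.0, -0.7, -0.2, -0.3, -0.1, -0.4, -0.6]
print(question_generation_loss(log_probs, tokens))  # only the last two terms count
```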

Experiments
Our method generalizes to any education activity that can be represented with text sequences. Due to the availability of real student learning data, we focus on a reverse language translation task, where a student translates phrases from their native language (e.g. English, "she eats") to the second language they are learning (e.g. Spanish, "ella come").

Experimental Details
We use the 2018 Duolingo Shared Task on Second Language Acquisition Modeling (Settles et al., 2018) dataset, which contains questions and responses for Duolingo users over the first 30 days of learning a second language. While the original task's goal was to identify token-level mistakes, we collapse these errors into binary (correct / incorrect) per-question labels. We use the provided train/dev/test splits for users learning Spanish and French, and create separate held-out sets from the test set to evaluate the LM-KT and question generation models. For each of the two models, we finetune a separate GPT-2 (Radford et al., 2019) model. While we sample from a held-out set of student states and questions to train the question generation model, in principle questions can come from any source text domain. Further experiment details are in the Appendix, and source code can be found at: https://github.com/meghabyte/acl2021-education.

Results: Student Modeling
We evaluate LM-KT in two ways: first, by its ability to predict whether an individual student will answer a novel question correctly on a held-out test set of real Duolingo student responses; second, by how well-calibrated these predictions are, which is crucial to our later use of LM-KT for question generation. Table 1 compares AUC-ROC on a held-out test set for our LM-KT model with standard DKT, which uses question IDs instead of text, and with a baseline that ignores the student state and uses only the question text representation. This question-only baseline would perform well if the Duolingo dataset largely consisted of universally "easy" and "difficult" questions, independent of the individual student. Our results show that incorporating the student state is crucial for accurately predicting Duolingo user responses, and that including question text also leads to a significant improvement. LM-KT outperforms standard DKT especially on novel questions, a necessary generalization ability for question generation.
Finally, we measure the calibration of our LM-KT models for both Spanish and French (from English) learners, which is the crucial property for our downstream generation task. We bin our test data by predicted question difficulty and plot the fraction of true correct answers in each bin. Figure 2 shows that LM-KT is well-calibrated for both Spanish and French, meaning the predicted difficulty matches the empirically observed proportion of correct answers.
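The binning procedure amounts to comparing each bin's mean predicted difficulty with its empirical accuracy. A minimal sketch, with our own function name and an equal-width binning scheme assumed:

```python
def calibration_curve(predicted, correct, n_bins=10):
    """For each occupied bin of predicted P(<Y>), return (mean prediction, empirical accuracy)."""
    bins = [[] for _ in range(n_bins)]
    for p, c in zip(predicted, correct):
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into the top bin
        bins[idx].append((p, c))
    curve = []
    for b in bins:
        if b:
            mean_p = sum(p for p, _ in b) / len(b)
            acc = sum(c for _, c in b) / len(b)
            curve.append((mean_p, acc))
    return curve

# Perfectly calibrated toy data: 80% of the p=0.8 questions answered correctly.
preds = [0.8] * 10
labels = [1] * 8 + [0] * 2
print(calibration_curve(preds, labels))
```

A model is well-calibrated when the two coordinates of each point roughly coincide, i.e. the curve hugs the diagonal.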

Results: Question Generation
We evaluate four different aspects of our question generation model: (i) successful control for difficulty, (ii) novelty, (iii) fluency, and (iv) latency.
Difficulty Control To explore whether our question generation model indeed depends on target difficulty and the individual student, we first measure the model's perplexity on a held-out test set of Duolingo questions, compared to permutation baselines. Table 2 (top) shows that perplexity is lower for true student / target difficulty inputs than when either or both of these are permuted. The target difficulty values in this analysis were defined by the LM-KT model. We can remove this dependence by using the actual student responses from Duolingo: we set the target difficulty to 1 if the student was correct and 0 otherwise. Table 2 (bottom) shows our model prefers questions paired with these "true correctness" targets over questions paired with random ones.
To evaluate how well our generation model achieves target difficulties, we take 15 unseen students and generate 30 questions for each of 9 input difficulties (0.1-0.9). We then use LM-KT (a well-calibrated proxy for true difficulty) to measure the difficulty of these generated questions for each student. Figure 3 shows that we are able to achieve fine-grained control over target difficulty for both Spanish and French students, with an average root-mean-squared error (RMSE) of 0.052 across all students and target difficulties. Adding a sampling penalty (Keskar et al., 2019) increases the variance in difficulty (RMSE 0.062) in exchange for more novel and diverse questions, as discussed next.
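For reference, the RMSE reported here is the root-mean-squared gap between each target difficulty and the LM-KT-measured difficulty of the corresponding generated question; a toy computation (helper name and values are ours):

```python
import math

def rmse(targets, measured):
    """Root-mean-squared error between target and achieved difficulties."""
    return math.sqrt(sum((t - m) ** 2 for t, m in zip(targets, measured)) / len(targets))

# Toy example: three target difficulties vs. LM-KT-measured difficulties.
print(rmse([0.1, 0.5, 0.9], [0.15, 0.45, 0.9]))
```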
Novelty and Fluency By leveraging a pre-trained language model's ability to manipulate structure, we can generate novel questions not present in the entire Duolingo question set (see Table 3). Across 4,050 questions generated for Spanish learners, we found that adding a repetition penalty (Keskar et al., 2019) yields far more novel questions, at some cost to fluency (see Appendix A.4).

Latency Positive student experience in online education requires low latency. In about four seconds, our model can generate 30 questions close to a target difficulty. An alternative to question generation is to rank questions from a pre-existing pool according to a target difficulty objective. We compare the quality (RMSE in achieving target difficulty) of the top 30 questions in a pool against the run-time required to rank all questions in the pool, varying its size (Figure 4). On one NVIDIA Titan XP GPU, averaged across all target difficulties, our question generation model takes half the time to achieve the same quality as pool selection. The gap increases when sampling harder questions (d < 0.5): even a pool of 1,000 questions does not contain sufficiently difficult questions, likely due to a skew in the Duolingo question set. Additional controls, such as for style or topic, can easily be combined with our generation method, but would make pool selection exponentially more complex.
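The pool-selection baseline can be sketched as scoring every pool question and keeping the k closest to the target difficulty. In the sketch below (names and toy values are ours), `difficulty_fn` stands in for the per-question LM-KT forward pass, whose cost is what grows with pool size:

```python
def select_from_pool(pool, difficulty_fn, target, k=30):
    """Rank a question pool by |predicted difficulty - target| and keep the top k.

    difficulty_fn stands in for LM-KT's p(<Y> | student, question); it must be
    evaluated once per pool question, so ranking cost scales linearly with the pool.
    """
    scored = sorted(pool, key=lambda q: abs(difficulty_fn(q) - target))
    return scored[:k]

# Toy pool where each question's difficulty is simply precomputed.
difficulties = {"she eats": 0.9, "i read": 0.7, "the red house": 0.4, "we ran fast": 0.2}
pool = list(difficulties)
print(select_from_pool(pool, difficulties.get, target=0.3, k=2))
```

If the pool is skewed toward easy questions, no amount of ranking produces a hard one, which is the failure mode observed for d < 0.5.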

Conclusion
Our work is a first step toward showing that sequence-based models combined with domain knowledge, such as pre-trained LMs, can be leveraged for adaptive learning tasks. We show how to use modern LMs to generate novel reverse-translation questions that achieve a target difficulty, allowing adaptive education methods to expand beyond limited question pools. Limitations of our approach include the compute constraints of large LMs and training data availability. More detailed student data will be crucial to future model development: for instance, while most publicly available education datasets do not include full student responses (e.g. the full translation response in Duolingo), such information could significantly improve the performance of our LM-KT model. Other future directions include exploring non-language domains, such as math or logic exercises, and controlling for auxiliary objectives such as question topic.
Finally, designing appropriate user studies to evaluate our method is a complex yet critical next step to determine its suitability in a real-world education setting. Our technique allows controlling for individual student difficulty, but it leaves open the question of optimal curriculum design using difficulty-directed question generation.

Broader Impact
Online education platforms can increase the accessibility of high-quality educational resources for students around the world. Adaptive techniques that allow for more individualized learning strategies can help such technologies be more inclusive for students who make less-common mistakes or have different prior backgrounds (Lee and Brunskill, 2012). However, our method is subject to biases found in the training data, and careful consideration of using safe and appropriate data is crucial in an education context. Moreover, our specific use of pre-trained LMs relies on the significant progress of NLP tools for the English language; further research and development of these tools for other languages can help ensure our method benefits a larger population of students.

A.1 Dataset Details
The 2018 Duolingo Shared Task on Second Language Acquisition Modeling (Settles et al., 2018) dataset contains questions and responses for Duolingo users over the first 30 days of learning a second language. The dataset contains three different question types: reverse translate (free response translation of a given prompt in the language they are learning), reverse tap (a selection-based equivalent of reverse translate), and listen, where students listen to a vocal utterance. We focus on the reverse translate question type for English-speaking students learning French and Spanish. The dataset size for French learners (1.2k users) is roughly half the size of that for Spanish learners (2.6k users).
Because the original dataset was intended for per-token error prediction, each question has per-token information that includes whether the student translated the token correctly, as well as Universal Dependencies tags such as part of speech and morphology labels. We use the full question text, rather than individual tokens, for our task, and combine the labels such that if a Duolingo user incorrectly translated one or more tokens in a question, the entire question is marked incorrect. We do not use any additional features.
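The label-collapsing rule can be sketched directly; the per-token boolean flags below are our own simplified stand-in for the Shared Task's token-level annotations:

```python
def collapse_labels(token_labels):
    """Mark a question correct only if every token was translated correctly.

    token_labels: per-token flags, True meaning the token was translated correctly.
    A single mistranslated token makes the whole question incorrect.
    """
    return all(token_labels)

# "ella come" with one wrong token -> whole question counted incorrect.
print(collapse_labels([True, False]))  # False
print(collapse_labels([True, True]))   # True
```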
We use the publicly provided train/dev/test splits from the Shared Task, which are temporally ordered in sequence. We therefore construct student states by tracking user IDs throughout the datasets and appending each new question and response to the current student state. When evaluating our LM-KT model, we use the true responses of preceding questions in the test set to form the student state for a given question. Overall, we find that the dataset is severely imbalanced (as in the original task): about 30% of questions are answered incorrectly across students studying both French and Spanish.
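Student-state construction over the temporally ordered records can be sketched as follows; the record format and names are our own simplification:

```python
from collections import defaultdict

def build_states(records):
    """Replay temporally ordered (user, question, correct) records.

    For each record, emit the student's state *before* that question -- the
    sequence LM-KT conditions on -- then append the question and true response.
    """
    states = defaultdict(list)
    examples = []
    for user, question, correct in records:
        examples.append((user, list(states[user]), question, correct))
        states[user].append((question, correct))
    return examples

records = [("u1", "she eats", True), ("u2", "i read", False), ("u1", "the woman", False)]
examples = build_states(records)
print(examples[2])  # u1's second question conditions on their first answer
```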
Finally, we create a held-out set of Duolingo questions for both French and Spanish learners to create the training data for our question generation model. From a set of random student states, we select questions from this set and use a trained LM-KT model to assign the difficulty score. In practice, this held-out set can come from any source, not just Duolingo data.

A.2 Model Training Details
To train both our LM-KT knowledge tracing model and our question generation model, we use the pre-trained OpenAI GPT-2 model from the HuggingFace Transformers library (Wolf et al., 2020). For question generation, we modify the library to add a linear layer and the modified loss function for question generation from Section 3.
We use one NVIDIA Titan XP GPU with 12GB of memory. Because the maximum input sequence length of the GPT-2 model we use is 1024 tokens, we truncate all inputs to their last 1024 tokens before training. We report results for an LM-KT model trained for 13k steps with the default batch size of 2 and learning rate of 5e-5, and a question generation model trained for 25k steps with the same batch size and learning rate. The total compute time to train both models was 2.5 hours for each language learning task.

A.3 Question Generation Details
For both the French and Spanish question generation models, we select 15 students unseen during training and generate 30 questions across 9 difficulties from 0.1 to 0.9, using nucleus sampling (Holtzman et al., 2020) (p = 0.99) with a maximum output length of 20 tokens. We also vary a repetition penalty (Keskar et al., 2019) that penalizes previously generated tokens (including those in the student state). Lastly, we truncate all prompts (student state and target difficulty) to fit into the GPT-2 model by taking the most recent 1024 tokens, as in training. This is a limitation of our work, as the full student history cannot be considered for students who have answered a large number of questions.
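Both decoding controls can be illustrated without the full model. Nucleus (top-p) sampling keeps the smallest set of highest-probability tokens whose mass reaches p; the repetition penalty downweights tokens that have already appeared (Keskar et al. apply the penalty to logits; for simplicity we divide probabilities here). The helpers below are our own simplified versions over an explicit distribution; HuggingFace's `generate` exposes the real mechanisms via its `top_p` and `repetition_penalty` arguments:

```python
def filter_top_p(probs, p):
    """Keep the smallest set of highest-probability tokens with mass >= p, renormalized."""
    ranked = sorted(probs.items(), key=lambda kv: -kv[1])
    kept, mass = {}, 0.0
    for token, prob in ranked:
        kept[token] = prob
        mass += prob
        if mass >= p:
            break
    return {t: pr / mass for t, pr in kept.items()}

def apply_repetition_penalty(probs, seen, penalty):
    """Divide the probability of previously generated tokens by `penalty`, then renormalize."""
    scaled = {t: (pr / penalty if t in seen else pr) for t, pr in probs.items()}
    total = sum(scaled.values())
    return {t: pr / total for t, pr in scaled.items()}

probs = {"the": 0.5, "she": 0.3, "ducks": 0.15, "mirror": 0.05}
print(filter_top_p(probs, 0.7))                          # drops the low-probability tail
print(apply_repetition_penalty(probs, {"the"}, 2.0))     # "the" loses mass to the rest
```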

A.4 Additional Question Generation Outputs
Our question generation model demonstrates the ability to generate novel questions that do not exist in the entire Duolingo question dataset, especially when a sampling penalty is applied to encourage more diverse outputs. However, this comes at a cost to fluency. Below we include a set of outputs generated by our model for 1 Spanish student and 1 French student from the Duolingo dataset, with a target difficulty of d = 0.1, both with and without a repetition penalty. We observe that while applying a penalty results in far more novel questions, several of them are also non-fluent, as judged by a combination of manual inspection and the Python language-check package (https://pypi.org/project/language-check/).

With repetition penalty:
she blames us!
she reads us lunchtime newspapers.
she reads your letters.
those ducks drink water.
we can abandon him.
what book have they Chosen me so far?
you can control her water.
you can establish two properties.
your house is very put-pretty!
previously on television
you can create the menu.
you write letters.
your hat is gray

Without repetition penalty:
clean the mirror.
i do not know it.
i read the newspaper.
i want a sandwich without cheese.
june starts tomorrow.
she reads the calendar.
the plates are not big.
we are following the clue.
we drink quickly.
we eat strawberries.
you can control the water.
you can create the menu.
you can establish a restaurant.