KHANQ: A Dataset for Generating Deep Questions in Education

Huanli Gong, Liangming Pan, Hengchang Hu


Abstract
Designing in-depth educational questions is a time-consuming and cognitively demanding task. It is therefore worthwhile to study how to build Question Generation (QG) models that automate the question creation process. However, existing QG datasets are not suitable for educational question generation because their questions are not real questions asked by humans during learning and can be answered by simply searching for information. To bridge this gap, we present KHANQ, a challenging dataset for educational question generation, containing 1,034 high-quality learner-generated questions that seek an in-depth understanding of online courses taught on Khan Academy. Each data sample is carefully paraphrased and annotated as a triple of 1) Context: an independent paragraph on which the question is based; 2) Prompt: a text prompt for the question (e.g., the learner’s background knowledge); 3) Question: a deep question based on Context and coherent with Prompt. Through a human evaluation of appropriateness, coverage, coherence, and complexity, we show that state-of-the-art QG models that perform well on shallow question generation datasets have difficulty generating useful educational questions. This makes KHANQ a challenging testbed for educational question generation.
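The (Context, Prompt, Question) triple described in the abstract can be pictured as a small record type. The following is a minimal sketch under stated assumptions, not the dataset's actual file format or the authors' code: the class name `KhanQSample`, the helper `to_seq2seq_pair`, the ` [SEP] ` separator, the prompt-then-context ordering, and the example text are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class KhanQSample:
    """Hypothetical container for one (Context, Prompt, Question) triple."""
    context: str   # independent paragraph the question is based on
    prompt: str    # text prompt, e.g. the learner's background knowledge
    question: str  # deep question based on context and coherent with prompt


def to_seq2seq_pair(sample: KhanQSample, sep: str = " [SEP] ") -> Tuple[str, str]:
    """Build a (source, target) pair for a generic sequence-to-sequence QG model.

    The separator string and the prompt-then-context ordering are assumptions,
    not something specified by the abstract.
    """
    source = f"{sample.prompt.strip()}{sep}{sample.context.strip()}"
    return source, sample.question.strip()


# Invented placeholder text for illustration; not taken from the dataset.
example = KhanQSample(
    context="Plants convert light energy into chemical energy during photosynthesis.",
    prompt="I know that plants need sunlight to grow.",
    question="Why can't plants use heat instead of light to drive photosynthesis?",
)

source, target = to_seq2seq_pair(example)
print(source)
print(target)
```

Keeping the prompt as a separate field, rather than merging it into the context up front, makes it easy to try different input layouts for a QG model without touching the stored samples.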
Anthology ID:
2022.coling-1.518
Volume:
Proceedings of the 29th International Conference on Computational Linguistics
Month:
October
Year:
2022
Address:
Gyeongju, Republic of Korea
Venue:
COLING
Publisher:
International Committee on Computational Linguistics
Pages:
5925–5938
URL:
https://aclanthology.org/2022.coling-1.518
Cite (ACL):
Huanli Gong, Liangming Pan, and Hengchang Hu. 2022. KHANQ: A Dataset for Generating Deep Questions in Education. In Proceedings of the 29th International Conference on Computational Linguistics, pages 5925–5938, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
Cite (Informal):
KHANQ: A Dataset for Generating Deep Questions in Education (Gong et al., COLING 2022)
PDF:
https://aclanthology.org/2022.coling-1.518.pdf
Data
HotpotQA