PROTEGE: Prompt-based Diverse Question Generation from Web Articles

Vinayak Puranik, Anirban Majumder, Vineet Chaoji


Abstract
Rich and diverse knowledge bases (KBs) are foundational building blocks for online knowledge-sharing communities such as StackOverflow and Quora, and for applications such as conversational assistants (chatbots). A popular format for knowledge bases is question-answer pairs (or FAQs), where questions are designed to accurately match a multitude of queries. In this paper, we address the problem of automatically creating such Q&A-based knowledge bases from domain-specific, long-form textual content (e.g., web articles). Specifically, we consider the problem of question generation, which is the task of generating questions given a paragraph of text as input, with the goal of achieving both diversity and fidelity in the generated questions. Toward this goal, we propose PROTEGE, a diverse question generation framework that consists of (1) a novel encoder-decoder based Large Language Model (LLM) architecture that can take a variety of prompts and generate a diverse set of candidate questions, and (2) a hill-climbing algorithm that maximizes a sub-modular objective function to balance diversity with fidelity. Through experiments on three popular public Q&A datasets, we demonstrate that PROTEGE improves diversity by +16% and fidelity by +8% over diverse beam search and prompt-based baselines.
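The abstract does not spell out the objective function, so the following is only a minimal sketch of how greedy hill-climbing over a sub-modular objective might trade fidelity against diversity. Everything in it is an assumption made for illustration: the objective (sum of per-question fidelity scores plus a weighted count of distinct tokens covered), the names `tokens`, `objective`, and `greedy_select`, the weight `lam`, and the fidelity scores themselves. It is not the authors' implementation.

```python
from typing import Dict, List, Set


def tokens(question: str) -> Set[str]:
    """Lowercased word set of a question (a crude stand-in for its content)."""
    return set(question.lower().replace("?", "").split())


def objective(selected: List[str], fidelity: Dict[str, float], lam: float) -> float:
    """Hypothetical objective: total fidelity plus lam * distinct tokens covered.

    The fidelity sum is modular and token coverage is monotone sub-modular,
    so their weighted sum is monotone sub-modular.
    """
    covered: Set[str] = set()
    for q in selected:
        covered |= tokens(q)
    return sum(fidelity[q] for q in selected) + lam * len(covered)


def greedy_select(
    candidates: List[str], fidelity: Dict[str, float], k: int, lam: float = 0.5
) -> List[str]:
    """Hill-climb: repeatedly add the candidate with the largest marginal gain."""
    selected: List[str] = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        base = objective(selected, fidelity, lam)
        best = max(
            remaining,
            key=lambda q: objective(selected + [q], fidelity, lam) - base,
        )
        selected.append(best)
        remaining.remove(best)
    return selected
```

In this sketch, a near-duplicate question contributes few new tokens once a similar question has been selected, so its marginal gain shrinks and a less similar candidate tends to be preferred even if its fidelity score is somewhat lower.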
Anthology ID: 2023.findings-emnlp.362
Volume: Findings of the Association for Computational Linguistics: EMNLP 2023
Month: December
Year: 2023
Address: Singapore
Editors: Houda Bouamor, Juan Pino, Kalika Bali
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 5449–5463
URL: https://aclanthology.org/2023.findings-emnlp.362
DOI: 10.18653/v1/2023.findings-emnlp.362
Cite (ACL): Vinayak Puranik, Anirban Majumder, and Vineet Chaoji. 2023. PROTEGE: Prompt-based Diverse Question Generation from Web Articles. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 5449–5463, Singapore. Association for Computational Linguistics.
Cite (Informal): PROTEGE: Prompt-based Diverse Question Generation from Web Articles (Puranik et al., Findings 2023)
PDF: https://aclanthology.org/2023.findings-emnlp.362.pdf