Increasing the Difficulty of Automatically Generated Questions via Reinforcement Learning with Synthetic Preference for Cost-Effective Cultural Heritage Dataset Generation

William Thorne; Ambrose Robinson; Bohua Peng; Chenghua Lin; Diana Maynard

doi:10.18653/v1/2024.nlp4dh-1.43

Abstract

As the cultural heritage sector increasingly adopts technologies like Retrieval-Augmented Generation (RAG) to provide more personalised search experiences and enable conversations with collections data, the demand for specialised evaluation datasets has grown. While end-to-end system testing is essential, it’s equally important to assess individual components. We target the final, answering task, which is well-suited to Machine Reading Comprehension (MRC). Although existing MRC datasets address general domains, they lack the specificity needed for cultural heritage information. Unfortunately, the manual creation of such datasets is prohibitively expensive for most heritage institutions. This paper presents a cost-effective approach for generating domain-specific MRC datasets with increased difficulty using Reinforcement Learning from Human Feedback (RLHF) from synthetic preference data. Our method leverages the performance of existing question-answering models on a subset of SQuAD to create a difficulty metric, assuming that more challenging questions are answered correctly less frequently. This research contributes: (1) A methodology for increasing question difficulty using PPO and synthetic data; (2) Empirical evidence of the method’s effectiveness, including human evaluation; (3) An in-depth error analysis and study of emergent phenomena; and (4) An open-source codebase and set of three llama-2-chat adapters for reproducibility and adaptation.
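The difficulty assumption stated in the abstract (questions answered correctly less often are harder) can be illustrated with a minimal sketch. It is not the paper's implementation: the function names `difficulty` and `preference_pairs`, the input layout, and the choice of "one minus the fraction of models answering correctly" as the score are all assumptions made for illustration, as is the idea of turning scores into (chosen, rejected) pairs for preference training.

```python
from itertools import combinations


def difficulty(correct_flags):
    """Hypothetical score: fraction of QA models that answered incorrectly."""
    if not correct_flags:
        raise ValueError("need at least one model result")
    return 1.0 - sum(correct_flags) / len(correct_flags)


def preference_pairs(results):
    """Build (chosen, rejected) pairs where the chosen question is strictly harder.

    `results` maps a question string to a list of booleans, one per QA model,
    where True means that model answered the question correctly (assumed layout).
    """
    scored = {question: difficulty(flags) for question, flags in results.items()}
    pairs = []
    for a, b in combinations(scored, 2):
        if scored[a] > scored[b]:
            pairs.append((a, b))
        elif scored[b] > scored[a]:
            pairs.append((b, a))
        # Ties carry no preference signal, so they are skipped.
    return pairs


# Illustrative data only: three questions, four hypothetical QA models.
results = {
    "q_easy": [True, True, True, True],
    "q_mid": [True, False, True, False],
    "q_hard": [False, False, False, True],
}
pairs = preference_pairs(results)
```

In this sketch each pair marks the harder question as the preferred output, which is the kind of synthetic preference signal a reward model could be trained on before PPO fine-tuning.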

Anthology ID: 2024.nlp4dh-1.43
Volume: Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities
Month: November
Year: 2024
Address: Miami, USA
Editors: Mika Hämäläinen, Emily Öhman, So Miyagawa, Khalid Alnajjar, Yuri Bizzoni
Venues: NLP4DH | WS
Publisher: Association for Computational Linguistics
Pages: 450–462
URL: https://aclanthology.org/2024.nlp4dh-1.43/
DOI: 10.18653/v1/2024.nlp4dh-1.43
Cite (ACL): William Thorne, Ambrose Robinson, Bohua Peng, Chenghua Lin, and Diana Maynard. 2024. Increasing the Difficulty of Automatically Generated Questions via Reinforcement Learning with Synthetic Preference for Cost-Effective Cultural Heritage Dataset Generation. In Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities, pages 450–462, Miami, USA. Association for Computational Linguistics.
Cite (Informal): Increasing the Difficulty of Automatically Generated Questions via Reinforcement Learning with Synthetic Preference for Cost-Effective Cultural Heritage Dataset Generation (Thorne et al., NLP4DH 2024)
PDF: https://aclanthology.org/2024.nlp4dh-1.43.pdf
