Giovanni Aradelli


2024

Using LLMs to simulate students’ responses to exam questions
Luca Benedetto | Giovanni Aradelli | Antonia Donvito | Alberto Lucchetti | Andrea Cappelli | Paula Buttery
Findings of the Association for Computational Linguistics: EMNLP 2024

Previous research leveraged Large Language Models (LLMs) in numerous ways in the educational domain. Here, we show that they can be used to answer exam questions simulating students of different skill levels and share a prompt, engineered for GPT-3.5, that enables the simulation of varying student skill levels on questions from different educational domains. We evaluate the proposed prompt on three publicly available datasets (one from science exams and two from English reading comprehension exams) and three LLMs (two versions of GPT-3.5 and one of GPT-4), and show that it is robust to different educational domains and capable of generalising to data unseen during the prompt engineering phase. We also show that, being engineered for a specific version of GPT-3.5, the prompt does not generalise well to different LLMs, stressing the need for prompt engineering for each model in practical applications. Lastly, we find that there is not a direct correlation between the quality of the rationales obtained with chain-of-thought prompting and the accuracy in the student simulation task.

2021

On the application of Transformers for estimating the difficulty of Multiple-Choice Questions from text
Luca Benedetto | Giovanni Aradelli | Paolo Cremonesi | Andrea Cappelli | Andrea Giussani | Roberto Turrin
Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications

Classical approaches to question calibration are either subjective or require newly created questions to be deployed before being calibrated. Recent works explored the possibility of estimating question difficulty from text, but did not experiment with the most recent NLP models, in particular Transformers. In this paper, we compare the performance of previous literature with Transformer models experimenting on a public and a private dataset. Our experimental results show that Transformers are capable of outperforming previously proposed models. Moreover, if an additional corpus of related documents is available, Transformers can leverage that information to further improve calibration accuracy. We characterize the dependence of the model performance on some properties of the questions, showing that it performs best on questions ending with a question mark and Multiple-Choice Questions (MCQs) with one correct choice.