Andrea Cappelli
2026
Beyond Names: How Grammatical Gender Markers Bias LLM-based Educational Recommendations
Luca Benedetto | Antonia Donvito | Alberto Lucchetti | Andrea Cappelli | Paula Buttery
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
This paper investigates gender biases exhibited by LLM-based virtual assistants when providing educational recommendations, focusing on minimal gender indicators. Experimenting on Italian, a language with grammatical gender, we demonstrate that simply changing noun and adjective endings (e.g., from masculine "-o" to feminine "-a") significantly shifts recommendations. More specifically, we find that LLMs i) recommend STEM disciplines less for prompts with feminine grammatical gender and ii) narrow down the set of disciplines recommended to prompts with masculine grammatical gender; these effects persist across multiple commercial LLMs (from OpenAI, Anthropic, and Google). We show that grammatical gender cues alone trigger substantial distributional shifts in educational recommendations, and up to 76% of the bias exhibited when using prompts with proper names is already present with grammatical gender markers alone. Our findings highlight the need for robust bias evaluation and mitigation strategies before deploying LLM-based virtual assistants in student-facing contexts and the risks of using general purpose LLMs for educational applications, especially in languages with grammatical gender.
2024
Using LLMs to simulate students’ responses to exam questions
Luca Benedetto | Giovanni Aradelli | Antonia Donvito | Alberto Lucchetti | Andrea Cappelli | Paula Buttery
Findings of the Association for Computational Linguistics: EMNLP 2024
Previous research leveraged Large Language Models (LLMs) in numerous ways in the educational domain. Here, we show that they can be used to answer exam questions simulating students of different skill levels and share a prompt, engineered for GPT-3.5, that enables the simulation of varying student skill levels on questions from different educational domains. We evaluate the proposed prompt on three publicly available datasets (one from science exams and two from English reading comprehension exams) and three LLMs (two versions of GPT-3.5 and one of GPT-4), and show that it is robust to different educational domains and capable of generalising to data unseen during the prompt engineering phase. We also show that, being engineered for a specific version of GPT-3.5, the prompt does not generalise well to different LLMs, stressing the need for prompt engineering for each model in practical applications. Lastly, we find that there is not a direct correlation between the quality of the rationales obtained with chain-of-thought prompting and the accuracy in the student simulation task.
2021
On the application of Transformers for estimating the difficulty of Multiple-Choice Questions from text
Luca Benedetto | Giovanni Aradelli | Paolo Cremonesi | Andrea Cappelli | Andrea Giussani | Roberto Turrin
Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications
Classical approaches to question calibration are either subjective or require newly created questions to be deployed before being calibrated. Recent works explored the possibility of estimating question difficulty from text, but did not experiment with the most recent NLP models, in particular Transformers. In this paper, we compare the performance of previous literature with Transformer models experimenting on a public and a private dataset. Our experimental results show that Transformers are capable of outperforming previously proposed models. Moreover, if an additional corpus of related documents is available, Transformers can leverage that information to further improve calibration accuracy. We characterize the dependence of the model performance on some properties of the questions, showing that it performs best on questions ending with a question mark and Multiple-Choice Questions (MCQs) with one correct choice.