Antonia Donvito


2026

This paper investigates gender biases exhibited by LLM-based virtual assistants when providing educational recommendations, focusing on minimal gender indicators. Experimenting in Italian, a language with grammatical gender, we demonstrate that simply changing noun and adjective endings (e.g., from the masculine "-o" to the feminine "-a") significantly shifts recommendations. More specifically, we find that LLMs i) recommend STEM disciplines less often for prompts with feminine grammatical gender and ii) narrow the set of disciplines recommended for prompts with masculine grammatical gender; these effects persist across multiple commercial LLMs (from OpenAI, Anthropic, and Google). We show that grammatical gender cues alone trigger substantial distributional shifts in educational recommendations, and that up to 76% of the bias exhibited with prompts containing proper names is already present with grammatical gender markers alone. Our findings highlight the need for robust bias evaluation and mitigation strategies before deploying LLM-based virtual assistants in student-facing contexts, and the risks of using general-purpose LLMs for educational applications, especially in languages with grammatical gender.
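
The minimal-pair manipulation described in the abstract can be illustrated with a short sketch. This is not the paper's code: the prompts, the STEM label set, and the query_llm stub below are hypothetical stand-ins for the actual experimental setup.

```python
from collections import Counter

# Hypothetical Italian prompts differing only in grammatical gender:
# the adjective endings flip from masculine "-o" to feminine "-a"
# (and the noun from "studente" to "studentessa").
PROMPTS = {
    "masculine": "Sono uno studente bravo e motivato. Che corso di laurea mi consigli?",
    "feminine":  "Sono una studentessa brava e motivata. Che corso di laurea mi consigli?",
}

# Illustrative STEM label set; the paper's actual discipline taxonomy may differ.
STEM = {"ingegneria", "informatica", "matematica", "fisica"}

def query_llm(prompt: str) -> str:
    """Placeholder for a call to a commercial LLM API (OpenAI, Anthropic,
    or Google); should return a single recommended discipline."""
    raise NotImplementedError

def recommendation_distribution(gender: str, n_samples: int = 100) -> Counter:
    """Sample the model repeatedly and count the disciplines it recommends."""
    return Counter(query_llm(PROMPTS[gender]).strip().lower() for _ in range(n_samples))

def stem_share(dist: Counter) -> float:
    """Fraction of recommendations that fall in the STEM set."""
    return sum(n for d, n in dist.items() if d in STEM) / sum(dist.values())
```

Comparing stem_share and the number of distinct disciplines across the two distributions corresponds to the two effects reported: fewer STEM recommendations for feminine prompts and a narrower recommendation set for masculine ones.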

2024

Previous research has leveraged Large Language Models (LLMs) in numerous ways in the educational domain. Here, we show that they can be used to answer exam questions while simulating students of different skill levels, and we share a prompt, engineered for GPT-3.5, that enables the simulation of varying student skill levels on questions from different educational domains. We evaluate the proposed prompt on three publicly available datasets (one from science exams and two from English reading comprehension exams) and three LLMs (two versions of GPT-3.5 and one of GPT-4), and show that it is robust across educational domains and capable of generalising to data unseen during the prompt engineering phase. We also show that, being engineered for a specific version of GPT-3.5, the prompt does not generalise well to other LLMs, stressing the need for model-specific prompt engineering in practical applications. Lastly, we find that there is no direct correlation between the quality of the rationales obtained with chain-of-thought prompting and the accuracy in the student simulation task.
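
As a rough illustration of the simulation setup (the engineered prompt itself is given in the paper and is specific to GPT-3.5), a target skill level can be injected into a templated prompt and the simulated answers scored against the gold labels. Everything below, including the template wording and the query_llm callable, is a hypothetical sketch, not the paper's prompt.

```python
from string import Template

# Hypothetical template; the paper's engineered prompt is GPT-3.5-specific
# and not reproduced here.
STUDENT_TEMPLATE = Template(
    "You are simulating a student whose skill level is $level on a scale "
    "from 1 (lowest) to 5 (highest).\n"
    "Answer the following multiple-choice question as that student would, "
    "replying with a single option letter.\n\n"
    "Question: $question\nOptions:\n$options"
)

def simulate_student(level: int, question: str, options: list[str], query_llm) -> str:
    """Render the prompt for one skill level and return the model's answer."""
    prompt = STUDENT_TEMPLATE.substitute(
        level=level,
        question=question,
        options="\n".join(f"{letter}) {text}" for letter, text in zip("ABCD", options)),
    )
    return query_llm(prompt)

def accuracy_by_level(items, query_llm) -> dict[int, float]:
    """For each skill level, the fraction of items answered correctly.
    `items` is a list of (question, options, gold_letter) triples."""
    scores = {}
    for level in range(1, 6):
        correct = sum(
            simulate_student(level, q, opts, query_llm) == gold
            for q, opts, gold in items
        )
        scores[level] = correct / len(items)
    return scores
```

A well-calibrated simulation should show accuracy rising with the target skill level; the paper measures this on science exam and English reading comprehension datasets.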