Diane Nicholls


2026

We investigate the automated evaluation of English language learner answers to writing tasks featuring picture stories. Such evaluation is usually limited to language proficiency alone, neglecting the context of the picture. Instead, our analysis focuses on task adherence, which, for example, allows the detection of off-topic answers. Since there is a lack of suitable training and evaluation data, our first step is to build the PictureStories dataset. To this end, we develop a marking rubric that covers task adherence with respect to both form and content. Six annotators mark 713 learner answers written in response to one of five picture stories. Having assembled the dataset, we then explore to what extent task adherence can be predicted automatically. Our experiments assume a scenario where no or only a few labelled answers are available for the picture story being marked. For form-focused criteria, we find that it is beneficial to finetune models across tasks. For content-focused criteria, few-shot prompting Qwen emerges as the best-performing method. We examine the trade-off between including the story image vs. example answers in the prompt and find that examples suffice in many cases. While for some LLMs few-shot prompting results may look promising on the surface, we demonstrate that a much simpler method can do just as well when shown the same examples.

2025

Knowing which words language learners struggle with is crucial for developing personalised education technologies. In this paper, we advocate for the novel task of “dictionary look-up prediction” as a means for evaluating the complexity of words in reading tasks. We release the Dictionary Look-Up development dataset (DLU-dev) and the Dialogue Dictionary Look-Up dataset (D-DLU), which is based on chatbot dialogues. We demonstrate that dictionary look-up is a challenging task for LLMs (results are presented for LLaMA, Gemma, and Longformer models). We explore finetuning with the ROC* loss function as a more appropriate loss for this task than the commonly used Binary Cross Entropy (BCE). We show that a feature-based model outperforms the LLMs. Finally, we investigate the transfer between DLU and the related tasks of Complex Word Identification (CWI) and Semantic Error Prediction (SEP), establishing new state-of-the-art results for SEP.
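The abstract above contrasts the commonly used Binary Cross Entropy with an AUC-oriented loss (ROC*) for look-up prediction. As a hedged illustration only (not the paper's actual implementation), the sketch below contrasts averaged BCE with a simple pairwise AUC surrogate that penalises positive/negative score pairs ranked in the wrong order; the function names and the `margin` parameter are assumptions made for this example:

```python
import numpy as np

def bce_loss(y_true, p):
    """Standard binary cross-entropy, averaged over all examples."""
    p = np.clip(p, 1e-7, 1 - 1e-7)  # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def pairwise_auc_loss(y_true, scores, margin=0.1):
    """Illustrative pairwise AUC surrogate: penalise every
    positive/negative pair where the positive example's score does not
    exceed the negative's by at least `margin`. A stand-in for the
    general idea behind AUC-oriented losses such as ROC*, not the
    published formulation."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diffs = pos[:, None] - neg[None, :]  # all positive/negative score gaps
    return np.mean(np.maximum(0.0, margin - diffs) ** 2)

# Toy labels (1 = word looked up) and model scores
y = np.array([1, 1, 0, 0])
s = np.array([0.9, 0.6, 0.4, 0.2])
print(bce_loss(y, s))
print(pairwise_auc_loss(y, s))
```

The point of the contrast: BCE rewards calibrated per-example probabilities, whereas a pairwise loss only cares that looked-up words outrank non-looked-up ones, which matches a ranking-style evaluation of word difficulty.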

2024

Essay writing is a skill commonly taught and practised in schools. The ability to write a fluent and persuasive essay is often a major component of formal assessment. In natural language processing and education technology we may work with essays in their final form, for example to carry out automated assessment or grammatical error correction. In this work we collect and analyse data representing the essay writing process from start to finish, by recording every keystroke from multiple writers participating in our study. We describe our data collection methodology, the characteristics of the resulting dataset, and the assignment of proficiency levels to the texts. We discuss the ways the keystroke data can be used – for instance, identifying patterns in the keystrokes which might act as features in automated assessment or enable further advancements in writing assistance – and the writing support technology which could be built with such information, for example if we can detect when writers are struggling to compose a section of their essay and offer appropriate intervention. We frame this work in the context of English language learning, but we note that keystroke logging is relevant more broadly to text authoring scenarios as well as cognitive or linguistic analyses of the writing process.