Russel Dsouza

2025

pdf bib abs
Automatic Scoring of an Open-Response Measure of Advanced Mind-Reading Using Large Language Models
Yixiao Wang | Russel Dsouza | Robert Lee | Ian Apperly | Rory Devine | Sanne van der Kleij | Mark Lee
Proceedings of the 10th Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2025)

A rigorous psychometric approach is crucial for the accurate measurement of mind-reading abilities. Traditional scoring methods for such tests, which involve lengthy free-text responses, require considerable time and human effort. This study investigates the use of large language models (LLMs) to automate the scoring of psychometric tests. Data were collected from participants aged 13 to 30 years and scored by trained human coders to establish a benchmark. We evaluated multiple LLMs against human assessments, exploring various prompting strate- gies to optimize performance and fine-tuning the models using a subset of the collected data to enhance accuracy. Our results demonstrate that LLMs can assess advanced mind-reading abilities with over 90% accuracy on average. Notably, in most test items, the LLMs achieved higher Kappa agreement with the lead coder than two trained human coders, highlighting their potential to reliably score open-response psychometric tests.

pdf bib abs
Sources of Disagreement in Data for LLM Instruction Tuning
Russel Dsouza | Venelin Kovatchev
Proceedings of Context and Meaning: Navigating Disagreements in NLP Annotation

In this paper we study the patterns of label disagreement in data used for instruction tuning Large Language models (LLMs). Specifically, we focus on data used for Reinforcement Learning from Human Feedback (RLHF). Our objective is to determine what is the primary source of disagreement: the individual data points, the choice of annotators, or the task formulation. We annotate the same dataset multiple times under different conditions and compare the overall agreement and the patterns of disagreement. For task formulation, we compare “single” format where annotators rate LLM responses individually with “preference” format where annotators select one of two possible responses. For annotators, we compare data from human labelers with automatic data labeling using LLMs. Our results indicate that: (1) there are very few “universally ambiguous” instances. The label disagreement depends largely on the task formulation and the choice of annotators; (2) the overall agreement remains consistent across experiments. We find no evidence that “preference” data is of higher quality than “single” data; and (3) the change of task formulation and annotators impacts the resulting instance-level labels. The labels obtained in different experiments are correlated, but not identical.

In this paper, we present the system proposed by our team OldJoe, for the 8th edition of the AVeriTeC shared task, as part of the FEVER workshop. The objective of this task is to verify the factuality of real-world claims. Our approach integrates open source large language models, SQL, and in-context learning. We begin with embedding the knowledge store using a pretrained embedding language model then storing the outputs in a SQL database. Subsequently, we prompt an LLM to craft relevant questions based on the input claim, which are then used to guide the retrieval process. We further prompt the LLM to generate answers to the questions and predict the veracity of the original claim. Our system scored 0.49 on the HU-METEOR AVeriTeC score on the dev set and 0.15 on the Ev2R recall on the test set. Due to the time constraint we were unable to conduct additional experiments or further hyperparameter tuning. As a result, we adopted this pipeline configuration centered on the Qwen3-14B-AWQ model as our final submission strategy. The full pipeline is available on GitHub: https://github.com/farahft/OldJoe

Co-authors

Rory Devine 1

Yue Feng 1

Farah Ftouhi 1

Lance Calvin Lim Gamboa 1

Robert Lee 1

Yixiao Wang 1

Sanne van der Kleij 1

Venues

Fix author