Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Works in Progress

Joshua Wilson, Christopher Ormerod, Magdalen Beiting Parrish (Editors)


Anthology ID: 2025.aimecon-wip
Month: October
Year: 2025
Address: Wyndham Grand Pittsburgh, Downtown, Pittsburgh, Pennsylvania, United States
Venue: AIME-Con
Publisher: National Council on Measurement in Education (NCME)
URL: https://aclanthology.org/2025.aimecon-wip/
ISBN: 979-8-218-84229-1
PDF: https://aclanthology.org/2025.aimecon-wip.pdf

Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Works in Progress
Joshua Wilson | Christopher Ormerod | Magdalen Beiting Parrish

Automated Item Neutralization for Non-Cognitive Scales: A Large Language Model Approach to Reducing Social-Desirability Bias
Sirui Wu | Daijin Yang

This study explores an AI-assisted approach for rewriting personality scale items to reduce social desirability bias. Using GPT-refined neutralized items based on the IPIP-BFM-50, we compare factor structures, item popularity, and correlations with the MC-SDS to evaluate construct validity and the effectiveness of AI-based item refinement in Chinese contexts.

AI as a Mind Partner: Cognitive Impact in Pakistan’s Educational Landscape
Eman Khalid | Hammad Javaid | Yashal Waseem | Natasha Sohail Barlas

This study explores how high school and university students in Pakistan perceive and use generative AI as a cognitive extension. Drawing on Extended Mind Theory, we evaluate AI’s impact on critical thinking and the ethical questions it raises. Findings reveal over-reliance, mixed emotional responses, and institutional uncertainty about AI’s role in learning.

Detecting Math Misconceptions: An AI Benchmark Dataset
Bethany Rittle-Johnson | Rebecca Adler | Kelley Durkin | L Burleigh | Jules King | Scott Crossley

To harness the promise of AI for improving math education, AI models need to be able to diagnose math misconceptions. We created an AI benchmark dataset on math misconceptions and other instructionally relevant errors, comprising over 52,000 explanations written across 15 math questions and scored by expert human raters.

Optimizing Opportunity: An AI-Driven Approach to Redistricting for Fairer School Funding
Jordan Abbott

We address national educational inequity driven by school district boundaries using a comparative AI framework. Our models, which redraw boundaries from scratch or consolidate existing districts, generate evidence-based plans that reduce funding and segregation disparities, offering policymakers scalable, data-driven solutions for systemic reform.

Automatic Grading of Student Work Using Simulated Rubric-Based Data and GenAI Models
Yiyao Yang | Yasemin Gulbahar

Grading of assessments in data science faces challenges related to scalability, consistency, and fairness. Synthetic datasets and GenAI enable us to simulate realistic code samples and evaluate them automatically using rubric-driven systems. The research proposes an automatic grading system for generated Python code samples and explores GenAI grading reliability through human-AI comparison.

Cognitive Engagement in GenAI Tutor Conversations: At-scale Measurement and Impact on Learning
Kodi Weatherholtz | Kelli Millwood Hill | Kristen Dicerbo | Walt Wells | Phillip Grimaldi | Maya Miller-Vedam | Charles Hogg | Bogdan Yamkovenko

We developed and validated a scalable LLM-based labeler for classifying student cognitive engagement in GenAI tutoring conversations. Higher engagement levels predicted improved next-item performance, though further research is needed to assess distal transfer and to disentangle effects of continued tutor use from true learning transfer.

Chain-of-Thought Prompting for Automated Evaluation of Revision Patterns in Young Student Writing
Tianwen Li | Michelle Hong | Lindsay Clare Matsumura | Elaine Lin Wang | Diane Litman | Zhexiong Liu | Richard Correnti

This study explores the use of ChatGPT-4.1 as a formative assessment tool for identifying revision patterns in young adolescents’ argumentative writing. ChatGPT-4.1 shows moderate agreement with human coders on identifying evidence-related revision patterns and fair agreement on explanation-related ones. Implications for LLM-assisted formative assessment of young adolescent writing are discussed.

Predicting and Evaluating Item Responses Using Machine Learning, Text Embeddings, and LLMs
Evelyn Johnson | Hsin-Ro Wei | Tong Wu | Huan Liu

This work-in-progress study compares the accuracy of machine learning and large language models in predicting student responses to field-test items on a social-emotional learning assessment. We evaluate how well each method replicates actual responses and compare the item parameters estimated from synthetic data with those derived from actual student data.

Evaluating LLM-Based Automated Essay Scoring: Accuracy, Fairness, and Validity
Yue Huang | Joshua Wilson

This study evaluates large language models (LLMs) for automated essay scoring (AES), comparing prompt strategies and fairness across student groups. We found that well-designed prompting helps LLMs approach traditional AES performance, but both differ from human scores for English language learners (ELLs): the traditional model shows larger overall gaps, while LLMs show subtler disparities.

Comparing AI tools and Human Raters in Predicting Reading Item Difficulty
Hongli Li | Roula Aldib | Chad Marchong | Kevin Fan

This study compares AI tools and human raters in predicting the difficulty of reading comprehension items without response data. Predictions from AI models (ChatGPT, Gemini, Claude, and DeepSeek) and human raters are evaluated against empirical difficulty values derived from student responses. Findings will inform AI’s potential to support test development.

When Machines Mislead: Human Review of Erroneous AI Cheating Signals
William Belzak | Chenhao Niu | Angel Ortmann Lee

This study examines how human proctors interpret AI-generated alerts for misconduct in remote assessments. Findings suggest proctors can identify false positives, though confirmation bias and differences across test-taker nationalities were observed. Results highlight opportunities to refine proctoring guidelines and strengthen fairness in human oversight of automated signals in high-stakes testing.

Fairness in Formative AI: Cognitive Complexity in Chatbot Questions Across Research Topics
Alexandra Barry Colbert | Karen D Wang

This study evaluates whether questions generated by a Socratic-style research AI chatbot designed to support project-based AP courses maintain cognitive complexity parity when the chatbot is given controversial versus non-controversial research topics. We present empirical findings indicating no significant differences in conversational complexity, highlighting implications for equitable AI use in formative assessment.

Keystroke Analysis in Digital Test Security: AI Approaches for Copy-Typing Detection and Cheating Ring Identification
Chenhao Niu | Yong-Siang Shih | Manqian Liao | Ruidong Liu | Angel Ortmann Lee

This project leverages AI-based analysis of keystroke and mouse data to detect copy-typing and identify cheating rings in the Duolingo English Test. By modeling behavioral biometrics, the approach provides actionable signals to proctors, enhancing digital test security for large-scale online assessment.

Talking to Learn: A SoTL Study of Generative AI-Facilitated Feynman Reviews
Madeline Rose Mattox | Natalie Hutchins | Jamie J Jirout

Structured Generative AI interactions have potential for scaffolding learning. This Scholarship of Teaching and Learning study analyzes 16 undergraduate students’ Feynman-style AI interactions (N=157) across a semester-long child-development course. Qualitative coding of the interactions explores engagement patterns, metacognitive support, and response consistency, informing ethical AI integration in higher education.

AI-Powered Coding of Elementary Students’ Small-Group Discussions about Text
Carla Firetto | P. Karen Murphy | Lin Yan | Yue Tang

We report reliability and validity evidence for AI-powered coding of 371 small-group discussion transcripts. Comparability and ground-truth checks suggested high consistency between AI-produced and human-produced codes. Research in progress is also investigating the reliability and validity of a new “quality” indicator to complement the current coding.

Evaluating the Reliability of Human–AI Collaborative Scoring of Written Arguments Using Rational Force Model
Noriko Takahashi | Abraham Onuorah | Alina Reznitskaya | Evgeny Chukharev | Ariel Sykes | Michele Flammia | Joe Oyler

This study aims to improve the reliability of a new AI collaborative scoring system used to assess the quality of students’ written arguments. The system draws on the Rational Force Model and focuses on classifying the functional relation of each proposition in terms of support, opposition, acceptability, and relevance.

Evaluating Deep Learning and Transformer Models on SME and GenAI Items
Joe Betts | William Muntean

This study leverages deep learning, transformer models, and generative AI to streamline test development by automating metadata tagging and item generation. Transformer models outperform simpler approaches, reducing SME workload. Ongoing research refines complex models and evaluates LLM-generated items, enhancing efficiency in test creation.

Comparison of AI and Human Scoring on A Visual Arts Assessment
Ning Jiang | Yue Huang | Jie Chen

This study examines the reliability and comparability of Generative AI scores versus human ratings on two performance tasks (text-based and drawing-based) in a fourth-grade visual arts assessment. Results show GPT-4 is consistent and aligned with human raters but more lenient, and its agreement with humans is slightly lower than the agreement between human raters.

Explainable Writing Scores via Fine-grained, LLM-Generated Features
James V Bruno | Lee Becker

Advancements in deep learning have enhanced Automated Essay Scoring (AES) accuracy but reduced interpretability. This paper investigates using LLM-generated features to train an explainable scoring model. By framing feature engineering as prompt engineering, state-of-the-art language technology can be integrated into simpler, more interpretable AES models.

Validating Generative AI Scoring of Constructed Responses with Cognitive Diagnosis
Hyunjoo Kim

This research explores the feasibility of applying the cognitive diagnosis assessment (CDA) framework to validate generative AI-based scoring of constructed responses (CRs). The classification information of CRs and item-parameter estimates from cognitive diagnosis models (CDMs) could provide additional validity evidence for AI-generated CR scores and feedback.

Automated Diagnosis of Students’ Number Line Strategies for Fractions
Zhizhi Wang | Dake Zhang | Min Li | Yuhan Tao

This study aims to develop and evaluate an AI-based platform that automatically grades and classifies problem-solving strategies and error types in students’ handwritten fraction representations involving number lines. We report the model development procedures and preliminary evaluation results comparing the platform with available LLMs and human expert annotations.

Medical Item Difficulty Prediction Using Machine Learning
Hope Oluwaseun Adegoke | Ying Du | Andrew Dwyer

This project aims to use machine learning models to predict medical exam item difficulty by combining item metadata, linguistic features, word embeddings, and semantic similarity measures for a sample of 1,000 items. The goal is to improve the accuracy of difficulty prediction in medical assessment.

Examining decoding items using engine transcriptions and scoring in early literacy assessment
Zachary Schultz | Mackenzie Young | Debbie Dugdale | Susan Lottridge

We investigate the reliability of two scoring approaches for early literacy decoding items, in which students are shown a word and asked to say it aloud. The approaches were rubric-based scoring of speech and human or AI transcription paired with varying explicit scoring rules. Initial results suggest rubric-based approaches perform better than transcription-based methods.

Addressing Few-Shot LLM Classification Instability Through Explanation-Augmented Distillation
William Muntean | Joe Betts

This study compares explanation-augmented knowledge distillation with few-shot in-context learning for LLM-based exam question classification. Fine-tuned smaller language models achieved competitive performance with greater consistency than large-model few-shot approaches, which exhibited notable variability across different examples. Hyperparameter selection proved essential, with extremely low learning rates significantly impairing model performance.

Identifying Biases in Large Language Model Assessment of Linguistically Diverse Texts
Lionel Hsien Meng | Shamya Karumbaiah | Vivek Saravanan | Daniel Bolt

The development of Large Language Models (LLMs) to assess student text responses is rapidly progressing but evaluating whether LLMs equitably assess multilingual learner responses is an important precursor to adoption. Our study provides an example procedure for identifying and quantifying bias in LLM assessment of student essay responses.

Implicit Biases in Large Vision–Language Models in Classroom Contexts
Peter Baldwin

Using a counterfactual, adversarial, audit-style approach, we tested whether ChatGPT-4o evaluates classroom lectures differently based on teacher demographics. The model was told only to rate lecture excerpts embedded within classroom images—without reference to the images themselves. Despite this, ratings varied systematically by teacher race and sex, revealing implicit bias.

Enhancing Item Difficulty Prediction in Large-scale Assessment with Large Language Model
Mubarak Mojoyinola | Olasunkanmi James Kehinde | Judy Tang

Field testing is a resource-intensive bottleneck in test development. This study applies an interpretable framework that leverages a Large Language Model (LLM) for structured feature extraction from TIMSS items. These features will train several classifiers, whose predictions will be explained using SHAP, providing actionable, diagnostic insights for item writers.

Leveraging LLMs for Cognitive Skill Mapping in TIMSS Mathematics Assessment
Ruchi J Sachdeva | Jung Yeon Park

This study evaluates ChatGPT-4’s potential to support validation of Q-matrices and analysis of complex skill–item interactions. By comparing its outputs to expert benchmarks, we assess accuracy, consistency, and limitations, offering insights into how large language models can augment expert judgment in diagnostic assessment and cognitive skill mapping.