uppdf
bib
Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Full Papers
Joshua Wilson
|
Christopher Ormerod
|
Magdalen Beiting Parrish
pdf
bib
abs
Input Optimization for Automated Scoring in Reading Assessment
Ji Yoon Jung
|
Ummugul Bezirhan
|
Matthias von Davier
This study examines input optimization for enhanced efficiency in automated scoring (AS) of reading assessments, which typically involve lengthy passages and complex scoring guides. We propose optimizing input size using question-specific summaries and simplified scoring guides. Findings indicate that input optimization via compression is achievable while maintaining AS performance.
pdf
bib
abs
Implementation Considerations for Automated AI Grading of Student Work
Zewei Tian
|
Alex Liu
|
Lief Esbenshade
|
Shawon Sarkar
|
Zachary Zhang
|
Kevin He
|
Min Sun
Nineteen K-12 teachers participated in a co-design pilot study of an AI education platform, testing assessment grading. Teachers valued AI’s rapid narrative feedback for formative assessment but distrusted automated scoring, preferring human oversight. Students appreciated immediate feedback but remained skeptical of AI-only grading, highlighting needs for trustworthy, teacher-centered AI tools.
pdf
bib
abs
Compare Several Supervised Machine Learning Methods in Detecting Aberrant Response Pattern
Yi Lu
|
Yu Zhang
|
Lorin Mueller
Aberrant response patterns (e.g., a test taker answers difficult questions correctly but fails easy ones) are first identified using the lz and lz* person-fit statistics. We then compare the performance of five supervised machine learning methods in detecting aberrant response patterns identified by lz or lz*.
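For reference, the lz index referred to above is the standardized log-likelihood person-fit statistic. The sketch below computes it under the Rasch model; this is illustrative code, not the study's implementation, and the lz* correction is not shown.

    import numpy as np

    def rasch_prob(theta, b):
        """P(correct) under the Rasch model for ability theta and item difficulties b."""
        return 1.0 / (1.0 + np.exp(-(theta - b)))

    def lz_statistic(responses, theta, b):
        """Standardized log-likelihood person-fit statistic (lz)."""
        p = rasch_prob(theta, b)
        l0 = np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))
        e_l0 = np.sum(p * np.log(p) + (1 - p) * np.log(1 - p))
        v_l0 = np.sum(p * (1 - p) * np.log(p / (1 - p)) ** 2)
        return (l0 - e_l0) / np.sqrt(v_l0)

    # Easy items missed, hard items answered correctly -> large negative lz (aberrant).
    b = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
    u = np.array([0, 0, 1, 1, 1])
    print(lz_statistic(u, theta=0.0, b=b))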
pdf
bib
abs
Leveraging multi-AI agents for a teacher co-design
Hongwen Guo
|
Matthew S. Johnson
|
Luis Saldivia
|
Michelle Worthington
|
Kadriye Ercikan
This study uses multi-AI agents to accelerate teacher co-design efforts. It innovatively links student profiles obtained from numerical assessment data to AI agents in natural language. The AI agents simulate human inquiry, enrich feedback, and ground it in teachers’ knowledge and practice, showing significant potential for transforming assessment practice and research.
pdf
bib
abs
Long context Automated Essay Scoring with Language Models
Christopher Ormerod
|
Gitit Kehat
In this study, we evaluate several models that incorporate architectural modifications to overcome the length limitations of the standard transformer architecture using the Kaggle ASAP 2.0 dataset. The models considered in this study include fine-tuned versions of XLNet, Longformer, ModernBERT, Mamba, and Llama models.
pdf
bib
abs
Optimizing Reliability Scoring for ILSAs
Ji Yoon Jung
|
Ummugul Bezirhan
|
Matthias von Davier
This study proposes an innovative method for evaluating cross-country scoring reliability (CCSR) in multilingual assessments, using hyperparameter optimization and a similarity-based weighted majority scoring within a single human scoring framework. Results show that this approach provides a cost-effective and comprehensive assessment of CCSR without the need for additional raters.
pdf
bib
abs
Exploring AI-Enabled Test Practice, Affect, and Test Outcomes in Language Assessment
Jill Burstein
|
Ramsey Cardwell
|
Ping-Lin Chuang
|
Allison Michalowski
|
Steven Nydick
We analyzed data from 25,969 test takers of a high-stakes, computer-adaptive English proficiency test to examine relationships between repeated use of AI-generated practice tests and performance, affect, and score-sharing behavior. Taking 1–3 practice tests was associated with higher scores and confidence, while higher usage showed different engagement and outcome patterns.
pdf
bib
abs
Develop a Generic Essay Scorer for Practice Writing Tests of Statewide Assessments
Yi Gui
This study examines whether NLP transfer learning techniques, specifically BERT, can be used to develop prompt-generic AES models for practice writing tests. Findings reveal that fine-tuned DistilBERT, without further pre-training, achieves high agreement (QWK ≈ 0.89), enabling scalable, robust AES models in statewide K-12 assessments without costly supplementary pre-training.
pdf
bib
abs
Towards assessing persistence in reading in young learners using pedagogical agents
Caitlin Tenison
|
Beata Beigman Klebanov
|
Noah Schroeder
|
Shan Zhang
|
Michael Suhan
|
Chuyang Zhang
This pilot study investigated the use of a pedagogical agent to administer a conversational survey to second graders following a digital reading activity, measuring comprehension, persistence, and enjoyment. Analysis of survey responses and behavioral log data provides evidence for recommendations for the design of agent-mediated assessment in early literacy.
pdf
bib
abs
LLM-Based Approaches for Detecting Gaming the System in Self-Explanation
Jiayi (Joyce) Zhang
|
Ryan S. Baker
|
Bruce M. McLaren
This study compares two LLM-based approaches for detecting gaming behavior in students’ open-ended responses within a math digital learning game. The sentence embedding method outperformed the prompt-based approach and was more conservative. Consistent with prior research, gaming correlated negatively with learning, highlighting LLMs’ potential to detect disengagement in open-ended tasks.
pdf
bib
abs
Evaluating the Impact of LLM-guided Reflection on Learning Outcomes with Interactive AI-Generated Educational Podcasts
Vishnu Menon
|
Andy Cherney
|
Elizabeth B. Cloude
|
Li Zhang
|
Tiffany Diem Do
This study examined whether embedding LLM-guided reflection prompts in an interactive AI-generated podcast improved learning and user experience compared to a version without prompts. Thirty-six undergraduates participated, and while learning outcomes were similar across conditions, reflection prompts reduced perceived attractiveness, highlighting a call for more research on reflective interactivity design.
pdf
bib
abs
Generative AI in the K–12 Formative Assessment Process: Enhancing Feedback in the Classroom
Mike Thomas Maksimchuk
|
Edward Roeber
|
Davie Store
This paper explores how generative AI can enhance K–12 formative assessment by improving feedback, supporting task design, fostering student metacognition, and building teacher assessment literacy. It addresses challenges of equity, ethics, and implementation, offering practical strategies and case studies to guide responsible AI integration in classroom formative assessment practices.
pdf
bib
abs
Using Large Language Models to Analyze Students’ Collaborative Argumentation in Classroom Discussions
Nhat Tran
|
Diane Litman
|
Amanda Godley
Collaborative argumentation enables students to build disciplinary knowledge and to think in disciplinary ways. We use Large Language Models (LLMs) to improve existing methods for collaboration classification and argument identification. Results suggest that LLMs are effective for both tasks and should be considered as a strong baseline for future research.
pdf
bib
abs
Evaluating Generative AI as a Mentor Resource: Bias and Implementation Challenges
Jimin Lee
|
Alena G Esposito
We explored how students’ perceptions of helpfulness and caring skew their ability to identify AI versus human mentorship responses. Emotionally resonant responses often lead to misattributions, indicating perceptual biases that shape mentorship judgments. The findings inform ethical, relational, and effective integration of AI in student support.
pdf
bib
abs
AI-Based Classification of TIMSS Items for Framework Alignment
Ummugul Bezirhan
|
Matthias von Davier
Large-scale assessments rely on expert panels to verify that test items align with prescribed frameworks, a labor-intensive process. This study evaluates the use of GPT-4o to classify TIMSS items to content domain, cognitive domain, and difficulty categories. Findings highlight the potential of language models to support scalable, framework-aligned item verification.
pdf
bib
abs
Towards Reliable Generation of Clinical Chart Items: A Counterfactual Reasoning Approach with Large Language Models
Jiaxuan Li
|
Saed Rezayi
|
Peter Baldwin
|
Polina Harik
|
Victoria Yaneva
This study explores GPT-4 for generating clinical chart items in medical education using three prompting strategies. Expert evaluations found many items usable or promising. The counterfactual approach enhanced novelty, and item quality improved with high-surprisal examples. This is the first investigation of LLMs for automated clinical chart item generation.
pdf
bib
abs
Using Whisper Embeddings for Audio-Only Latent Token Classification of Classroom Management Practices
Wesley Griffith Morris
|
Jessica Vitale
|
Isabel Arvelo
In this study, we developed a textless NLP system using a fine-tuned Whisper encoder to identify classroom management practices from noisy classroom recordings. The model segments teacher speech from non-teacher speech and performs multi-label classification of classroom practices, achieving acceptable accuracy without requiring transcript generation.
pdf
bib
abs
Comparative Study of Double Scoring Design for Measuring Mathematical Quality of Instruction
Jonathan Kyle Foster
|
James Drimalla
|
Nursultan Japashov
This study focuses on integrating automated scoring to address the extensive need for double scoring in classroom observation systems. We outline an accessible approach for determining the interchangeability of automated systems within comparative scoring design studies.
pdf
bib
abs
Toward Automated Evaluation of AI-Generated Item Drafts in Clinical Assessment
Tazin Afrin
|
Le An Ha
|
Victoria Yaneva
|
Keelan Evanini
|
Steven Go
|
Kristine DeRuchie
|
Michael Heilig
This study examines the classification of AI-generated clinical multiple-choice question drafts as “helpful” or “non-helpful” starting points. Expert judgments were analyzed, and multiple classifiers were evaluated—including feature-based models, fine-tuned transformers, and few-shot prompting with GPT-4. Our findings highlight the challenges and considerations for evaluation methods of AI-generated items in clinical test development.
pdf
bib
abs
Numeric Information in Elementary School Texts Generated by LLMs vs Human Experts
Anastasia Smirnova
|
Erin S. Lee
|
Shiying Li
We analyze GPT-4o’s ability to represent numeric information in texts for elementary school children and assess it with respect to the human baseline. We show that both humans and GPT-4o reduce the amount of numeric information when adapting informational texts for children but GPT-4o retains more complex numeric types than humans do.
pdf
bib
abs
Towards evaluating teacher discourse without task-specific fine-tuning data
Beata Beigman Klebanov
|
Michael Suhan
|
Jamie N. Mikeska
Teaching simulations with feedback are one way to provide teachers with practice opportunities to help improve their skill. We investigated methods to build evaluation models of teacher performance in leading a discussion in a simulated classroom, particularly for tasks with little performance data.
pdf
bib
abs
Linguistic proficiency of humans and LLMs in Japanese: Effects of task demands and content
May Lynn Reese
|
Anastasia Smirnova
We evaluate the linguistic proficiency of humans and LLMs on pronoun resolution in Japanese, using the Winograd Schema Challenge dataset. Humans outperform LLMs in the baseline condition, but we find evidence for task demand effects in both humans and LLMs. We also found that LLMs surpass human performance in scenarios referencing US culture, providing strong evidence for content effects.
pdf
bib
abs
Generative AI Teaching Simulations as Formative Assessment Tools within Preservice Teacher Preparation
Jamie N. Mikeska
|
Aakanksha Bhatia
|
Shreyashi Halder
|
Tricia Maxwell
|
Beata Beigman Klebanov
|
Benny Longwill
|
Kashish Behl
|
Calli Shekell
This paper examines how generative AI (GenAI) teaching simulations can be used as a formative assessment tool to gain insight into elementary preservice teachers’ (PSTs’) instructional abilities. This study investigated the teaching moves PSTs used to elicit student thinking in a GenAI simulation and their perceptions of the simulation’s
pdf
bib
abs
Using LLMs to identify features of personal and professional skills in an open-response situational judgment test
Cole Walsh
|
Rodica Ivan
|
Muhammad Zafar Iqbal
|
Colleen Robb
Current methods for assessing personal and professional skills lack scalability due to reliance on human raters, while NLP-based systems for assessing these skills fail to demonstrate construct validity. This study introduces a new method utilizing LLMs to extract construct-relevant features from responses to an assessment of personal and professional skills.
pdf
bib
abs
Automated Evaluation of Standardized Patients with LLMs
Andrew Emerson
|
Le An Ha
|
Keelan Evanini
|
Su Somay
|
Kevin Frome
|
Polina Harik
|
Victoria Yaneva
Standardized patients (SPs) are essential for clinical reasoning assessments in medical education. This paper introduces evaluation metrics that apply to both human and simulated SP systems. The metrics are computed using two LLM-as-a-judge approaches that align with human evaluators on SP performance, enabling scalable formative clinical reasoning assessments.
pdf
bib
abs
LLM-Human Alignment in Evaluating Teacher Questioning Practices: Beyond Ratings to Explanation
Ruikun Hou
|
Tim Fütterer
|
Babette Bühler
|
Patrick Schreyer
|
Peter Gerjets
|
Ulrich Trautwein
|
Enkelejda Kasneci
This study investigates the alignment between large language models (LLMs) and human raters in assessing teacher questioning practices, moving beyond rating agreement to the evidence selected to justify their decisions. Findings highlight LLMs’ potential to support large-scale classroom observation through interpretable, evidence-based scoring, with possible implications for concrete teacher feedback.
pdf
bib
abs
Leveraging Fine-tuned Large Language Models in Item Parameter Prediction
Suhwa Han
|
Frank Rijmen
|
Allison Ames Boykin
|
Susan Lottridge
The study introduces novel approaches for fine-tuning pre-trained LLMs to predict item response theory parameters directly from item texts and structured item attribute variables. The proposed methods were evaluated on a dataset of over 1,000 English Language Arts items currently in the operational pool for a large-scale assessment.
pdf
bib
abs
How Model Size, Temperature, and Prompt Style Affect LLM-Human Assessment Score Alignment
Julie Jung
|
Max Lu
|
Sina Chole Benker
|
Dogus Darici
We examined how model size, temperature, and prompt style affect Large Language Models’ (LLMs) alignment with human raters in assessing clinical reasoning skills. Model size emerged as a key factor in LLM-human score alignment. Findings reveal both the potential for scalable LLM-raters and the risks of relying on them exclusively.
pdf
bib
abs
Assessing AI skills: A washback point of view
Meirav Arieli-Attali
|
Beata Beigman Klebanov
|
Tenaha O’Reilly
|
Diego Zapata-Rivera
|
Tami Sabag-Shushan
|
Iman Awadie
The emerging dominance of AI in the perception of skills-of-the-future makes assessing AI skills necessary to help guide learning. Creating an assessment of AI skills poses some new challenges. We examine those from the point of view of washback, and exemplify using two exploration studies conducted with 9th grade students.
pdf
bib
abs
Using Generative AI to Develop a Common Metric in Item Response Theory
Peter Baldwin
We propose a method for linking independently calibrated item response theory (IRT) scales using large language models to generate shared parameter estimates across forms. Applied to medical licensure data, the approach reliably recovers slope values across all conditions and yields accurate intercepts when cross-form differences in item difficulty are small.
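For context, placing two independently calibrated forms on a common metric typically uses a linear transformation whose slope and intercept are estimated from shared item parameters. A mean-sigma sketch in generic notation (not necessarily the paper's estimator, which derives the shared estimates from an LLM):

    \theta_Y = A\,\theta_X + B, \qquad a_Y = \frac{a_X}{A}, \qquad b_Y = A\,b_X + B,
    A = \frac{\sigma\!\left(b_Y^{\text{shared}}\right)}{\sigma\!\left(b_X^{\text{shared}}\right)}, \qquad
    B = \mu\!\left(b_Y^{\text{shared}}\right) - A\,\mu\!\left(b_X^{\text{shared}}\right).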
pdf
bib
abs
Augmented Measurement Framework for Dynamic Validity and Reciprocal Human-AI Collaboration in Assessment
Taiwo Feyijimi
|
Daniel O Oyeniran
|
Oukayode Apata
|
Henry Sanmi Makinde
|
Hope Oluwaseun Adegoke
|
John Ajamobe
|
Justice Dadzie
The proliferation of Generative Artificial Intelligence presents unprecedented opportunities and profound challenges for educational measurement. This study introduces the Augmented Measurement Framework grounded in four core principles. The paper discusses practical applications and implications for professional development and policy, and charts a research agenda for advancing this framework in educational measurement.
pdf
bib
abs
Patterns of Inquiry, Scaffolding, and Interaction Profiles in Learner-AI Collaborative Math Problem-Solving
Zilong Pan
|
Shen Ba
|
Zilu Jiang
|
Chenglu Li
This study investigates inquiry and scaffolding patterns between students and MathPal, a math AI agent, during problem-solving tasks. Using qualitative coding, lag sequential analysis, and Epistemic Network Analysis, the study identifies distinct interaction profiles, revealing how personalized AI feedback shapes student learning behaviors and inquiry dynamics in mathematics problem-solving activities.
pdf
bib
abs
Pre-trained Transformer Models for Standard-to-Standard Alignment Study
Hye-Jeong Choi
|
Reese Butterfuss
|
Meng Fan
The current study evaluated the accuracy of five pre-trained large language models (LLMs) in matching human judgment for a standard-to-standard alignment study. Results demonstrated comparable performance across LLMs despite differences in scale and computational demands. Additionally, incorporating domain labels as auxiliary information did not enhance LLM performance. These findings provide initial evidence for the viability of open-source LLMs to facilitate alignment studies and offer insights into the utility of auxiliary information.
pdf
bib
abs
From Entropy to Generalizability: Strengthening Automated Essay Scoring Reliability and Sustainability
Yi Gui
Generalizability Theory with entropy-derived stratification optimized automated essay scoring reliability. A G-study decomposed variance across 14 encoders and 3 seeds; D-studies identified minimal ensembles achieving G ≥ 0.85. A hybrid of one medium and one small encoder with two seeds maximized dependability per compute cost. Stratification ensured uniform precision across strata.
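For reference, the D-study criterion (G ≥ 0.85) is typically a generalizability coefficient of the following form for a person × encoder × seed design; the notation is generic and the paper's exact variance decomposition may differ:

    E\rho^2 = \frac{\sigma^2_{p}}{\sigma^2_{p} + \frac{\sigma^2_{pe}}{n_e} + \frac{\sigma^2_{ps}}{n_s} + \frac{\sigma^2_{pes,\,err}}{n_e n_s}},

where p indexes essays, e encoders, s seeds, and n_e and n_s are the numbers of encoders and seeds in the ensemble.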
pdf
bib
abs
Undergraduate Students’ Appraisals and Rationales of AI Fairness in Higher Education
Victoria Delaney
|
Sunday Stein
|
Lily Sawi
|
Katya Hernandez Holliday
To measure learning with AI, students must be afforded opportunities to use AI consistently across courses. Our interview study of 36 undergraduates revealed that students make independent appraisals of AI fairness amid school policies and use AI inconsistently on school assignments. We discuss tensions for measurement raised from students’ responses.
pdf
bib
abs
AI-Generated Formative Practice and Feedback: Performance Benchmarks and Applications in Higher Education
Rachel van Campenhout
|
Michelle Weaver Clark
|
Jeffrey S. Dittel
|
Bill Jerome
|
Nick Brown
|
Benny Johnson
Millions of AI-generated formative practice questions across thousands of publisher etextbooks are available for student use in higher education. We review the research to address both performance metrics for questions and feedback calculated from student data, and discuss the importance of successful applications in the classroom to maximize learning potential.
pdf
bib
abs
Beyond Agreement: Rethinking Ground Truth in Educational AI Annotation
Danielle R Thomas
|
Conrad Borchers
|
Ken Koedinger
Humans are biased, inconsistent, and yet we keep trusting them to define “ground truth.” This paper questions the overreliance on inter-rater reliability in educational AI and proposes a multidimensional approach leveraging expert-based approaches and close-the-loop validity to build annotations that reflect impact, not just agreement. It’s time we do better.
pdf
bib
abs
Automated search algorithm for optimal generalized linear mixed models (GLMMs)
Miryeong Koo
|
Jinming Zhang
Only a limited number of predictors can be included in a generalized linear mixed model (GLMM) due to estimation algorithm divergence. This study proposes a machine learning-based algorithm (e.g., random forest) that considers all predictors without convergence issues and automatically searches for the optimal GLMM.
pdf
bib
abs
Exploring the Psychometric Validity of AI-Generated Student Responses: A Study on Virtual Personas’ Learning Motivation
Huanxiao Wang
This study explores whether large language models (LLMs) can simulate valid student responses for educational measurement. Using GPT-4o, 2,000 virtual student personas were generated. Each persona completed the Academic Motivation Scale (AMS). Factor analyses (EFA and CFA) and clustering showed GPT-4o reproduced the AMS structure and distinct motivational subgroups.
pdf
bib
abs
Measuring Teaching with LLMs
Michael Hardy
This paper introduces custom Large Language Models using sentence-level embeddings to measure teaching quality. The models achieve human-level performance in analyzing classroom transcripts, outperforming average human rater correlation. Aggregate model scores align with student learning outcomes, establishing a powerful new methodology for scalable teacher feedback. Important limitations are discussed.
pdf
bib
abs
Simulating Rating Scale Responses with LLMs for Early-Stage Item Evaluation
Onur Demirkaya
|
Hsin-Ro Wei
|
Evelyn Johnson
This study explores the use of large language models to simulate human responses to Likert-scale items. A DeBERTa-base model fine-tuned with item text and examinee ability emulates a graded response model (GRM). High alignment with GRM probabilities and reasonable threshold recovery support LLMs as scalable tools for early-stage item evaluation.
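For reference, the graded response model (GRM) probabilities the fine-tuned model is trained to emulate take the standard form below (generic notation):

    P(X_{ij} \ge k \mid \theta_i) = \frac{1}{1 + \exp\left[-a_j(\theta_i - b_{jk})\right]}, \qquad
    P(X_{ij} = k \mid \theta_i) = P(X_{ij} \ge k \mid \theta_i) - P(X_{ij} \ge k+1 \mid \theta_i),

with \theta_i the examinee ability, a_j the item discrimination, and b_{jk} the ordered category thresholds.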
pdf
bib
abs
Bias and Reliability in AI Safety Assessment: Multi-Facet Rasch Analysis of Human Moderators
Chunling Niu
|
Kelly Bradley
|
Biao Ma
|
Brian Waltman
|
Loren Cossette
|
Rui Jin
Using Multi-Facet Rasch Modeling on 36,400 safety ratings of AI-generated conversations, we reveal significant racial disparities (Asian 39.1%, White 28.7% detection rates) and content-specific bias patterns. Simulations show that diverse teams of 8-10 members achieve 70%+ reliability versus 62% for smaller homogeneous teams, providing evidence-based guidelines for AI-generated content moderation.
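For context, a Many-Facet Rasch Model of this kind typically adds a rater-severity facet to the rating-scale formulation; one generic parameterization (not necessarily the paper's exact specification) is

    \log\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = \theta_n - \delta_i - \alpha_j - \tau_k,

where \theta_n is the latent measure of conversation n (the object of measurement), \delta_i the difficulty of content category i, \alpha_j the severity of moderator j, and \tau_k the threshold for rating category k.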
pdf
bib
abs
Dynamic Bayesian Item Response Model with Decomposition (D-BIRD): Modeling Cohort and Individual Learning Over Time
Hansol Lee
|
Jason B. Cho
|
David S. Matteson
|
Benjamin Domingue
We present D-BIRD, a Bayesian dynamic item response model for estimating student ability from sparse, longitudinal assessments. By decomposing ability into a cohort trend and individual trajectory, D-BIRD supports interpretable modeling of learning over time. We evaluate parameter recovery in simulation and demonstrate the model using real-world personalized learning data.
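A minimal sketch of the kind of decomposition described, in generic notation (the paper's parameterization may differ):

    \operatorname{logit} P(y_{ijt} = 1) = \theta_{it} - b_j, \qquad \theta_{it} = \mu_{c(i)}(t) + \delta_i(t),

where y_{ijt} is student i's response to item j at time t, \mu_{c(i)}(t) is the smooth trend for student i's cohort, and \delta_i(t) is the individual deviation trajectory.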
pdf
bib
abs
Enhancing Essay Scoring with GPT-2 Using Back Translation Techniques
Aysegul Gunduz
|
Mark Gierl
|
Okan Bulut
This study evaluates GPT-2 (small) for automated essay scoring on the ASAP dataset. Back-translation (English–Turkish–English) improved performance, especially on imbalanced sets. QWK scores peaked at 0.77. Findings highlight augmentation’s value and the need for more advanced, rubric-aware models for fairer assessment.
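For reference, the quadratic weighted kappa (QWK) agreement statistic reported above is typically computed as in this sketch (toy scores, not the study's data):

    from sklearn.metrics import cohen_kappa_score

    # Quadratically weighted agreement between human and model essay scores.
    human_scores = [2, 3, 4, 1, 3, 2, 4, 0]
    model_scores = [2, 3, 3, 1, 4, 2, 4, 1]
    qwk = cohen_kappa_score(human_scores, model_scores, weights="quadratic")
    print(f"QWK = {qwk:.2f}")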
pdf
bib
abs
Mathematical Computation and Reasoning Errors by Large Language Models
Liang Zhang
|
Edith Graf
We evaluate four LLMs (GPT-4o, o1, DeepSeek-V3, DeepSeek-R1) on purposely challenging arithmetic, algebra, and number-theory items. Coding the correctness of final answers and step-level solutions reveals performance gaps, improvement paths, and how accurate LLMs can strengthen mathematics assessment and instruction.
uppdf
bib
Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Works in Progress
Joshua Wilson
|
Christopher Ormerod
|
Magdalen Beiting Parrish
pdf
bib
abs
Automated Item Neutralization for Non-Cognitive Scales: A Large Language Model Approach to Reducing Social-Desirability Bias
Sirui Wu
|
Daijin Yang
This study explores an AI-assisted approach for rewriting personality scale items to reduce social desirability bias. Using GPT-refined neutralized items based on the IPIP-BFM-50, we compare factor structures, item popularity, and correlations with the MC-SDS to evaluate construct validity and the effectiveness of AI-based item refinement in Chinese contexts.
pdf
bib
abs
AI as a Mind Partner: Cognitive Impact in Pakistan’s Educational Landscape
Eman Khalid
|
Hammad Javaid
|
Yashal Waseem
|
Natasha Sohail Barlas
This study explores how high school and university students in Pakistan perceive and use generative AI as a cognitive extension. Drawing on the Extended Mind Theory, impact on critical thinking, and ethics are evaluated. Findings reveal over-reliance, mixed emotional responses, and institutional uncertainty about AI’s role in learning.
pdf
bib
abs
Detecting Math Misconceptions: An AI Benchmark Dataset
Bethany Rittle-Johnson
|
Rebecca Adler
|
Kelley Durkin
|
L Burleigh
|
Jules King
|
Scott Crossley
To harness the promise of AI for improving math education, AI models need to be able to diagnose math misconceptions. We created an AI benchmark dataset on math misconceptions and other instructionally-relevant errors, comprising over 52,000 explanations written in response to 15 math questions and scored by expert human raters.
pdf
bib
abs
Optimizing Opportunity: An AI-Driven Approach to Redistricting for Fairer School Funding
Jordan Abbott
We address national educational inequity driven by school district boundaries using a comparative AI framework. Our models, which redraw boundaries from scratch or consolidate existing districts, generate evidence-based plans that reduce funding and segregation disparities, offering policymakers scalable, data-driven solutions for systemic reform.
pdf
bib
abs
Automatic Grading of Student Work Using Simulated Rubric-Based Data and GenAI Models
Yiyao Yang
|
Yasemin Gulbahar
Grading assessment in data science faces challenges related to scalability, consistency, and fairness. Synthetic datasets and GenAI enable us to simulate realistic code samples and evaluate them automatically using rubric-driven systems. The research proposes an automatic grading system for generated Python code samples and explores GenAI grading reliability through human-AI comparison.
pdf
bib
abs
Cognitive Engagement in GenAI Tutor Conversations: At-scale Measurement and Impact on Learning
Kodi Weatherholtz
|
Kelli Millwood Hill
|
Kristen Dicerbo
|
Walt Wells
|
Phillip Grimaldi
|
Maya Miller-Vedam
|
Charles Hogg
|
Bogdan Yamkovenko
We developed and validated a scalable LLM-based labeler for classifying student cognitive engagement in GenAI tutoring conversations. Higher engagement levels predicted improved next-item performance, though further research is needed to assess distal transfer and to disentangle effects of continued tutor use from true learning transfer.
pdf
bib
abs
Chain-of-Thought Prompting for Automated Evaluation of Revision Patterns in Young Student Writing
Tianwen Li
|
Michelle Hong
|
Lindsay Clare Matsumura
|
Elaine Lin Wang
|
Diane Litman
|
Zhexiong Liu
|
Richard Correnti
This study explores the use of ChatGPT-4.1 as a formative assessment tool for identifying revision patterns in young adolescents’ argumentative writing. ChatGPT-4.1 shows moderate agreement with human coders on identifying evidence-related revision patterns and fair agreement on explanation-related ones. Implications for LLM-assisted formative assessment of young adolescent writing are discussed.
pdf
bib
abs
Predicting and Evaluating Item Responses Using Machine Learning, Text Embeddings, and LLMs
Evelyn Johnson
|
Hsin-Ro Wei
|
Tong Wu
|
Huan Liu
This work-in-progress study compares the accuracy of machine learning and large language models in predicting student responses to field-test items on a social-emotional learning assessment. We evaluate how well each method replicates actual responses and compare the item parameters generated from synthetic data to those derived from actual student data.
pdf
bib
abs
Evaluating LLM-Based Automated Essay Scoring: Accuracy, Fairness, and Validity
Yue Huang
|
Joshua Wilson
This study evaluates large language models (LLMs) for automated essay scoring (AES), comparing prompt strategies and fairness across student groups. We found that well-designed prompting helps LLMs approach traditional AES performance, but both differ from human scores for ELLs—the traditional model shows larger overall gaps, while LLMs show subtler disparities.
pdf
bib
abs
Comparing AI tools and Human Raters in Predicting Reading Item Difficulty
Hongli Li
|
Roula Aldib
|
Chad Marchong
|
Kevin Fan
This study compares AI tools and human raters in predicting the difficulty of reading comprehension items without response data. Predictions from AI models (ChatGPT, Gemini, Claude, and DeepSeek) and human raters are evaluated against empirical difficulty values derived from student responses. Findings will inform AI’s potential to support test development.
pdf
bib
abs
When Machines Mislead: Human Review of Erroneous AI Cheating Signals
William Belzak
|
Chenhao Niu
|
Angel Ortmann Lee
This study examines how human proctors interpret AI-generated alerts for misconduct in remote assessments. Findings suggest proctors can identify false positives, though confirmation bias and differences across test-taker nationalities were observed. Results highlight opportunities to refine proctoring guidelines and strengthen fairness in human oversight of automated signals in high-stakes testing.
pdf
bib
abs
Fairness in Formative AI: Cognitive Complexity in Chatbot Questions Across Research Topics
Alexandra Barry Colbert
|
Karen D Wang
This study evaluates whether questions generated by a Socratic-style research AI chatbot designed to support project-based AP courses maintain cognitive complexity parity when the chatbot is given research topics of a controversial versus non-controversial nature. We present empirical findings indicating no significant differences in conversational complexity, highlighting implications for equitable AI use in formative assessment.
pdf
bib
abs
Keystroke Analysis in Digital Test Security: AI Approaches for Copy-Typing Detection and Cheating Ring Identification
Chenhao Niu
|
Yong-Siang Shih
|
Manqian Liao
|
Ruidong Liu
|
Angel Ortmann Lee
This project leverages AI-based analysis of keystroke and mouse data to detect copy-typing and identify cheating rings in the Duolingo English Test. By modeling behavioral biometrics, the approach provides actionable signals to proctors, enhancing digital test security for large-scale online assessment.
pdf
bib
abs
Talking to Learn: A SoTL Study of Generative AI-Facilitated Feynman Reviews
Madeline Rose Mattox
|
Natalie Hutchins
|
Jamie J Jirout
Structured Generative AI interactions have potential for scaffolding learning. This Scholarship of Teaching and Learning study analyzes 16 undergraduate students’ Feynman-style AI interactions (N=157) across a semester-long child-development course. Qualitative coding of the interactions explores engagement patterns, metacognitive support, and response consistency, informing ethical AI integration in higher education.
pdf
bib
abs
AI-Powered Coding of Elementary Students’ Small-Group Discussions about Text
Carla Firetto
|
P. Karen Murphy
|
Lin Yan
|
Yue Tang
We report reliability and validity evidence for AI-powered coding of 371 small-group discussion transcripts. Evidence via comparability and ground-truth checks suggested high consistency between AI-produced and human-produced codes. Research in progress is also investigating the reliability and validity of a new “quality” indicator to complement the current coding.
pdf
bib
abs
Evaluating the Reliability of Human–AI Collaborative Scoring of Written Arguments Using Rational Force Model
Noriko Takahashi
|
Abraham Onuorah
|
Alina Reznitskaya
|
Evgeny Chukharev
|
Ariel Sykes
|
Michele Flammia
|
Joe Oyler
This study aims to improve the reliability of a new AI collaborative scoring system used to assess the quality of students’ written arguments. The system draws on the Rational Force Model and focuses on classifying the functional relation of each proposition in terms of support, opposition, acceptability, and relevance.
pdf
bib
abs
Evaluating Deep Learning and Transformer Models on SME and GenAI Items
Joe Betts
|
William Muntean
This study leverages deep learning, transformer models, and generative AI to streamline test development by automating metadata tagging and item generation. Transformer models outperform simpler approaches, reducing SME workload. Ongoing research refines complex models and evaluates LLM-generated items, enhancing efficiency in test creation.
pdf
bib
abs
Comparison of AI and Human Scoring on A Visual Arts Assessment
Ning Jiang
|
Yue Huang
|
Jie Chen
This study examines reliability and comparability of Generative AI scores versus human ratings on two performance tasks—text-based and drawing-based—in a fourth-grade visual arts assessment. Results show GPT-4 is consistent, aligned with humans but more lenient, and its agreement with humans is slightly lower than that between human raters.
pdf
bib
abs
Explainable Writing Scores via Fine-grained, LLM-Generated Features
James V Bruno
|
Lee Becker
Advancements in deep learning have enhanced Automated Essay Scoring (AES) accuracy but reduced interpretability. This paper investigates using LLM-generated features to train an explainable scoring model. By framing feature engineering as prompt engineering, state-of-the-art language technology can be integrated into simpler, more interpretable AES models.
pdf
bib
abs
Validating Generative AI Scoring of Constructed Responses with Cognitive Diagnosis
Hyunjoo Kim
This research explores the feasibility of applying the cognitive diagnosis assessment (CDA) framework to validate generative AI-based scoring of constructed responses (CRs). The classification information of CRs and item-parameter estimates from cognitive diagnosis models (CDMs) could provide additional validity evidence for AI-generated CR scores and feedback.
pdf
bib
abs
Automated Diagnosis of Students’ Number Line Strategies for Fractions
Zhizhi Wang
|
Dake Zhang
|
Min Li
|
Yuhan Tao
This study aims to develop and evaluate an AI-based platform that automatically grades and classifies problem-solving strategies and error types in students’ handwritten fraction representations involving number lines. Model development procedures and preliminary evaluation results, compared against available LLMs and human expert annotations, are reported.
pdf
bib
abs
Medical Item Difficulty Prediction Using Machine Learning
Hope Oluwaseun Adegoke
|
Ying Du
|
Andrew Dwyer
This project aims to use machine learning models to predict medical exam item difficulty by combining item metadata, linguistic features, word embeddings, and semantic similarity measures with a sample size of 1,000 items. The goal is to improve the accuracy of difficulty prediction in medical assessment.
pdf
bib
abs
Examining decoding items using engine transcriptions and scoring in early literacy assessment
Zachary Schultz
|
Mackenzie Young
|
Debbie Dugdale
|
Susan Lottridge
We investigate the reliability of two scoring approaches to early literacy decoding items, whereby students are shown a word and asked to say it aloud. The approaches were rubric-based scoring of speech and human or AI transcription combined with varying explicit scoring rules. Initial results suggest rubric-based approaches perform better than transcription-based methods.
pdf
bib
abs
Addressing Few-Shot LLM Classification Instability Through Explanation-Augmented Distillation
William Muntean
|
Joe Betts
This study compares explanation-augmented knowledge distillation with few-shot in-context learning for LLM-based exam question classification. Fine-tuned smaller language models achieved competitive performance with greater consistency than large-model few-shot approaches, which exhibited notable variability across different examples. Hyperparameter selection proved essential, with extremely low learning rates significantly impairing model performance.
pdf
bib
abs
Identifying Biases in Large Language Model Assessment of Linguistically Diverse Texts
Lionel Hsien Meng
|
Shamya Karumbaiah
|
Vivek Saravanan
|
Daniel Bolt
The development of Large Language Models (LLMs) to assess student text responses is rapidly progressing but evaluating whether LLMs equitably assess multilingual learner responses is an important precursor to adoption. Our study provides an example procedure for identifying and quantifying bias in LLM assessment of student essay responses.
pdf
bib
abs
Implicit Biases in Large Vision–Language Models in Classroom Contexts
Peter Baldwin
Using a counterfactual, adversarial, audit-style approach, we tested whether ChatGPT-4o evaluates classroom lectures differently based on teacher demographics. The model was told only to rate lecture excerpts embedded within classroom images—without reference to the images themselves. Despite this, ratings varied systematically by teacher race and sex, revealing implicit bias.
pdf
bib
abs
Enhancing Item Difficulty Prediction in Large-scale Assessment with Large Language Model
Mubarak Mojoyinola
|
Olasunkanmi James Kehinde
|
Judy Tang
Field testing is a resource-intensive bottleneck in test development. This study applied an interpretable framework that leverages a Large Language Model (LLM) for structured feature extraction from TIMSS items. These features will train several classifiers, whose predictions will be explained using SHAP, providing actionable, diagnostic insights for item writers.
pdf
bib
abs
Leveraging LLMs for Cognitive Skill Mapping in TIMSS Mathematics Assessment
Ruchi J Sachdeva
|
Jung Yeon Park
This study evaluates ChatGPT-4’s potential to support validation of Q-matrices and analysis of complex skill–item interactions. By comparing its outputs to expert benchmarks, we assess accuracy, consistency, and limitations, offering insights into how large language models can augment expert judgment in diagnostic assessment and cognitive skill mapping.
uppdf
bib
Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Coordinated Session Papers
Joshua Wilson
|
Christopher Ormerod
|
Magdalen Beiting Parrish
pdf
bib
abs
When Does Active Learning Actually Help? Empirical Insights with Transformer-based Automated Scoring
Justin O Barber
|
Michael P. Hemenway
|
Edward Wolfe
Developing automated essay scoring (AES) systems typically demands extensive human annotation, incurring significant costs and requiring considerable time. Active learning (AL) methods aim to alleviate this challenge by strategically selecting the most informative essays for scoring, thereby potentially reducing annotation requirements without compromising model accuracy. This study systematically evaluates four prominent AL strategies—uncertainty sampling, BatchBALD, BADGE, and a novel GenAI-based uncertainty approach—against a random sampling baseline, using DeBERTa-based regression models across multiple assessment prompts exhibiting varying degrees of human scorer agreement. Contrary to initial expectations, we found that AL methods provided modest but meaningful improvements only for prompts characterized by poor scorer reliability (<60% agreement per score point). Notably, extensive hyperparameter optimization alone substantially reduced the annotation budget required to achieve near-optimal scoring performance, even with random sampling. Our findings underscore that while targeted AL methods can be beneficial in contexts of low scorer reliability, rigorous hyperparameter tuning remains a foundational and highly effective strategy for minimizing annotation costs in AES system development.
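As an illustration of the uncertainty-sampling baseline evaluated above, the sketch below selects the pool essays whose predicted scores vary most across stochastic forward passes (e.g., MC dropout). The function name and `predict_fn` interface are assumptions for illustration, not the authors' implementation.

    import numpy as np

    def select_batch_by_uncertainty(predict_fn, pool_texts, batch_size=50, n_passes=10):
        """Score each unlabeled essay several times with a stochastic regression model
        and return the indices of the essays with the most variable predictions."""
        preds = np.stack([predict_fn(pool_texts) for _ in range(n_passes)])  # (passes, N)
        uncertainty = preds.std(axis=0)                                       # (N,)
        return np.argsort(-uncertainty)[:batch_size]  # candidates for human scoring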
pdf
bib
abs
Automated Essay Scoring Incorporating Annotations from Automated Feedback Systems
Christopher Ormerod
This study illustrates how incorporating feedback-oriented annotations into the scoring pipeline can enhance the accuracy of automated essay scoring (AES). This approach is demonstrated with the Persuasive Essays for Rating, Selecting, and Understanding Argumentative and Discourse Elements (PERSUADE) corpus. We integrate two types of feedback-driven annotations: those that identify spelling and grammatical errors, and those that highlight argumentative components. To illustrate how this method could be applied in real-world scenarios, we employ two LLMs to generate annotations – a generative language model used for spell correction and an encoder-based token-classifier trained to identify and mark argumentative elements. By incorporating annotations into the scoring process, we demonstrate improvements in performance using encoder-based large language models fine-tuned as classifiers.
pdf
bib
abs
Text-Based Approaches to Item Alignment to Content Standards in Large-Scale Reading & Writing Tests
Yanbin Fu
|
Hong Jiao
|
Tianyi Zhou
|
Nan Zhang
|
Ming Li
|
Qingshu Xu
|
Sydney Peters
|
Robert W Lissitz
Aligning test items to content standards is a critical step in test development to collect validity evidence based on content. Item alignment has typically been conducted by human experts, but this judgmental process can be subjective and time-consuming. This study investigated the performance of fine-tuned small language models (SLMs) for automated item alignment using data from a large-scale standardized reading and writing test for college admissions. Different SLMs were trained for both domain and skill alignment. The model performance was evaluated using precision, recall, accuracy, weighted F1 score, and Cohen’s kappa on two test sets. The impact of input data types and training sample sizes was also explored. Results showed that including more textual inputs led to better performance gains than increasing sample size. For comparison, classic supervised machine learning classifiers were trained on multilingual-E5 embeddings. Fine-tuned SLMs consistently outperformed these models, particularly for fine-grained skill alignment. To better understand model classifications, semantic similarity analyses including cosine similarity, Kullback-Leibler divergence of embedding distributions, and two-dimensional projections of item embeddings revealed that certain skills in the two test datasets were semantically too close, providing evidence for the observed misclassification patterns.
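The semantic similarity checks mentioned above can be sketched as follows; the code uses random embeddings for illustration and diagonal-Gaussian KL divergence as one reasonable choice, which may differ from the study's own computations.

    import numpy as np

    def cosine_similarity(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    def gaussian_kl(mu_p, var_p, mu_q, var_q):
        """KL(P || Q) between diagonal Gaussians fit to two sets of item embeddings."""
        return 0.5 * np.sum(np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)

    # emb_a, emb_b: (n_items, dim) embeddings of items tagged with two different skills.
    rng = np.random.default_rng(0)
    emb_a = rng.normal(size=(40, 8))
    emb_b = rng.normal(0.1, 1.0, size=(40, 8))
    print(cosine_similarity(emb_a.mean(0), emb_b.mean(0)))
    print(gaussian_kl(emb_a.mean(0), emb_a.var(0), emb_b.mean(0), emb_b.var(0)))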
pdf
bib
abs
Review of Text-Based Approaches to Item Difficulty Modeling in Large-Scale Assessments
Sydney Peters
|
Nan Zhang
|
Hong Jiao
|
Ming Li
|
Tianyi Zhou
Item difficulty plays a crucial role in evaluating item quality, test form assembly, and interpretation of scores in large-scale assessments. Traditional approaches to estimate item difficulty rely on item response data collected in field testing, which can be time-consuming and costly. To overcome these challenges, text-based approaches leveraging machine learning and natural language processing have emerged as promising alternatives. This paper reviews and synthesizes 37 articles on automated item difficulty prediction in large-scale assessments. Each study is synthesized in terms of the dataset, difficulty parameter, subject domain, item type, number of items, training and test data split, input, features, model, evaluation criteria, and model performance outcomes. Overall, text-based models achieved moderate to high predictive performance, highlighting the potential of text-based item difficulty modeling to enhance the current practices of item quality evaluation.
pdf
bib
abs
Item Difficulty Modeling Using Fine-Tuned Small and Large Language Models
Ming Li
|
Hong Jiao
|
Tianyi Zhou
|
Nan Zhang
|
Sydney Peters
|
Robert W Lissitz
This study investigates methods for item difficulty modeling in large-scale assessments using both small and large language models. We introduce novel data augmentation strategies, including on-the-fly augmentation and distribution balancing, that surpass benchmark performances, demonstrating their effectiveness in mitigating data imbalance and improving model performance. Our results showed that fine-tuned small language models such as BERT and RoBERTa yielded lower root mean squared error than the first-place winning model in the BEA 2024 Shared Task competition, whereas domain-specific models like BioClinicalBERT and PubMedBERT did not provide significant improvements due to distributional gaps. Majority voting among small language models enhanced prediction accuracy, reinforcing the benefits of ensemble learning. Large language models (LLMs), such as GPT-4, exhibited strong generalization capabilities but struggled with item difficulty prediction, likely due to limited training data and the absence of explicit difficulty-related context. Chain-of-thought prompting and rationale generation approaches were explored but did not yield substantial improvements, suggesting that additional training data or more sophisticated reasoning techniques may be necessary. Embedding-based methods, particularly using NV-Embed-v2, showed promise but did not outperform our best augmentation strategies, indicating that capturing nuanced difficulty-related features remains a challenge.
pdf
bib
abs
Operational Alignment of Confidence-Based Flagging Methods in Automated Scoring
Corey Palermo
|
Troy Chen
|
Arianto Wibowo
pdf
bib
abs
Pre-Pilot Optimization of Conversation-Based Assessment Items Using Synthetic Response Data
Tyler Burleigh
|
Jing Chen
|
Kristen Dicerbo
Correct answers to math problems don’t reveal if students understand concepts or just memorized procedures. Conversation-Based Assessment (CBA) addresses this through AI dialogue, but reliable scoring requires costly pilots and specialized expertise. Our Criteria Development Platform (CDP) enables pre-pilot optimization using synthetic data, reducing development from months to days. Testing 17 math items through 68 iterations, all achieved our reliability threshold (MCC ≥ 0.80) after refinement – up from 59% initially. Without refinement, 7 items would have remained below this threshold. By making reliability validation accessible, CDP empowers educators to develop assessments meeting automated scoring standards.
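For reference, the Matthews correlation coefficient (MCC) reliability threshold cited above can be computed as in this sketch (toy labels, not the study's data):

    from sklearn.metrics import matthews_corrcoef

    # Agreement between human and automated decisions for a single item.
    human_labels = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
    auto_labels  = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1]
    mcc = matthews_corrcoef(human_labels, auto_labels)
    print(mcc, mcc >= 0.80)  # compare against the reliability threshold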
pdf
bib
abs
When Humans Can’t Agree, Neither Can Machines: The Promise and Pitfalls of LLMs for Formative Literacy Assessment
Owen Henkel
|
Kirk Vanacore
|
Bill Roberts
Story retell assessments provide valuable insights into reading comprehension but face implementation barriers due to time-intensive administration and scoring. This study examines whether Large Language Models (LLMs) can reliably replicate human judgment in grading story retells. Using a novel dataset, we conduct three complementary studies examining LLM performance across different rubric systems, agreement patterns, and reasoning alignment. We find that LLMs (a) achieve near-human reliability with appropriate rubric design, (b) perform well on easy-to-grade cases but poorly on ambiguous ones, (c) produce explanations for their grades that are plausible for straightforward cases but unreliable for complex ones, and (d) different LLMs display consistent “grading personalities” (systematically scoring harder or easier across all student responses). These findings support hybrid assessment architectures where AI handles routine scoring, enabling more frequent formative assessment while directing teacher expertise toward students requiring nuanced support.
pdf
bib
abs
Beyond the Hint: Using Self-Critique to Constrain LLM Feedback in Conversation-Based Assessment
Tyler Burleigh
|
Jenny Han
|
Kristen Dicerbo
Large Language Models in Conversation-Based Assessment tend to provide inappropriate hints that compromise validity. We demonstrate that self-critique – a simple prompt engineering technique – effectively constrains this behavior. Through two studies using synthetic conversations and real-world high school math pilot data, self-critique reduced inappropriate hints by 90.7% and 24-75%, respectively. Human experts validated ground truth labels while LLM judges enabled scale. This immediately deployable solution addresses the critical tension in intermediate-stakes assessment: maintaining student engagement while ensuring fair comparisons. Our findings show prompt engineering can meaningfully safeguard assessment integrity without model fine-tuning.
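A minimal sketch of the self-critique pattern described above; the prompt text and function names are hypothetical illustrations, not the study's actual prompts.

    # Two-pass prompting: the tutor model drafts a reply, then a second pass critiques
    # the draft against a no-hint rule and rewrites it if needed.
    ASSESSMENT_RULES = "Do not reveal steps, formulas, or answers; ask probing questions only."

    def build_draft_prompt(conversation: str) -> str:
        return (
            "You are an assessment agent.\n"
            f"{ASSESSMENT_RULES}\n\nConversation:\n{conversation}\n\nReply:"
        )

    def build_self_critique_prompt(conversation: str, draft_reply: str) -> str:
        return (
            f"Rules: {ASSESSMENT_RULES}\n\n"
            f"Conversation:\n{conversation}\n\nDraft reply:\n{draft_reply}\n\n"
            "Does the draft give an inappropriate hint? If so, rewrite it to comply "
            "with the rules; otherwise return it unchanged."
        )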
pdf
bib
abs
Investigating Adversarial Robustness in LLM-based AES
Renjith Ravindran
|
Ikkyu Choi
Automated Essay Scoring (AES) is one of the most widely studied applications of Natural Language Processing (NLP) in education and educational measurement. Recent advances with pre-trained Transformer-based large language models (LLMs) have shifted AES from feature-based modeling to leveraging contextualized language representations. These models provide rich semantic representations that substantially improve scoring accuracy and human–machine agreement compared to systems relying on handcrafted features. However, their robustness towards adversarially crafted inputs remains poorly understood. In this study, we define adversarial input as any modification of the essay text designed to fool an automated scoring system into assigning an inflated score. We evaluate a fine-tuned DeBERTa-based AES model on such inputs and show that it is highly susceptible to a simple text duplication attack, highlighting the need to consider adversarial robustness alongside accuracy in the development of AES systems.
pdf
bib
abs
Effects of Generation Model on Detecting AI-generated Essays in a Writing Test
Jiyun Zu
|
Michael Fauss
|
Chen Li
Various detectors have been developed to detect AI-generated essays using labeled datasets of human-written and AI-generated essays, with many reporting high detection accuracy. In real-world settings, essays may be generated by models different from those used to train the detectors. This study examined the effects of generation model on detector performance. We focused on two generation models – GPT-3.5 and GPT-4 – and used writing items from a standardized English proficiency test. Eight detectors were built and evaluated. Six were trained on three training sets (human-written essays combined with either GPT-3.5-generated essays, or GPT-4-generated essays, or both) using two training approaches (feature-based machine learning and fine-tuning RoBERTa), and the remaining two were ensemble detectors. Results showed that a) fine-tuned detectors outperformed feature-based machine learning detectors on all studied metrics; b) detectors trained with essays generated from only one model were more likely to misclassify essays generated by the other model as human-written essays (false negatives), but did not misclassify more human-written essays as AI-generated (false positives); c) the ensemble fine-tuned RoBERTa detector had fewer false positives, but slightly more false negatives than detectors trained with essays generated by both models.
pdf
bib
abs
Exploring the Interpretability of AI-Generated Response Detection with Probing
Ikkyu Choi
|
Jiyun Zu
Multiple strategies for AI-generated response detection have been proposed, with many high-performing ones built on language models. However, the decision-making processes of these detectors remain largely opaque. We addressed this knowledge gap by fine-tuning a language model for the detection task and applying probing techniques using adversarial examples. Our adversarial probing analysis revealed that the fine-tuned model relied heavily on a narrow set of lexical cues in making the classification decision. These findings underscore the importance of interpretability in AI-generated response detectors and highlight the value of adversarial probing as a tool for exploring model interpretability.
pdf
bib
abs
A Fairness-Promoting Detection Objective With Applications in AI-Assisted Test Security
Michael Fauss
|
Ikkyu Choi
A detection objective based on bounded group-wise false alarm rates is proposed to promote fairness in the context of test fraud detection. The paper begins by outlining key aspects and characteristics that distinguish fairness in test security from fairness in other domains and machine learning in general. The proposed detection objective is then introduced, the corresponding optimal detection policy is derived, and the implications of the results are examined in light of the earlier discussion. A numerical example using synthetic data illustrates the proposed detector and compares its properties to those of a standard likelihood ratio test.
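One way to write the kind of objective described above, in generic notation (not necessarily the paper's exact formulation): maximize detection power subject to a bound on each group's false alarm rate,

    \max_{\delta}\; P_{\mathrm{D}}(\delta) \quad \text{s.t.} \quad P_{\mathrm{FA}}^{(g)}(\delta) \le \alpha_g \;\; \text{for all groups } g,

which, under standard assumptions, is solved by group-wise likelihood ratio tests with group-specific thresholds, \delta(x) = \mathbb{1}\{L(x) \ge \eta_{g(x)}\}.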
pdf
bib
abs
The Impact of an NLP-Based Writing Tool on Student Writing
Karthik Sairam
|
Amy Burkhardt
|
Susan Lottridge
We present preliminary evidence on the impact of a NLP-based writing feedback tool, Write-On with Cambi! on students’ argumentative writing. Students were randomly assigned to receive access to the tool or not, and their essay scores were compared across three rubric dimensions; estimated effect sizes (Cohen’s d) ranged from 0.25 to 0.26 (with notable variation in the average treatment effect across classrooms). To characterize and compare the groups’ writing processes, we implemented an algorithm that classified each revision as Appended (new text added to the end), Surface-level (minor within-text corrections to conventions), or Substantive (larger within-text changes or additions). We interpret within-text edits (Surface-level or Substantive) as potential markers of metacognitive engagement in revision, and note that these within-text edits are more common in students who had access to the tool. Together, these pilot analyses serve as a first step in testing the tool’s theory of action.