pdf
bib
Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Full Papers
Joshua Wilson
|
Christopher Ormerod
|
Magdalen Beiting Parrish
pdf
bib
abs
Input Optimization for Automated Scoring in Reading Assessment
Ji Yoon Jung
|
Ummugul Bezirhan
|
Matthias von Davier
This study examines input optimization for enhanced efficiency in automated scoring (AS) of reading assessments, which typically involve lengthy passages and complex scoring guides. We propose optimizing input size using question-specific summaries and simplified scoring guides. Findings indicate that input optimization via compression is achievable while maintaining AS performance.
pdf
bib
abs
Implementation Considerations for Automated AI Grading of Student Work
Zewei Tian
|
Alex Liu
|
Lief Esbenshade
|
Shawon Sarkar
|
Zachary Zhang
|
Kevin He
|
Min Sun
19 K-12 teachers participated in a co-design pilot study of an AI education platform, testing assessment grading. Teachers valued AI’s rapid narrative feedback for formative assessment but distrusted automated scoring, preferring human oversight. Students appreciated immediate feedback but remained skeptical of AI-only grading, highlighting needs for trustworthy, teacher-centered AI tools.
pdf
bib
abs
Compare Several Supervised Machine Learning Methods in Detecting Aberrant Response Pattern
Yi Lu
|
Yu Zhang
|
Lorin Mueller
Aberrant response patterns (e.g., a test taker answers difficult questions correctly but is unable to answer easy questions correctly) are first identified using lz and lz*. We then compared the performance of five supervised machine learning methods in detecting aberrant response patterns identified by lz or lz*.
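For readers unfamiliar with the person-fit statistics named above, a standard formulation of lz (Drasgow, Levine, & Williams, 1985) is sketched below; lz* is its correction for estimated ability (Snijders, 2001). The paper's exact operationalization may differ.

```latex
% Standardized log-likelihood person-fit statistic l_z for response vector u
% under an IRT model with item response probabilities P_i(theta).
\[
l_0 = \sum_{i=1}^{n}\Bigl[u_i \ln P_i(\theta) + (1-u_i)\ln\bigl(1-P_i(\theta)\bigr)\Bigr],
\qquad
l_z = \frac{l_0 - \mathrm{E}(l_0)}{\sqrt{\mathrm{Var}(l_0)}},
\]
\[
\mathrm{E}(l_0) = \sum_{i=1}^{n}\Bigl[P_i \ln P_i + (1-P_i)\ln(1-P_i)\Bigr],
\qquad
\mathrm{Var}(l_0) = \sum_{i=1}^{n} P_i(1-P_i)\left[\ln\frac{P_i}{1-P_i}\right]^{2}.
\]
```

Large negative lz values flag response vectors that are unlikely under the fitted model, such as the misfit pattern described above.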
pdf
bib
abs
Leveraging multi-AI agents for a teacher co-design
Hongwen Guo
|
Matthew S. Johnson
|
Luis Saldivia
|
Michelle Worthington
|
Kadriye Ercikan
This study uses multi-AI agents to accelerate teacher co-design efforts. It innovatively links student profiles derived from numerical assessment data to AI agents through natural language. The AI agents simulate human inquiry, enrich feedback, and ground it in teachers’ knowledge and practice, showing significant potential for transforming assessment practice and research.
pdf
bib
abs
Long context Automated Essay Scoring with Language Models
Christopher Ormerod
|
Gitit Kehat
In this study, we evaluate several models that incorporate architectural modifications to overcome the length limitations of the standard transformer architecture using the Kaggle ASAP 2.0 dataset. The models considered in this study include fine-tuned versions of XLNet, Longformer, ModernBERT, Mamba, and Llama models.
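As a rough illustration of the kind of setup such a comparison involves (not the authors' code; the checkpoint and hyperparameters are assumptions), a long-context essay scorer can be sketched with Hugging Face transformers:

```python
# Minimal sketch: score long essays with a Longformer regression head.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "allenai/longformer-base-4096"  # accepts inputs up to 4096 tokens

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=1)  # single-output regression head

def score_essay(text: str) -> float:
    """Return a raw (unscaled) score prediction for one essay."""
    inputs = tokenizer(text, truncation=True, max_length=4096, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.item()

# Fine-tuning would minimize MSE between predictions and human scores,
# e.g., with the transformers Trainer and problem_type="regression".
```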
pdf
bib
abs
Optimizing Reliability Scoring for ILSAs
Ji Yoon Jung
|
Ummugul Bezirhan
|
Matthias von Davier
This study proposes an innovative method for evaluating cross-country scoring reliability (CCSR) in multilingual assessments, using hyperparameter optimization and a similarity-based weighted majority scoring within a single human scoring framework. Results show that this approach provides a cost-effective and comprehensive assessment of CCSR without the need for additional raters.
pdf
bib
abs
Exploring AI-Enabled Test Practice, Affect, and Test Outcomes in Language Assessment
Jill Burstein
|
Ramsey Cardwell
|
Ping-Lin Chuang
|
Allison Michalowski
|
Steven Nydick
We analyzed data from 25,969 test takers of a high-stakes, computer-adaptive English proficiency test to examine relationships between repeated use of AI-generated practice tests and performance, affect, and score-sharing behavior. Taking 1–3 practice tests was associated with higher scores and confidence, while higher usage showed different engagement and outcome patterns.
pdf
bib
abs
Develop a Generic Essay Scorer for Practice Writing Tests of Statewide Assessments
Yi Gui
This study examines whether NLP transfer learning techniques, specifically BERT, can be used to develop prompt-generic AES models for practice writing tests. Findings reveal that fine-tuned DistilBERT, without further pre-training, achieves high agreement (QWK ≈ 0.89), enabling scalable, robust AES models in statewide K-12 assessments without costly supplementary pre-training.
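Quadratic weighted kappa (QWK), the agreement statistic reported here, can be computed directly with scikit-learn; a minimal sketch with toy scores for illustration only:

```python
# Quadratic weighted kappa (QWK) between human and model essay scores.
from sklearn.metrics import cohen_kappa_score

human_scores = [2, 3, 4, 1, 3, 2, 4, 4]  # toy data
model_scores = [2, 3, 3, 1, 4, 2, 4, 3]

qwk = cohen_kappa_score(human_scores, model_scores, weights="quadratic")
print(f"QWK = {qwk:.2f}")
```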
pdf
bib
abs
Towards assessing persistence in reading in young learners using pedagogical agents
Caitlin Tenison
|
Beata Beigman Klebanov
|
Noah Schroeder
|
Shan Zhang
|
Michael Suhan
|
Chuyang Zhang
This pilot study investigated the use of a pedagogical agent to administer a conversational survey to second graders following a digital reading activity, measuring comprehension, persistence, and enjoyment. Analysis of survey responses and behavioral log data provides evidence for recommendations for the design of agent-mediated assessment in early literacy.
pdf
bib
abs
LLM-Based Approaches for Detecting Gaming the System in Self-Explanation
Jiayi (Joyce) Zhang
|
Ryan S. Baker
|
Bruce M. McLaren
This study compares two LLM-based approaches for detecting gaming behavior in students’ open-ended responses within a math digital learning game. The sentence embedding method outperformed the prompt-based approach and was more conservative. Consistent with prior research, gaming correlated negatively with learning, highlighting LLMs’ potential to detect disengagement in open-ended tasks.
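A minimal sketch of what a sentence-embedding classifier for such open-ended responses can look like (encoder and classifier choices are assumptions, not the authors' implementation):

```python
# Sketch: classify open-ended responses as gaming-like vs. genuine
# using sentence embeddings plus a simple linear classifier.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("all-MiniLM-L6-v2")

train_texts = ["idk", "I multiplied both sides by 2 to isolate x", "asdf asdf"]
train_labels = [1, 0, 1]  # 1 = gaming-like response, 0 = genuine self-explanation

X_train = encoder.encode(train_texts)
clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)

new_responses = ["I added 3 to each side", "whatever lol"]
print(clf.predict(encoder.encode(new_responses)))
```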
pdf
bib
abs
Evaluating the Impact of LLM-guided Reflection on Learning Outcomes with Interactive AI-Generated Educational Podcasts
Vishnu Menon
|
Andy Cherney
|
Elizabeth B. Cloude
|
Li Zhang
|
Tiffany Diem Do
This study examined whether embedding LLM-guided reflection prompts in an interactive AI-generated podcast improved learning and user experience compared to a version without prompts. Thirty-six undergraduates participated, and while learning outcomes were similar across conditions, reflection prompts reduced perceived attractiveness, highlighting the need for more research on reflective interactivity design.
pdf
bib
abs
Generative AI in the K–12 Formative Assessment Process: Enhancing Feedback in the Classroom
Mike Thomas Maksimchuk
|
Edward Roeber
|
Davie Store
This paper explores how generative AI can enhance K–12 formative assessment by improving feedback, supporting task design, fostering student metacognition, and building teacher assessment literacy. It addresses challenges of equity, ethics, and implementation, offering practical strategies and case studies to guide responsible AI integration in classroom formative assessment practices.
pdf
bib
abs
Using Large Language Models to Analyze Students’ Collaborative Argumentation in Classroom Discussions
Nhat Tran
|
Diane Litman
|
Amanda Godley
Collaborative argumentation enables students to build disciplinary knowledge and to think in disciplinary ways. We use Large Language Models (LLMs) to improve existing methods for collaboration classification and argument identification. Results suggest that LLMs are effective for both tasks and should be considered as a strong baseline for future research.
pdf
bib
abs
Evaluating Generative AI as a Mentor Resource: Bias and Implementation Challenges
Jimin Lee
|
Alena G Esposito
We explored how students’ perceptions of helpfulness and caring skew their ability to identify AI versus human mentorship responses. Emotionally resonant responses often lead to misattributions, indicating perceptual biases that shape mentorship judgments. The findings inform ethical, relational, and effective integration of AI in student support.
pdf
bib
abs
AI-Based Classification of TIMSS Items for Framework Alignment
Ummugul Bezirhan
|
Matthias von Davier
Large-scale assessments rely on expert panels to verify that test items align with prescribed frameworks, a labor-intensive process. This study evaluates the use of GPT-4o to classify TIMSS items to content domain, cognitive domain, and difficulty categories. Findings highlight the potential of language models to support scalable, framework-aligned item verification.
pdf
bib
abs
Towards Reliable Generation of Clinical Chart Items: A Counterfactual Reasoning Approach with Large Language Models
Jiaxuan Li
|
Saed Rezayi
|
Peter Baldwin
|
Polina Harik
|
Victoria Yaneva
This study explores GPT-4 for generating clinical chart items in medical education using three prompting strategies. Expert evaluations found many items usable or promising. The counterfactual approach enhanced novelty, and item quality improved with high-surprisal examples. This is the first investigation of LLMs for automated clinical chart item generation.
pdf
bib
abs
Using Whisper Embeddings for Audio-Only Latent Token Classification of Classroom Management Practices
Wesley Griffith Morris
|
Jessica Vitale
|
Isabel Arvelo
In this study, we developed a textless NLP system using a fine-tuned Whisper encoder to identify classroom management practices from noisy classroom recordings. The model segments teacher speech from non-teacher speech and performs multi-label classification of classroom practices, achieving acceptable accuracy without requiring transcript generation.
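A hedged sketch of the general pattern (Whisper encoder features feeding a multi-label classification head, with no transcription step); the checkpoint, pooling strategy, and label count are assumptions, not the authors' system:

```python
# Sketch: audio-only classification of classroom practices from Whisper
# encoder embeddings (no transcript is generated).
import torch
import torch.nn as nn
from transformers import WhisperFeatureExtractor, WhisperModel

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")
whisper = WhisperModel.from_pretrained("openai/whisper-base")

NUM_PRACTICES = 5  # hypothetical number of classroom-practice labels
head = nn.Linear(whisper.config.d_model, NUM_PRACTICES)

def classify_segment(waveform, sampling_rate=16000):
    """Return per-label probabilities for one audio segment (1-D float array)."""
    inputs = feature_extractor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        # Mean-pool the encoder hidden states into a single segment embedding.
        hidden = whisper.encoder(inputs.input_features).last_hidden_state.mean(dim=1)
    return torch.sigmoid(head(hidden))
```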
pdf
bib
abs
Comparative Study of Double Scoring Design for Measuring Mathematical Quality of Instruction
Jonathan Kyle Foster
|
James Drimalla
|
Nursultan Japashov
This study focuses on integrating automated scoring into classroom observation systems and on whether it can meet their extensive need for double scoring. We outline an accessible approach for determining the interchangeability of automated systems within comparative scoring design studies.
pdf
bib
abs
Toward Automated Evaluation of AI-Generated Item Drafts in Clinical Assessment
Tazin Afrin
|
Le An Ha
|
Victoria Yaneva
|
Keelan Evanini
|
Steven Go
|
Kristine DeRuchie
|
Michael Heilig
This study examines the classification of AI-generated clinical multiple-choice question drafts as “helpful” or “non-helpful” starting points. Expert judgments were analyzed, and multiple classifiers were evaluated, including feature-based models, fine-tuned transformers, and few-shot prompting with GPT-4. Our findings highlight the challenges and considerations for evaluation methods of AI-generated items in clinical test development.
pdf
bib
abs
Numeric Information in Elementary School Texts Generated by LLMs vs Human Experts
Anastasia Smirnova
|
Erin S. Lee
|
Shiying Li
We analyze GPT-4o’s ability to represent numeric information in texts for elementary school children and assess it with respect to the human baseline. We show that both humans and GPT-4o reduce the amount of numeric information when adapting informational texts for children but GPT-4o retains more complex numeric types than humans do.
pdf
bib
abs
Towards evaluating teacher discourse without task-specific fine-tuning data
Beata Beigman Klebanov
|
Michael Suhan
|
Jamie N. Mikeska
Teaching simulations with feedback are one way to provide teachers with practice opportunities to help improve their skill. We investigated methods to build evaluation models of teacher performance in leading a discussion in a simulated classroom, particularly for tasks with little performance data.
pdf
bib
abs
Linguistic proficiency of humans and LLMs in Japanese: Effects of task demands and content
May Lynn Reese
|
Anastasia Smirnova
We evaluate the linguistic proficiency of humans and LLMs on pronoun resolution in Japanese, using the Winograd Schema Challenge dataset. Humans outperform LLMs in the baseline condition, but we find evidence for task demand effects in both humans and LLMs. We also find that LLMs surpass human performance in scenarios referencing US culture, providing strong evidence for content effects.
pdf
bib
abs
Generative AI Teaching Simulations as Formative Assessment Tools within Preservice Teacher Preparation
Jamie N. Mikeska
|
Aakanksha Bhatia
|
Shreyashi Halder
|
Tricia Maxwell
|
Beata Beigman Klebanov
|
Benny Longwill
|
Kashish Behl
|
Calli Shekell
This paper examines how generative AI (GenAI) teaching simulations can be used as a formative assessment tool to gain insight into elementary preservice teachers’ (PSTs’) instructional abilities. This study investigated the teaching moves PSTs used to elicit student thinking in a GenAI simulation and their perceptions of the simulation’s usefulness.
pdf
bib
abs
Using LLMs to identify features of personal and professional skills in an open-response situational judgment test
Cole Walsh
|
Rodica Ivan
|
Muhammad Zafar Iqbal
|
Colleen Robb
Current methods for assessing personal and professional skills lack scalability due to reliance on human raters, while NLP-based systems for assessing these skills fail to demonstrate construct validity. This study introduces a new method utilizing LLMs to extract construct-relevant features from responses to an assessment of personal and professional skills.
pdf
bib
abs
Automated Evaluation of Standardized Patients with LLMs
Andrew Emerson
|
Le An Ha
|
Keelan Evanini
|
Su Somay
|
Kevin Frome
|
Polina Harik
|
Victoria Yaneva
Standardized patients (SPs) are essential for clinical reasoning assessments in medical education. This paper introduces evaluation metrics that apply to both human and simulated SP systems. The metrics are computed using two LLM-as-a-judge approaches that align with human evaluators on SP performance, enabling scalable formative clinical reasoning assessments.
pdf
bib
abs
LLM-Human Alignment in Evaluating Teacher Questioning Practices: Beyond Ratings to Explanation
Ruikun Hou
|
Tim Fütterer
|
Babette Bühler
|
Patrick Schreyer
|
Peter Gerjets
|
Ulrich Trautwein
|
Enkelejda Kasneci
This study investigates the alignment between large language models (LLMs) and human raters in assessing teacher questioning practices, moving beyond rating agreement to the evidence selected to justify their decisions. Findings highlight LLMs’ potential to support large-scale classroom observation through interpretable, evidence-based scoring, with possible implications for concrete teacher feedback.
pdf
bib
abs
Leveraging Fine-tuned Large Language Models in Item Parameter Prediction
Suhwa Han
|
Frank Rijmen
|
Allison Ames Boykin
|
Susan Lottridge
The study introduces novel approaches for fine-tuning pre-trained LLMs to predict item response theory parameters directly from item texts and structured item attribute variables. The proposed methods were evaluated on a dataset of over 1,000 English Language Arts items currently in the operational pool for a large-scale assessment.
pdf
bib
abs
How Model Size, Temperature, and Prompt Style Affect LLM-Human Assessment Score Alignment
Julie Jung
|
Max Lu
|
Sina Chole Benker
|
Dogus Darici
We examined how model size, temperature, and prompt style affect Large Language Models’ (LLMs) alignment with human raters in assessing clinical reasoning skills. Model size emerged as a key factor in LLM-human score alignment. Findings reveal both the potential for scalable LLM-raters and the risks of relying on them exclusively.
pdf
bib
abs
Assessing AI skills: A washback point of view
Meirav Arieli-Attali
|
Beata Beigman Klebanov
|
Tenaha O’Reilly
|
Diego Zapata-Rivera
|
Tami Sabag-Shushan
|
Iman Awadie
The emerging dominance of AI in the perception of skills-of-the-future makes assessing AI skills necessary to help guide learning. Creating an assessment of AI skills poses new challenges. We examine these challenges from the point of view of washback and exemplify them using two exploratory studies conducted with 9th-grade students.
pdf
bib
abs
Using Generative AI to Develop a Common Metric in Item Response Theory
Peter Baldwin
We propose a method for linking independently calibrated item response theory (IRT) scales using large language models to generate shared parameter estimates across forms. Applied to medical licensure data, the approach reliably recovers slope values across all conditions and yields accurate intercepts when cross-form differences in item difficulty are small.
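For context, one standard way to place two independently calibrated IRT scales on a common metric is the mean/sigma transformation below; the paper's contribution is using LLM-generated shared parameter estimates to make such linking possible, and its specific linking method may differ.

```latex
% Mean/sigma linking of scale X onto scale Y using shared item difficulty
% estimates b (one standard method, not necessarily the paper's).
\[
A = \frac{\sigma(b_Y)}{\sigma(b_X)}, \qquad B = \mu(b_Y) - A\,\mu(b_X),
\]
\[
\theta_Y = A\,\theta_X + B, \qquad a_Y = \frac{a_X}{A}, \qquad b_Y = A\,b_X + B .
\]
```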
pdf
bib
abs
Augmented Measurement Framework for Dynamic Validity and Reciprocal Human-AI Collaboration in Assessment
Taiwo Feyijimi
|
Daniel O Oyeniran
|
Oukayode Apata
|
Henry Sanmi Makinde
|
Hope Oluwaseun Adegoke
|
John Ajamobe
|
Justice Dadzie
The proliferation of Generative Artificial Intelligence presents unprecedented opportunities and profound challenges for educational measurement. This study introduces the Augmented Measurement Framework, grounded in four core principles. The paper discusses practical applications and implications for professional development and policy, and charts a research agenda for advancing this framework in educational measurement.
pdf
bib
abs
Patterns of Inquiry, Scaffolding, and Interaction Profiles in Learner-AI Collaborative Math Problem-Solving
Zilong Pan
|
Shen Ba
|
Zilu Jiang
|
Chenglu Li
This study investigates inquiry and scaffolding patterns between students and MathPal, a math AI agent, during problem-solving tasks. Using qualitative coding, lag sequential analysis, and Epistemic Network Analysis, the study identifies distinct interaction profiles, revealing how personalized AI feedback shapes student learning behaviors and inquiry dynamics in mathematics problem-solving activities.
pdf
bib
abs
Pre-trained Transformer Models for Standard-to-Standard Alignment Study
Hye-Jeong Choi
|
Reese Butterfuss
|
Meng Fan
The current study evaluated the accuracy of five pre-trained large language models (LLMs) in matching human judgment for a standard-to-standard alignment study. Results demonstrated comparable performance across LLMs despite differences in scale and computational demands. Additionally, incorporating domain labels as auxiliary information did not enhance LLM performance. These findings provide initial evidence for the viability of open-source LLMs to facilitate alignment studies and offer insights into the utility of auxiliary information.
pdf
bib
abs
From Entropy to Generalizability: Strengthening Automated Essay Scoring Reliability and Sustainability
Yi Gui
Generalizability Theory with entropy-derived stratification was used to optimize automated essay scoring reliability. A G-study decomposed variance across 14 encoders and 3 seeds; D-studies identified minimal ensembles achieving G ≥ 0.85. A hybrid of one medium and one small encoder with two seeds maximized dependability per compute cost. Stratification ensured uniform precision across strata.
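For reference, the coefficient optimized in such D-studies takes the standard two-facet crossed form below, written here for essays (p) crossed with encoders (e) and seeds (s); the facet labels are assumptions based on the abstract.

```latex
% Relative generalizability coefficient for a crossed p x e x s D-study
% with n_e encoders and n_s seeds.
\[
E\rho^2 \;=\;
\frac{\sigma^2_{p}}
     {\sigma^2_{p}
      + \dfrac{\sigma^2_{pe}}{n_e}
      + \dfrac{\sigma^2_{ps}}{n_s}
      + \dfrac{\sigma^2_{pes,\,\mathrm{error}}}{n_e\,n_s}} .
\]
```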
pdf
bib
abs
Undergraduate Students’ Appraisals and Rationales of AI Fairness in Higher Education
Victoria Delaney
|
Sunday Stein
|
Lily Sawi
|
Katya Hernandez Holliday
To measure learning with AI, students must be afforded opportunities to use AI consistently across courses. Our interview study of 36 undergraduates revealed that students make independent appraisals of AI fairness amid school policies and use AI inconsistently on school assignments. We discuss tensions for measurement raised from students’ responses.
pdf
bib
abs
AI-Generated Formative Practice and Feedback: Performance Benchmarks and Applications in Higher Education
Rachel van Campenhout
|
Michelle Weaver Clark
|
Jeffrey S. Dittel
|
Bill Jerome
|
Nick Brown
|
Benny Johnson
Millions of AI-generated formative practice questions across thousands of publisher e-textbooks are available for student use in higher education. We review research addressing performance metrics for both questions and feedback, calculated from student data, and discuss the importance of successful classroom applications to maximize learning potential.
pdf
bib
abs
Beyond Agreement: Rethinking Ground Truth in Educational AI Annotation
Danielle R Thomas
|
Conrad Borchers
|
Ken Koedinger
Humans are biased, inconsistent, and yet we keep trusting them to define “ground truth.” This paper questions the overreliance on inter-rater reliability in educational AI and proposes a multidimensional approach leveraging expert-based approaches and close-the-loop validity to build annotations that reflect impact, not just agreement. It’s time we do better.
pdf
bib
abs
Automated search algorithm for optimal generalized linear mixed models (GLMMs)
Miryeong Koo
|
Jinming Zhang
Only a limited number of predictors can be included in a generalized linear mixed model (GLMM) because the estimation algorithm may diverge. This study proposes a machine learning-based algorithm (e.g., random forest) that can consider all predictors without convergence issues and automatically searches for the optimal GLMMs.
pdf
bib
abs
Exploring the Psychometric Validity of AI-Generated Student Responses: A Study on Virtual Personas’ Learning Motivation
Huanxiao Wang
This study explores whether large language models (LLMs) can simulate valid student responses for educational measurement. Using GPT-4o, 2,000 virtual student personas were generated. Each persona completed the Academic Motivation Scale (AMS). Factor analyses (EFA and CFA) and clustering showed GPT-4o reproduced the AMS structure and distinct motivational subgroups.
pdf
bib
abs
Measuring Teaching with LLMs
Michael Hardy
This paper introduces custom Large Language Models using sentence-level embeddings to measure teaching quality. The models achieve human-level performance in analyzing classroom transcripts, outperforming the average human rater correlation. Aggregate model scores align with student learning outcomes, establishing a powerful new methodology for scalable teacher feedback. Important limitations are discussed.
pdf
bib
abs
Simulating Rating Scale Responses with LLMs for Early-Stage Item Evaluation
Onur Demirkaya
|
Hsin-Ro Wei
|
Evelyn Johnson
This study explores the use of large language models to simulate human responses to Likert-scale items. A DeBERTa-base model fine-tuned with item text and examinee ability emulates a graded response model (GRM). High alignment with GRM probabilities and reasonable threshold recovery support LLMs as scalable tools for early-stage item evaluation.
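The graded response model (GRM) that the fine-tuned model emulates specifies category probabilities from cumulative logistic functions:

```latex
% Samejima's graded response model: boundary (cumulative) probabilities and
% the resulting category probabilities for item i with discrimination a_i
% and ordered thresholds b_{ik}, categories k = 0, ..., K-1.
\[
P^{*}_{ik}(\theta) \equiv P(X_i \ge k \mid \theta)
  = \frac{1}{1 + \exp\!\bigl[-a_i(\theta - b_{ik})\bigr]}, \quad k = 1,\dots,K-1,
\]
\[
P(X_i = k \mid \theta) = P^{*}_{ik}(\theta) - P^{*}_{i(k+1)}(\theta),
\qquad P^{*}_{i0}(\theta) = 1,\; P^{*}_{iK}(\theta) = 0 .
\]
```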
pdf
bib
abs
Bias and Reliability in AI Safety Assessment: Multi-Facet Rasch Analysis of Human Moderators
Chunling Niu
|
Kelly Bradley
|
Biao Ma
|
Brian Waltman
|
Loren Cossette
|
Rui Jin
Using Multi-Facet Rasch Modeling on 36,400 safety ratings of AI-generated conversations, we reveal significant racial disparities (Asian 39.1%, White 28.7% detection rates) and content-specific bias patterns. Simulations show that diverse teams of 8-10 members achieve 70%+ reliability versus 62% for smaller homogeneous teams, providing evidence-based guidelines for AI-generated content moderation.
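For reference, the Many-Facet Rasch Model commonly used in such analyses has the adjacent-category form below; the specific facets (content type, moderator) are assumptions based on the abstract.

```latex
% Many-facet Rasch model (Linacre) in adjacent-category log-odds form:
% theta_n = safety level of conversation n, delta_i = difficulty of content
% type i, alpha_j = severity of moderator j, tau_k = category-k threshold.
\[
\ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right)
= \theta_n - \delta_i - \alpha_j - \tau_k .
\]
```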
pdf
bib
abs
Dynamic Bayesian Item Response Model with Decomposition (D-BIRD): Modeling Cohort and Individual Learning Over Time
Hansol Lee
|
Jason B. Cho
|
David S. Matteson
|
Benjamin Domingue
We present D-BIRD, a Bayesian dynamic item response model for estimating student ability from sparse, longitudinal assessments. By decomposing ability into a cohort trend and individual trajectory, D-BIRD supports interpretable modeling of learning over time. We evaluate parameter recovery in simulation and demonstrate the model using real-world personalized learning data.
pdf
bib
abs
Enhancing Essay Scoring with GPT-2 Using Back Translation Techniques
Aysegul Gunduz
|
Mark Gierl
|
Okan Bulut
This study evaluates GPT-2 (small) for automated essay scoring on the ASAP dataset. Back-translation (English–Turkish–English) improved performance, especially on imbalanced sets. QWK scores peaked at 0.77. Findings highlight augmentation’s value and the need for more advanced, rubric-aware models for fairer assessment.
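A minimal sketch of the back-translation augmentation step described here, using the Hugging Face translation pipeline; the checkpoint names are assumptions, so substitute any available English-Turkish and Turkish-English models:

```python
# Back-translation augmentation sketch (English -> Turkish -> English).
# Checkpoint names are illustrative assumptions.
from transformers import pipeline

en_to_tr = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-big-en-tr")
tr_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-tr-en")

def back_translate(essay: str) -> str:
    """Paraphrase an essay by round-tripping it through Turkish."""
    turkish = en_to_tr(essay, max_length=512)[0]["translation_text"]
    return tr_to_en(turkish, max_length=512)[0]["translation_text"]

original = "The experiment shows that plants grow faster with more sunlight."
augmented = back_translate(original)  # paraphrased copy added to the training set
```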
pdf
bib
abs
Mathematical Computation and Reasoning Errors by Large Language Models
Liang Zhang
|
Edith Graf
We evaluate four LLMs (GPT-4o, o1, DeepSeek-V3, DeepSeek-R1) on purposely challenging arithmetic, algebra, and number-theory items. Coding the correctness of final answers and step-level solutions reveals performance gaps, improvement paths, and how accurate LLMs can strengthen mathematics assessment and instruction.