Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Full Papers

Joshua Wilson, Christopher Ormerod, Magdalen Beiting Parrish (Editors)

Anthology ID:: 2025.aimecon-main
Month:: October
Year:: 2025
Address:: Wyndham Grand Pittsburgh, Downtown, Pittsburgh, Pennsylvania, United States
Venue:: AIME-Con
SIG:
Publisher:: National Council on Measurement in Education (NCME)
URL:: https://aclanthology.org/2025.aimecon-main/
DOI:
ISBN:: 979-8-218-84228-4
Bib Export formats:: BibTeX MODS XML EndNote
PDF:: https://aclanthology.org/2025.aimecon-main.pdf

Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Full Papers
Joshua Wilson | Christopher Ormerod | Magdalen Beiting Parrish

pdf bib abs

Input Optimization for Automated Scoring in Reading Assessment
Ji Yoon Jung | Ummugul Bezirhan | Matthias von Davier

This study examines input optimization for enhanced efficiency in automated scoring (AS) of reading assessments, which typically involve lengthy passages and complex scoring guides. We propose optimizing input size using question-specific summaries and simplified scoring guides. Findings indicate that input optimization via compression is achievable while maintaining AS performance.

pdf bib abs

19 K-12 teachers participated in a co-design pilot study of an AI education platform, testing assessment grading. Teachers valued AI’s rapid narrative feedback for formative assessment but distrusted automated scoring, preferring human oversight. Students appreciated immediate feedback but remained skeptical of AI-only grading, highlighting needs for trustworthy, teacher-centered AI tools.

pdf bib abs

Compare Several Supervised Machine Learning Methods in Detecting Aberrant Response Pattern
Yi Lu | Yu Zhang | Lorin Mueller

An aberrant response pattern, e.g., a test taker is able to answer difficult questions correctly, but is unable to answer easy questions correctly, are first identified lz and lz*. We then compared the performance of five supervised machine learning methods in detecting aberrant response pattern identified by lz or lz*.

pdf bib abs

Leveraging multi-AI agents for a teacher co-design
Hongwen Guo | Matthew S. Johnson | Luis Saldivia | Michelle Worthington | Kadriye Ercikan

This study uses multi-AI agents to accelerate teacher co-design efforts. It innovatively links student profiles obtained from numerical assessment data to AI agents in natural languages. The AI agents simulate human inquiry, enrich feedback and ground it in teachers’ knowledge and practice, showing significant potential for transforming assessment practice and research.

pdf bib abs

Long context Automated Essay Scoring with Language Models
Christopher Ormerod | Gitit Kehat

In this study, we evaluate several models that incorporate architectural modifications to overcome the length limitations of the standard transformer architecture using the Kaggle ASAP 2.0 dataset. The models considered in this study include fine-tuned versions of XLNet, Longformer, ModernBERT, Mamba, and Llama models.

pdf bib abs

Optimizing Reliability Scoring for ILSAs
Ji Yoon Jung | Ummugul Bezirhan | Matthias von Davier

This study proposes an innovative method for evaluating cross-country scoring reliability (CCSR) in multilingual assessments, using hyperparameter optimization and a similarity-based weighted majority scoring within a single human scoring framework. Results show that this approach provides a cost-effective and comprehensive assessment of CCSR without the need for additional raters.

pdf bib abs

Exploring AI-Enabled Test Practice, Affect, and Test Outcomes in Language Assessment
Jill Burstein | Ramsey Cardwell | Ping-Lin Chuang | Allison Michalowski | Steven Nydick

We analyzed data from 25,969 test takers of a high-stakes, computer-adaptive English proficiency test to examine relationships between repeated use of AI-generated practice tests and performance, affect, and score-sharing behavior. Taking 1–3 practice tests was associated with higher scores and confidence, while higher usage showed different engagement and outcome

pdf bib abs

Develop a Generic Essay Scorer for Practice Writing Tests of Statewide Assessments
Yi Gui

This study examines whether NLP transfer learning techniques, specifically BERT, can be used to develop prompt-generic AES models for practice writing tests. Findings reveal that fine-tuned DistilBERT, without further pre-training, achieves high agreement (QWK ≈ 0.89), enabling scalable, robust AES models in statewide K-12 assessments without costly supplementary pre-training.

pdf bib abs

This pilot study investigated the use of a pedagogical agent to administer a conversational survey to second graders following a digital reading activity, measuring comprehension, persistence, and enjoyment. Analysis of survey responses and behavioral log data provide evidence for recommendations for the design of agent-mediated assessment in early literacy.

pdf bib abs

LLM-Based Approaches for Detecting Gaming the System in Self-Explanation
Jiayi (Joyce) Zhang | Ryan S. Baker | Bruce M. McLaren

This study compares two LLM-based approaches for detecting gaming behavior in students’ open-ended responses within a math digital learning game. The sentence embedding method outperformed the prompt-based approach and was more conservative. Consistent with prior research, gaming correlated negatively with learning, highlighting LLMs’ potential to detect disengagement in open-ended tasks.

pdf bib abs

Evaluating the Impact of LLM-guided Reflection on Learning Outcomes with Interactive AI-Generated Educational Podcasts
Vishnu Menon | Andy Cherney | Elizabeth B. Cloude | Li Zhang | Tiffany D. Do

This study examined whether embedding LLM-guided reflection prompts in an interactive AI-generated podcast improved learning and user experience compared to a version without prompts. Thirty-six undergraduates participated, and while learning outcomes were similar across conditions, reflection prompts reduced perceived attractiveness, highlighting a call for more research on reflective interactivity design.

pdf bib abs

Generative AI in the K–12 Formative Assessment Process: Enhancing Feedback in the Classroom
Michael Maksimchuk | Edward Roeber | Davie Store

This paper explores how generative AI can enhance formative assessment practices in K–12 education. It examines emerging tools, ethical considerations, and practical applications to support student learning, while emphasizing the continued importance of teacher judgment and balanced assessment systems.

pdf bib abs

Using Large Language Models to Analyze Students’ Collaborative Argumentation in Classroom Discussions
Nhat Tran | Diane Litman | Amanda Godley

Collaborative argumentation enables students to build disciplinary knowledge and to think in disciplinary ways. We use Large Language Models (LLMs) to improve existing methods for collaboration classification and argument identification. Results suggest that LLMs are effective for both tasks and should be considered as a strong baseline for future research.

pdf bib abs

Evaluating Generative AI as a Mentor Resource: Bias and Implementation Challenges
Jimin Lee | Alena G Esposito

We explored how students’ perceptions of helpfulness and caring skew their ability to identify AI versus human mentorship responses. Emotionally resonant responses often lead to misattributions, indicating perceptual biases that shape mentorship judgments. The findings inform ethical, relational, and effective integration of AI in student support.

pdf bib abs

AI-Based Classification of TIMSS Items for Framework Alignment
Ummugul Bezirhan | Matthias von Davier

Large-scale assessments rely on expert panels to verify that test items align with prescribed frameworks, a labor-intensive process. This study evaluates the use of GPT-4o to classify TIMSS items to content domain, cognitive domain, and difficulty categories. Findings highlight the potential of language models to support scalable, framework-aligned item verification.

pdf bib abs

Towards Reliable Generation of Clinical Chart Items: A Counterfactual Reasoning Approach with Large Language Models
Jiaxuan Li | Saed Rezayi | Peter Baldwin | Polina Harik | Victoria Yaneva

This study explores GPT-4 for generating clinical chart items in medical education using three prompting strategies. Expert evaluations found many items usable or promising. The counterfactual approach enhanced novelty, and item quality improved with high-surprisal examples. This is the first investigation of LLMs for automated clinical chart item generation.

pdf bib abs

Using Whisper Embeddings for Audio-Only Latent Token Classification of Classroom Management Practices
Wesley Griffith Morris | Jessica Vitale | Isabel Arvelo

In this study, we developed a textless NLP system using a fine-tuned Whisper encoder to identify classroom management practices from noisy classroom recordings. The model segments teacher speech from non-teacher speech and performs multi-label classification of classroom practices, achieving acceptable accuracy without requiring transcript generation.

pdf bib abs

Comparative Study of Double Scoring Design for Measuring Mathematical Quality of Instruction
Jonathan Kyle Foster | James Drimalla | Nursultan Japashov

The integration of automated scoring and addressing whether it might meet the extensive need for double scoring in classroom observation systems is the focus of this study. We outline an accessible approach for determining the interchangeability of automated systems within comparative scoring design studies.

pdf bib abs

This study examines the classification of AI-generated clinical multiple-choice questions drafts as “helpful” or “non-helpful” starting points. Expert judgments were analyzed, and multiple classifiers were evaluated—including feature-based models, fine-tuned transformers, and few-shot prompting with GPT-4. Our findings highlight the challenges and considerations for evaluation methods of AI-generated items in clinical test development.

pdf bib abs

Numeric Information in Elementary School Texts Generated by LLMs vs Human Experts
Anastasia Smirnova | Erin S. Lee | Shiying Li

We analyze GPT-4o’s ability to represent numeric information in texts for elementary school children and assess it with respect to the human baseline. We show that both humans and GPT-4o reduce the amount of numeric information when adapting informational texts for children but GPT-4o retains more complex numeric types than humans do.

pdf bib abs

Towards evaluating teacher discourse without task-specific fine-tuning data
Beata Beigman Klebanov | Michael Suhan | Jamie N. Mikeska

Teaching simulations with feedback are one way to provide teachers with practice opportunities to help improve their skill. We investigated methods to build evaluation models of teacher performance in leading a discussion in a simulated classroom, particularly for tasks with little performance data.

pdf bib abs

Linguistic proficiency of humans and LLMs in Japanese: Effects of task demands and content
May Lynn Reese | Anastasia Smirnova

We evaluate linguistic proficiency of humans and LLMs on pronoun resolution in Japanese, using the Winograd Schema Challenge dataset. Humans outperform LLMs in the baseline condition, but we find evidence for task demand effectss in both humans and LLMs. We also found that LLMs surpass human performance in scenarios referencing US culture, providing strong evidence for content effects.

pdf bib abs

This paper examines how generative AI (GenAI) teaching simulations can be used as a formative assessment tool to gain insight into elementary preservice teachers’ (PSTs’) instructional abilities. This study investigated the teaching moves PSTs used to elicit student thinking in a GenAI simulation and their perceptions of the simulation’s

pdf bib abs

Using LLMs to identify features of personal and professional skills in an open-response situational judgment test
Cole Walsh | Rodica Ivan | Muhammad Zafar Iqbal | Colleen Robb

Current methods for assessing personal and professional skills lack scalability due to reliance on human raters, while NLP-based systems for assessing these skills fail to demonstrate construct validity. This study introduces a new method utilizing LLMs to extract construct-relevant features from responses to an assessment of personal and professional skills.

pdf bib abs

Standardized patients (SPs) are essential for clinical reasoning assessments in medical education. This paper introduces evaluation metrics that apply to both human and simulated SP systems. The metrics are computed using two LLM-as-a-judge approaches that align with human evaluators on SP performance, enabling scalable formative clinical reasoning assessments.

pdf bib abs

This study investigates the alignment between large language models (LLMs) and human raters in assessing teacher questioning practices, moving beyond rating agreement to the evidence selected to justify their decisions. Findings highlight LLMs’ potential to support large-scale classroom observation through interpretable, evidence-based scoring, with possible implications for concrete teacher feedback.

pdf bib abs

Leveraging Fine-tuned Large Language Models in Item Parameter Prediction
Suhwa Han | Frank Rijmen | Allison Ames Boykin | Susan Lottridge

The study introduces novel approaches for fine-tuning pre-trained LLMs to predict item response theory parameters directly from item texts and structured item attribute variables. The proposed methods were evaluated on a dataset over 1,000 English Language Art items that are currently in the operational pool for a large scale assessment.

pdf bib abs

How Model Size, Temperature, and Prompt Style Affect LLM-Human Assessment Score Alignment
Julie Jung | Max Lu | Sina Chole Benker | Dogus Darici

We examined how model size, temperature, and prompt style affect Large Language Models’ (LLMs) alignment with human raters in assessing clinical reasoning skills. Model size emerged as a key factor in LLM-human score alignment. Findings reveal both the potential for scalable LLM-raters and the risks of relying on them exclusively.

pdf bib abs

The emerging dominance of AI in the perception of skills-of-the-future makes assessing AI skills necessary to help guide learning. Creating an assessment of AI skills poses some new challenges. We examine those from the point of view of washback, and exemplify using two exploration studies conducted with 9th grade students.

pdf bib abs

Using Generative AI to Develop a Common Metric in Item Response Theory
Peter Baldwin

We propose a method for linking independently calibrated item response theory (IRT) scales using large language models to generate shared parameter estimates across forms. Applied to medical licensure data, the approach reliably recovers slope values across all conditions and yields accurate intercepts when cross-form differences in item difficulty are small.

pdf bib abs

The proliferation of Generative Artificial Intelligence presents unprecedented opportunities and profound challenges for educational measurement. This study introduces the Augmented Measurement Framework grounded in four core principles. The paper discussed practical applications, implications for professional development and policy, and charts a research agenda for advancing this framework in educational measurement.

pdf bib abs

Patterns of Inquiry, Scaffolding, and Interaction Profiles in Learner-AI Collaborative Math Problem-Solving
Zilong Pan | Shen Ba | Zilu Jiang | Chenglu Li

This study investigates inquiry and scaffolding patterns between students and MathPal, a math AI agent, during problem-solving tasks. Using qualitative coding, lag sequential analysis, and Epistemic Network Analysis, the study identifies distinct interaction profiles, revealing how personalized AI feedback shapes student learning behaviors and inquiry dynamics in mathematics problem-solving activities.

pdf bib abs

Pre-trained Transformer Models for Standard-to-Standard Alignment Study
Hye-Jeong Choi | Reese Butterfuss | Meng Fan

The current study evaluated the accuracy of five pre-trained large language models (LLMs) in matching human judgment for standard-to-standard alignment study. Results demonstrated comparable performance LLMs across despite differences in scale and computational demands. Additionally, incorporating domain labels as auxiliary information did not enhance LLMs performance. These findings provide initial evidence for the viability of open-source LLMs to facilitate alignment study and offer insights into the utility of auxiliary information.

pdf bib abs

From Entropy to Generalizability: Strengthening Automated Essay Scoring Reliability and Sustainability
Yi Gui

Generalizability Theory with entropy-derived stratification optimized automated essay scoring reliability. A G-study decomposed variance across 14 encoders and 3 seeds; D-studies identified minimal ensembles achieving G ≥ 0.85. A hybrid of one medium and one small encoder with two seeds maximized dependability per compute cost. Stratification ensured uniform precision across

pdf bib abs

Undergraduate Students’ Appraisals and Rationales of AI Fairness in Higher Education
Victoria Delaney | Sunday Stein | Lily Sawi | Katya Hernandez Holliday

To measure learning with AI, students must be afforded opportunities to use AI consistently across courses. Our interview study of 36 undergraduates revealed that students make independent appraisals of AI fairness amid school policies and use AI inconsistently on school assignments. We discuss tensions for measurement raised from students’ responses.

pdf bib abs

Millions of AI-generated formative practice questions across thousands of publisher etextbooks are available for student use in higher education. We review the research to address both performance metrics for questions and feedback calculated from student data, and discuss the importance of successful applications in the classroom to maximize learning potential.

pdf bib abs

Beyond Agreement: Rethinking Ground Truth in Educational AI Annotation
Danielle R Thomas | Conrad Borchers | Ken Koedinger

Humans are biased, inconsistent, and yet we keep trusting them to define “ground truth.” This paper questions the overreliance on inter-rater reliability in educational AI and proposes a multidimensional approach leveraging expert-based approaches and close-the-loop validity to build annotations that reflect impact, not just agreement. It’s time we do better.

pdf bib abs

Automated search algorithm for optimal generalized linear mixed models (GLMMs)
Miryeong Koo | Jinming Zhang

Only a limited number of predictors can be included in a generalized linear mixed model (GLMM) due to estimation algorithm divergence. This study aims to propose a machine learning based algorithm (e.g., random forest) that can consider all predictors without the convergence issue and automatically searches for the optimal GLMMs.

pdf bib abs

Exploring the Psychometric Validity of AI-Generated Student Responses: A Study on Virtual Personas’ Learning Motivation
Huanxiao Wang

This study explores whether large language models (LLMs) can simulate valid student responses for educational measurement. Using GPT-4o, 2000 virtual student personas were generated. Each persona completed the Academic Motivation Scale (AMS). Factor analyses(EFA and CFA) and clustering showed GPT-4o reproduced the AMS structure and distinct motivational subgroups.

pdf bib abs

Measuring Teaching with LLMs
Michael Hardy

This paper introduces custom Large Language Models using sentence-level embeddings to measure teaching quality. The models achieve human-level performance in analyzing classroom transcripts, outperforming average human rater correlation. Aggregate model scores align with student learning outcomes, establishing a powerful new methodology for scalable teacher feedback. Important limitations discussed.

pdf bib abs

Simulating Rating Scale Responses with LLMs for Early-Stage Item Evaluation
Onur Demirkaya | Hsin-Ro Wei | Evelyn Johnson

This study explores the use of large language models to simulate human responses to Likert-scale items. A DeBERTa-base model fine-tuned with item text and examinee ability emulates a graded response model (GRM). High alignment with GRM probabilities and reasonable threshold recovery support LLMs as scalable tools for early-stage item evaluation.

pdf bib abs

Using Multi-Facet Rasch Modeling on 36,400 safety ratings of AI-generated conversations, we reveal significant racial disparities (Asian 39.1%, White 28.7% detection rates) and content-specific bias patterns. Simulations show that diverse teams of 8-10 members achieve 70%+ reliability versus 62% for smaller homogeneous teams, providing evidence-based guidelines for AI-generated content moderation.

pdf bib abs

Dynamic Bayesian Item Response Model with Decomposition (D-BIRD): Modeling Cohort and Individual Learning Over Time
Hansol Lee | Jason B. Cho | David S. Matteson | Benjamin Domingue

We present D-BIRD, a Bayesian dynamic item response model for estimating student ability from sparse, longitudinal assessments. By decomposing ability into a cohort trend and individual trajectory, D-BIRD supports interpretable modeling of learning over time. We evaluate parameter recovery in simulation and demonstrate the model using real-world personalized learning data.

pdf bib abs

Enhancing Essay Scoring with GPT-2 Using Back Translation Techniques
Aysegul Gunduz | Mark Gierl | Okan Bulut

This study evaluates GPT-2 (small) for automated essay scoring on the ASAP dataset. Back-translation (English–Turkish–English) improved performance, especially on imbalanced sets. QWK scores peaked at 0.77. Findings highlight augmentation’s value and the need for more advanced, rubric-aware models for fairer assessment.

pdf bib abs

Mathematical Computation and Reasoning Errors by Large Language Models
Liang Zhang | Edith Graf

We evaluate four LLMs (GPT-4o, o1, DeepSeek-V3, DeepSeek-R1) on purposely challenging arithmetic, algebra, and number-theory items. Coding final answers and step-level solutions correctness reveals performance gaps, improvement paths, and how accurate LLMs can strengthen mathematics assessment and instruction.