Beata Beigman Klebanov

Also published as: Beata Beigman Klebanov

2026

Data-lean fine-tuning of models for evaluating teacher performance in a GenAI-led elicitation simulation
Beata Beigman Klebanov | Andrew Hoang | Jamie Mikeska | Benny Longwill | Sanjna Kashyap | Shreyashi Halder | Aakanksha Bhatia
Proceedings of the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026)

Recent advances in the capabilities of conversational agents based on large language models make them a very promising tool for role playing K-12 students in order to train educators in conversational teaching practices, such as eliciting student thinking, explaining disciplinary content, and facilitating a classroom discussion. In fact, such simulations can and have been developed relatively quickly and without data to machine-learn from – neither classroom data nor human-simulated data. To enhance the usefulness and effectiveness of such teaching simulations, it is necessary to provide pedagogically sound, timely, and personalized feedback to the educator about their simulation performance. In this study, we present experiments on fine-tuning models to evaluate educator performance in an elicitation teaching simulation. The models are developed with data collected during usability testing of the simulation and evaluated on real user data. We show that even with relatively little fine-tuning data, robust performance can be obtained

pdf bib abs

Quality-Conditioned Agreement in Automated Short Answer Scoring: Mid-Range Degradation and the Impact of Task-Specific Adaptation
Abigail Gurin Schleifer | Moriah Ariely | Beata Beigman Klebanov | Asaf Salman | Giora Alexandron
Proceedings of the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026)

Automated short answer scoring (ASAS) is shifting from discriminative, fine-tuned models to large language models (LLMs) used in few-shot settings. This paradigm leverages LLMs’ broad world knowledge and ease of deployment, but limited task-specific data may reduce alignment on complex scoring tasks. In particular, its impact on scoring partially correct responses that require nuanced interpretation remains underexplored.We investigate the relationship between the degree of task-specific adaptation of different models and quality-conditioned scoring agreement. We compare three LLMs (GPT-5.2, GPT-4o, Claude Opus 4.5) in few-shot mode, a fine-tuned BERT-based encoder, and a human expert on two open-ended biology items, using several hundred student responses and ground truth scores provided by a biology education expert.The results show that human–human agreement is highest and stable across the full quality spectrum. All AI models perform well on fully correct and fully incorrect responses, but exhibit substantial degradation on mid-range responses. This mid-range degradation is conditioned on task-specific adaptation: It is most severe in few-shot LLMs with few examples and decreases as task-specific data increases, with fine-tuned encoder models performing best.This mid-range degradation may lead to inequitable evaluation of responses produced by students with developing understanding. Our findings highlight the importance of quality-conditioned fairness, with particular attention to mid-range responses.

2025

pdf bib abs

Exploring task formulation strategies to evaluate the coherence of classroom discussions with GPT-4o
Yuya Asano | Beata Beigman Klebanov | Jamie Mikeska
Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)

Engaging students in a coherent classroom discussion is one aspect of high-quality instruction and is an important skill that requires practice to acquire. With the goal of providing teachers with formative feedback on their classroom discussions, we investigate automated means for evaluating teachers’ ability to lead coherent discussions in simulated classrooms. While prior work has shown the effectiveness of large language models (LLMs) in assessing the coherence of relatively short texts, it has also found that LLMs struggle when assessing instructional quality. We evaluate the generalizability of task formulation strategies for assessing the coherence of classroom discussions across different subject domains using GPT-4o and discuss how these formulations address the previously reported challenges—the overestimation of instructional quality and the inability to extract relevant parts of discussions. Finally, we report lack of generalizability across domains and the misalignment with humans in the use of evidence from discussions as remaining challenges.

pdf bib abs

Assessing AI skills: A washback point of view
Meirav Arieli-Attali | Beata Beigman Klebanov | Tenaha O’Reilly | Diego Zapata-Rivera | Tami Sabag-Shushan | Iman Awadie
Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Full Papers

The emerging dominance of AI in the perception of skills-of-the-future makes assessing AI skills necessary to help guide learning. Creating an assessment of AI skills poses some new challenges. We examine those from the point of view of washback, and exemplify using two exploration studies conducted with 9th grade students.

pdf bib abs

Generative AI Teaching Simulations as Formative Assessment Tools within Preservice Teacher Preparation
Jamie N. Mikeska | Aakanksha Bhatia | Shreyashi Halder | Tricia Maxwell | Beata Beigman Klebanov | Benny Longwill | Kashish Behl | Calli Shekell
Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Full Papers

This paper examines how generative AI (GenAI) teaching simulations can be used as a formative assessment tool to gain insight into elementary preservice teachers’ (PSTs’) instructional abilities. This study investigated the teaching moves PSTs used to elicit student thinking in a GenAI simulation and their perceptions of the simulation’s

pdf bib abs

Towards evaluating teacher discourse without task-specific fine-tuning data
Beata Beigman Klebanov | Michael Suhan | Jamie N. Mikeska
Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Full Papers

Teaching simulations with feedback are one way to provide teachers with practice opportunities to help improve their skill. We investigated methods to build evaluation models of teacher performance in leading a discussion in a simulated classroom, particularly for tasks with little performance data.

2024

pdf bib abs

From Miscue to Evidence of Difficulty: Analysis of Automatically Detected Miscues in Oral Reading for Feedback Potential
Beata Beigman Klebanov | Michael Suhan | Tenaha O’Reilly | Zuowei Wang
Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)

This research is situated in the space between an existing NLP capability and its use(s) in an educational context. We analyze oral reading data collected with a deployed automated speech analysis software and consider how the results of automated speech analysis can be interpreted and used to inform the ideation and design of a new feature – feedback to learners and teachers. Our analysis shows how the details of the system’s performance and the details of the context of use both significantly impact the ideation process.

pdf bib abs

Anna Karenina Strikes Again: Pre-Trained LLM Embeddings May Favor High-Performing Learners
Abigail Gurin Schleifer | Beata Beigman Klebanov | Moriah Ariely | Giora Alexandron
Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)

Unsupervised clustering of student responses to open-ended questions into behavioral and cognitive profiles using pre-trained LLM embeddings is an emerging technique, but little is known about how well this captures pedagogically meaningful information. We investigate this in the context of student responses to open-ended questions in biology, which were previously analyzed and clustered by experts into theory-driven Knowledge Profiles (KPs).Comparing these KPs to ones discovered by purely data-driven clustering techniques, we report poor discoverability of most KPs, except for the ones including the correct answers. We trace this ‘discoverability bias’ to the representations of KPs in the pre-trained LLM embeddings space.

pdf bib abs

Automated Evaluation of Teacher Encouragement of Student-to-Student Interactions in a Simulated Classroom Discussion
Michael Ilagan | Beata Beigman Klebanov | Jamie Mikeska
Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)

Leading students to engage in argumentation-focused discussions is a challenge for elementary school teachers, as doing so requires facilitating group discussions with student-to-student interaction. The Mystery Powder (MP) Task was designed to be used in online simulated classrooms to develop teachers’ skill in facilitating small group science discussions. In order to provide timely and scaleable feedback to teachers facilitating a discussion in the simulated classroom, we employ a hybrid modeling approach that successfully combines fine-tuned large language models with features capturing important elements of the discourse dynamic to evaluate MP discussion transcripts. To our knowledge, this is the first application of a hybrid model to automate evaluation of teacher discourse.

2023

pdf bib abs

A dynamic model of lexical experience for tracking of oral reading fluency
Beata Beigman Klebanov | Michael Suhan | Zuowei Wang | Tenaha O’reilly
Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023)

We present research aimed at solving a problem in assessment of oral reading fluency using children’s oral reading data from our online book reading app. It is known that properties of the passage being read aloud impact fluency estimates; therefore, passage-based measures are used to remove passage-related variance when estimating growth in oral reading fluency. However, passage-based measures reported in the literature tend to treat passages as independent events, without explicitly modeling accumulation of lexical experience as one reads through a book. We propose such a model and show that it helps explain additional variance in the measurements of children’s fluency as they read through a book, improving over a strong baseline. These results have implications for measuring growth in oral reading fluency.

pdf bib abs

Transformer-based Hebrew NLP models for Short Answer Scoring in Biology
Abigail Gurin Schleifer | Beata Beigman Klebanov | Moriah Ariely | Giora Alexandron
Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023)

Pre-trained large language models (PLMs) are adaptable to a wide range of downstream tasks by fine-tuning their rich contextual embeddings to the task, often without requiring much task-specific data. In this paper, we explore the use of a recently developed Hebrew PLM aleph-BERT for automated short answer grading of high school biology items. We show that the alephBERT-based system outperforms a strong CNN-based baseline, and that it general-izes unexpectedly well in a zero-shot paradigm to items on an unseen topic that address the same underlying biological concepts, opening up the possibility of automatically assessing new items without item-specific fine-tuning.