Scott Crossley

2025

Detecting Math Misconceptions: An AI Benchmark Dataset
Bethany Rittle-Johnson | Rebecca Adler | Kelley Durkin | L Burleigh | Jules King | Scott Crossley
Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Works in Progress

To harness the promise of AI for improving math education, AI models need to be able to diagnose math misconceptions. We created an AI benchmark dataset on math misconceptions and other instructionally-relevant errors, comprising over 52,000 explanations written over 15 math questions that were scored by expert human raters.

2024

A World CLASSE Student Summary Corpus
Scott Crossley | Perpetual Baffour | Mihai Dascalu | Stefan Ruseti
Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)

This paper introduces the Common Lit Augmented Student Summary Evaluation (CLASSE) corpus. The corpus comprises 11,213 summaries written over six prompts by students in grades 3-12 while using the CommonLit website. Each summary was scored by expert human raters on analytic features related to main points, details, organization, voice, paraphrasing, and language beyond the source text. The human scores were aggregated into two component scores related to content and wording. The final corpus was the focus of a Kaggle competition hosted in late 2022 and completed in 2023 in which over 2,000 teams participated. The paper includes a baseline scoring model for the corpus based on a Large Language Model (Longformer model). The paper also provides an overview of the winning models from the Kaggle competition.

2023

Analyzing Bias in Large Language Model Solutions for Assisted Writing Feedback Tools: Lessons from the Feedback Prize Competition Series
Perpetual Baffour | Tor Saxberg | Scott Crossley
Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023)

This paper analyzes winning solutions from the Feedback Prize competition series hosted from 2021 to 2022. The competition sought to improve Assisted Writing Feedback Tools (AWFTs) by crowdsourcing Large Language Model (LLM) solutions for evaluating student writing. The winning models are freely available for incorporation into educational applications, but the models need to be assessed for performance and other factors. This study reports the performance accuracy of Feedback Prize-winning models based on demographic factors such as student race/ethnicity, economic disadvantage, and English Language Learner status. Two competitions are analyzed. The first, which focused on identifying discourse elements, demonstrated minimal bias based on students’ demographic factors. However, the second competition, which aimed to predict discourse effectiveness, exhibited moderate bias.

2018

Linguistic Features of Sarcasm and Metaphor Production Quality
Stephen Skalicky | Scott Crossley
Proceedings of the Workshop on Figurative Language Processing

Using linguistic features to detect figurative language has provided a deeper insight into figurative language. The purpose of this study is to assess whether linguistic features can help explain differences in quality of figurative language. In this study a large corpus of metaphors and sarcastic responses is collected from human subjects and rated for figurative language quality based on theoretical components of metaphor, sarcasm, and creativity. Using natural language processing tools, specific linguistic features related to lexical sophistication and semantic cohesion were used to predict the human ratings of figurative language quality. Results demonstrate linguistic features were able to predict small amounts of variance in metaphor and sarcasm production quality.
