2024
Building Robust Content Scoring Models for Student Explanations of Social Justice Science Issues
Allison Bradford | Kenneth Steimel | Brian Riordan | Marcia Linn
Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)
With increased attention to connecting science topics to real-world contexts, like issues of social justice, teachers need support to assess student progress in explaining such issues. In this work, we explore the robustness of NLP-based automatic content scoring models that provide insight into student ability to integrate their science and social justice ideas in two different environmental science contexts. We leverage encoder-only transformer models to capture the degree to which students explain a science phenomenon, understand the intersecting justice issues, and integrate their understanding of science and social justice. We developed models trained on data from each of the contexts as well as from a combined dataset. We found that the models developed in one context generate educationally useful scores in the other context. The model trained on the combined dataset performed as well as or better than the models trained on separate datasets in most cases. Quadratic weighted kappas demonstrate that these models are above the threshold for use in classrooms.
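The agreement metric cited above, quadratic weighted kappa (QWK), can be computed with scikit-learn. A minimal sketch follows; the scores are invented, and the 0.7 cut-off is a common rule of thumb in automated scoring, not a figure taken from this paper.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical rubric scores for eight responses (0-3 scale).
human_scores = [0, 1, 2, 2, 3, 1, 0, 2]
model_scores = [0, 1, 2, 1, 3, 1, 1, 2]

# Quadratic weighting penalizes large disagreements more than small ones.
qwk = cohen_kappa_score(human_scores, model_scores, weights="quadratic")
print(f"QWK = {qwk:.3f}, above assumed 0.7 threshold: {qwk >= 0.7}")
```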
2020
Don’t take “nswvtnvakgxpm” for an answer – The surprising vulnerability of automatic content scoring systems to adversarial input
Yuning Ding | Brian Riordan | Andrea Horbach | Aoife Cahill | Torsten Zesch
Proceedings of the 28th International Conference on Computational Linguistics
Automatic content scoring systems are widely used on short answer tasks to save human effort. However, the use of these systems can invite cheating strategies, such as students writing irrelevant answers in the hopes of gaining at least partial credit. We generate adversarial answers for benchmark content scoring datasets based on different methods of increasing sophistication and show that even simple methods lead to a surprising decrease in content scoring performance. As an extreme example, up to 60% of adversarial answers generated from random shuffling of words in real answers are accepted by a state-of-the-art scoring system. In addition to analyzing the vulnerabilities of content scoring systems, we examine countermeasures such as adversarial training and show that these measures improve system robustness against adversarial answers considerably but do not suffice to completely solve the problem.
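The simplest attack described above, shuffling the words of a real answer, is easy to reproduce. A minimal sketch (the example answer is invented):

```python
import random

def shuffle_adversarial(answer: str, seed: int = 0) -> str:
    """Create an adversarial answer by randomly shuffling the words
    of a real answer, the simplest attack studied in the paper."""
    words = answer.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

print(shuffle_adversarial("the cell membrane controls what enters the cell"))
```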
Using PRMSE to evaluate automated scoring systems in the presence of label noise
Anastassia Loukina | Nitin Madnani | Aoife Cahill | Lili Yao | Matthew S. Johnson | Brian Riordan | Daniel F. McCaffrey
Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications
The effect of noisy labels on the performance of NLP systems has been studied extensively for system training. In this paper, we focus on the effect that noisy labels have on system evaluation. Using automated scoring as an example, we demonstrate that the quality of the human ratings used for system evaluation has a substantial impact on traditional performance metrics, making it impossible to compare system evaluations based on labels of different quality. We propose that a new metric, PRMSE, developed within the educational measurement community, can help address this issue, and provide practical guidelines on using PRMSE.
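To make the idea concrete, here is a simplified PRMSE estimator for the special case of exactly two human ratings per response. It is a sketch of the general logic (rater error variance, estimated from human disagreement, is subtracted from both the observed MSE and the observed score variance), not the paper's full estimator, which handles varying numbers of ratings per response.

```python
import numpy as np

def prmse(system: np.ndarray, h1: np.ndarray, h2: np.ndarray) -> float:
    """Proportional reduction in mean squared error against (unobserved)
    true scores, estimated from double-scored data."""
    h_mean = (h1 + h2) / 2.0
    var_err = np.mean((h1 - h2) ** 2) / 2.0    # single-rater error variance
    var_true = np.var(h_mean) - var_err / 2.0  # estimated true-score variance
    mse_true = np.mean((system - h_mean) ** 2) - var_err / 2.0
    return 1.0 - mse_true / var_true
```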
An empirical investigation of neural methods for content scoring of science explanations
Brian Riordan | Sarah Bichler | Allison Bradford | Jennifer King Chen | Korah Wiley | Libby Gerard | Marcia C. Linn
Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications
With the widespread adoption of the Next Generation Science Standards (NGSS), science teachers and online learning environments face the challenge of evaluating students’ integration of different dimensions of science learning. Recent advances in representation learning have proven effective across many natural language processing tasks, but a rigorous evaluation of the relative merits of these methods for scoring complex constructed response formative assessments has not previously been carried out. We present a detailed empirical investigation of feature-based, recurrent neural network, and pre-trained transformer models on scoring content in real-world formative assessment data. We demonstrate that recent neural methods can rival or exceed the performance of feature-based methods. We also provide evidence that different classes of neural models take advantage of different learning cues, and pre-trained transformer models may be more robust to spurious, dataset-specific learning cues, better reflecting scoring rubrics.
Context-based Automated Scoring of Complex Mathematical Responses
Aoife Cahill | James H Fife | Brian Riordan | Avijit Vajpayee | Dmytro Galochkin
Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications
The tasks of automatically scoring either textual or algebraic responses to mathematical questions have both been well-studied, albeit separately. In this paper we propose a method for automatically scoring responses that contain both text and algebraic expressions. Our method not only achieves high agreement with human raters, but also links explicitly to the scoring rubric – essentially providing explainable models and a way to potentially provide feedback to students in the future.
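The paper does not publish its implementation, but the algebraic half of the task can be illustrated with a symbolic equivalence check, sketched here with sympy; the function name and example expressions are invented for illustration.

```python
import sympy

def matches_rubric(student_expr: str, rubric_expr: str) -> bool:
    """Return True if the student's expression is algebraically
    equivalent to a rubric expression, e.g. '2*x + x' vs. '3*x'."""
    try:
        student = sympy.sympify(student_expr)
        rubric = sympy.sympify(rubric_expr)
    except (sympy.SympifyError, SyntaxError):
        return False
    # Equivalent expressions simplify to a difference of zero.
    return sympy.simplify(student - rubric) == 0

print(matches_rubric("2*x + x", "3*x"))  # True
```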
2019
How to account for mispellings: Quantifying the benefit of character representations in neural content scoring models
Brian Riordan | Michael Flor | Robert Pugh
Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications
Character-based representations in neural models have been claimed to be a tool to overcome spelling variation in word token-based input. We examine this claim in neural models for content scoring. We formulate precise hypotheses about the possible effects of adding character representations to word-based models and test these hypotheses on large-scale, real-world content scoring datasets. We find that, while character representations may provide small performance gains in general, their effectiveness in accounting for spelling variation may be limited. We show that spelling correction can provide larger gains than character representations, and that spelling correction improves the performance of models with character representations. With these insights, we report a new state of the art on the ASAP-SAS content scoring dataset.
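As a toy illustration of the kind of spelling-correction preprocessing the paper finds effective, the standard library's difflib can map out-of-vocabulary tokens to close in-vocabulary neighbours; the paper's actual corrector is more sophisticated, and the cutoff here is an arbitrary assumption.

```python
import difflib

def correct_token(token: str, vocabulary: list[str]) -> str:
    """Map an out-of-vocabulary token to its closest in-vocabulary
    neighbour, if a sufficiently similar one exists."""
    if token in vocabulary:
        return token
    close = difflib.get_close_matches(token, vocabulary, n=1, cutoff=0.8)
    return close[0] if close else token

print(correct_token("mispelling", ["misspelling", "meaning"]))  # misspelling
```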
2018
Atypical Inputs in Educational Applications
Su-Youn Yoon | Aoife Cahill | Anastassia Loukina | Klaus Zechner | Brian Riordan | Nitin Madnani
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers)
In large-scale educational assessments, the use of automated scoring has recently become quite common. While the majority of student responses can be processed and scored without difficulty, a small number of responses have atypical characteristics that make it difficult for an automated scoring system to assign a correct score. We describe a pipeline that detects and processes these kinds of responses at run-time. We present the most frequent kinds of these so-called non-scorable responses along with effective filtering models based on various NLP and speech processing technologies. We give an overview of two operational automated scoring systems, one for essay scoring and one for speech scoring, and describe the filtering models they use. Finally, we present an evaluation and analysis of the filtering models used for spoken responses in an assessment of language proficiency.
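One crude signal such a filter might use is vocabulary overlap with the prompt; the paper's operational filters combine many NLP and speech features, so the heuristic below is only a hypothetical illustration.

```python
def is_scorable(response: str, prompt_vocab: set[str],
                min_overlap: float = 0.1) -> bool:
    """Toy filtering rule: flag a response as non-scorable when almost
    none of its tokens overlap with the prompt vocabulary (a crude
    off-topic / gibberish signal)."""
    tokens = set(response.lower().split())
    if not tokens:
        return False
    return len(tokens & prompt_vocab) / len(tokens) >= min_overlap
```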
A Semantic Role-based Approach to Open-Domain Automatic Question Generation
Michael Flor | Brian Riordan
Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications
We present a novel rule-based system for automatic generation of factual questions from sentences, using semantic role labeling (SRL) as the main form of text analysis. The system is capable of generating both wh-questions and yes/no questions from the same semantic analysis. We present an extensive evaluation of the system and compare it to a recent neural network architecture for question generation. The SRL-based system outperforms the neural system in both average quality and variety of generated questions.
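A single rule from such a system might turn an SRL frame into a who-question by replacing the agent role. The toy sketch below shows the idea; the frame and rule are invented, and real rules must handle tense, auxiliaries, and many more role combinations.

```python
def wh_question(frame: dict[str, str]) -> str | None:
    """Toy rule: replace the agent (ARG0) of an SRL frame with 'Who'
    to form a wh-question."""
    if not all(role in frame for role in ("ARG0", "V", "ARG1")):
        return None
    return f"Who {frame['V']} {frame['ARG1']}?"

frame = {"ARG0": "Marie Curie", "V": "discovered", "ARG1": "radium"}
print(wh_question(frame))  # Who discovered radium?
```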
2017
Investigating neural architectures for short answer scoring
Brian Riordan | Andrea Horbach | Aoife Cahill | Torsten Zesch | Chong Min Lee
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications
Neural approaches to automated essay scoring have recently shown state-of-the-art performance. The automated essay scoring task typically involves a broad notion of writing quality that encompasses content, grammar, organization, and conventions. This differs from the short answer content scoring task, which focuses on content accuracy. The inputs to neural essay scoring models – n-grams and embeddings – are arguably well-suited to evaluate content in short answer scoring tasks. We investigate how several basic neural approaches similar to those used for automated essay scoring perform on short answer scoring. We show that neural architectures can outperform a strong non-neural baseline, but performance and optimal parameter settings vary across the more diverse types of prompts typical of short answer scoring.
2016
Automatically Scoring Tests of Proficiency in Music Instruction
Nitin Madnani | Aoife Cahill | Brian Riordan
Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications
Evaluating Argumentative and Narrative Essays using Graphs
Swapna Somasundaran | Brian Riordan | Binod Gyawali | Su-Youn Yoon
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers
This work investigates whether the development of ideas in writing can be captured by graph properties derived from the text. Focusing on student essays, we represent the essay as a graph, and encode a variety of graph properties including PageRank as features for modeling essay scores related to quality of development. We demonstrate that our approach improves on a state-of-the-art system on the task of holistic scoring of persuasive essays and on the task of scoring narrative essays along the development dimension.
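One possible reading of the approach: build a graph over an essay's words and use PageRank statistics as features. The construction below (word co-occurrence within a sentence, via networkx) is an assumption for illustration; the paper's graph encoding is more elaborate.

```python
import itertools
import networkx as nx

def essay_graph_features(sentences: list[list[str]]) -> dict[str, float]:
    """Build a word co-occurrence graph over an essay and summarize
    its PageRank distribution as candidate scoring features."""
    graph = nx.Graph()
    for sent in sentences:
        # Connect every pair of distinct words in the same sentence.
        for w1, w2 in itertools.combinations(sorted(set(sent)), 2):
            graph.add_edge(w1, w2)
    if graph.number_of_nodes() == 0:
        return {"max_pagerank": 0.0, "median_pagerank": 0.0}
    ranks = sorted(nx.pagerank(graph).values(), reverse=True)
    return {"max_pagerank": ranks[0],
            "median_pagerank": ranks[len(ranks) // 2]}
```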
2014
Detecting Sociostructural Beliefs about Group Status Differences in Online Discussions
Brian Riordan | Heather Wade | Afzal Upal
Proceedings of the Joint Workshop on Social Dynamics and Personal Attributes in Social Media