Anastassia Loukina


2022

pdf bib
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Track
Anastassia Loukina | Rashmi Gangadharaiah | Bonan Min
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Track

2020

pdf bib
Using PRMSE to evaluate automated scoring systems in the presence of label noise
Anastassia Loukina | Nitin Madnani | Aoife Cahill | Lili Yao | Matthew S. Johnson | Brian Riordan | Daniel F. McCaffrey
Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications

The effect of noisy labels on the performance of NLP systems has been studied extensively for system training. In this paper, we focus on the effect that noisy labels have on system evaluation. Using automated scoring as an example, we demonstrate that the quality of human ratings used for system evaluation have a substantial impact on traditional performance metrics, making it impossible to compare system evaluations on labels with different quality. We propose that a new metric, PRMSE, developed within the educational measurement community, can help address this issue, and provide practical guidelines on using PRMSE.

pdf bib
User-centered & Robust NLP OSS: Lessons Learned from Developing & Maintaining RSMTool
Nitin Madnani | Anastassia Loukina
Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)

For the last 5 years, we have developed and maintained RSMTool – an open-source tool for evaluating NLP systems that automatically score written and spoken responses. RSMTool is designed to be cross-disciplinary, borrowing heavily from NLP, machine learning, and educational measurement. Its cross-disciplinary nature has required us to learn a user-centered development approach in terms of both design and implementation. We share some of these lessons in this paper.

2019

pdf bib
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Industry Papers)
Anastassia Loukina | Michelle Morales | Rohit Kumar
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Industry Papers)

pdf bib
My Turn To Read: An Interleaved E-book Reading Tool for Developing and Struggling Readers
Nitin Madnani | Beata Beigman Klebanov | Anastassia Loukina | Binod Gyawali | Patrick Lange | John Sabatini | Michael Flor
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

Literacy is crucial for functioning in modern society. It underpins everything from educational attainment and employment opportunities to health outcomes. We describe My Turn To Read, an app that uses interleaved reading to help developing and struggling readers improve reading skills while reading for meaning and pleasure. We hypothesize that the longer-term impact of the app will be to help users become better, more confident readers with an increased stamina for extended reading. We describe the technology and present preliminary evidence in support of this hypothesis.

pdf bib
The many dimensions of algorithmic fairness in educational applications
Anastassia Loukina | Nitin Madnani | Klaus Zechner
Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications

The issues of algorithmic fairness and bias have recently featured prominently in many publications highlighting the fact that training the algorithms for maximum performance may often result in predictions that are biased against various groups. Educational applications based on NLP and speech processing technologies often combine multiple complex machine learning algorithms and are thus vulnerable to the same sources of bias as other machine learning systems. Yet such systems can have high impact on people’s lives especially when deployed as part of high-stakes tests. In this paper we discuss different definitions of fairness and possible ways to apply them to educational applications. We then use simulated and real data to consider how test-takers’ native language backgrounds can affect their automated scores on an English language proficiency assessment. We illustrate that total fairness may not be achievable and that different definitions of fairness may require different solutions.

2018

pdf bib
Towards Understanding Text Factors in Oral Reading
Anastassia Loukina | Van Rynald T. Liceralde | Beata Beigman Klebanov
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

Using a case study, we show that variation in oral reading rate across passages for professional narrators is consistent across readers and much of it can be explained using features of the texts being read. While text complexity is a poor predictor of the reading rate, a substantial share of variability can be explained by timing and story-based factors with performance reaching r=0.75 for unseen passages and narrator.

pdf bib
Atypical Inputs in Educational Applications
Su-Youn Yoon | Aoife Cahill | Anastassia Loukina | Klaus Zechner | Brian Riordan | Nitin Madnani
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers)

In large-scale educational assessments, the use of automated scoring has recently become quite common. While the majority of student responses can be processed and scored without difficulty, there are a small number of responses that have atypical characteristics that make it difficult for an automated scoring system to assign a correct score. We describe a pipeline that detects and processes these kinds of responses at run-time. We present the most frequent kinds of what are called non-scorable responses along with effective filtering models based on various NLP and speech processing technologies. We give an overview of two operational automated scoring systems —one for essay scoring and one for speech scoring— and describe the filtering models they use. Finally, we present an evaluation and analysis of filtering models used for spoken responses in an assessment of language proficiency.

pdf bib
Using exemplar responses for training and evaluating automated speech scoring systems
Anastassia Loukina | Klaus Zechner | James Bruno | Beata Beigman Klebanov
Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications

Automated scoring engines are usually trained and evaluated against human scores and compared to the benchmark of human-human agreement. In this paper we compare the performance of an automated speech scoring engine using two corpora: a corpus of almost 700,000 randomly sampled spoken responses with scores assigned by one or two raters during operational scoring, and a corpus of 16,500 exemplar responses with scores reviewed by multiple expert raters. We show that the choice of corpus used for model evaluation has a major effect on estimates of system performance with r varying between 0.64 and 0.80. Surprisingly, this is not the case for the choice of corpus for model training: when the training corpus is sufficiently large, the systems trained on different corpora showed almost identical performance when evaluated on the same corpus. We show that this effect is consistent across several learning algorithms. We conclude that evaluating the model on a corpus of exemplar responses if one is available provides additional evidence about system validity; at the same time, investing effort into creating a corpus of exemplar responses for model training is unlikely to lead to a substantial gain in model performance.

pdf bib
Word-Embedding based Content Features for Automated Oral Proficiency Scoring
Su-Youn Yoon | Anastassia Loukina | Chong Min Lee | Matthew Mulholland | Xinhao Wang | Ikkyu Choi
Proceedings of the Third Workshop on Semantic Deep Learning

In this study, we develop content features for an automated scoring system of non-native English speakers’ spontaneous speech. The features calculate the lexical similarity between the question text and the ASR word hypothesis of the spoken response, based on traditional word vector models or word embeddings. The proposed features do not require any sample training responses for each question, and this is a strong advantage since collecting question-specific data is an expensive task, and sometimes even impossible due to concerns about question exposure. We explore the impact of these new features on the automated scoring of two different question types: (a) providing opinions on familiar topics and (b) answering a question about a stimulus material. The proposed features showed statistically significant correlations with the oral proficiency scores, and the combination of new features with the speech-driven features achieved a small but significant further improvement for the latter question type. Further analyses suggested that the new features were effective in assigning more accurate scores for responses with serious content issues.

2017

pdf bib
Building Better Open-Source Tools to Support Fairness in Automated Scoring
Nitin Madnani | Anastassia Loukina | Alina von Davier | Jill Burstein | Aoife Cahill
Proceedings of the First ACL Workshop on Ethics in Natural Language Processing

Automated scoring of written and spoken responses is an NLP application that can significantly impact lives especially when deployed as part of high-stakes tests such as the GRE® and the TOEFL®. Ethical considerations require that automated scoring algorithms treat all test-takers fairly. The educational measurement community has done significant research on fairness in assessments and automated scoring systems must incorporate their recommendations. The best way to do that is by making available automated, non-proprietary tools to NLP researchers that directly incorporate these recommendations and generate the analyses needed to help identify and resolve biases in their scoring systems. In this paper, we attempt to provide such a solution.

pdf bib
Speech- and Text-driven Features for Automated Scoring of English Speaking Tasks
Anastassia Loukina | Nitin Madnani | Aoife Cahill
Proceedings of the Workshop on Speech-Centric Natural Language Processing

We consider the automatic scoring of a task for which both the content of the response as well its spoken fluency are important. We combine features from a text-only content scoring system originally designed for written responses with several categories of acoustic features. Although adding any single category of acoustic features to the text-only system on its own does not significantly improve performance, adding all acoustic features together does yield a small but significant improvement. These results are consistent for responses to open-ended questions and to questions focused on some given source material.

pdf bib
Continuous fluency tracking and the challenges of varying text complexity
Beata Beigman Klebanov | Anastassia Loukina | John Sabatini | Tenaha O’Reilly
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications

This paper is a preliminary report on using text complexity measurement in the service of a new educational application. We describe a reading intervention where a child takes turns reading a book aloud with a virtual reading partner. Our ultimate goal is to provide meaningful feedback to the parent or the teacher by continuously tracking the child’s improvement in reading fluency. We show that this would not be a simple endeavor, due to an intricate relationship between text complexity from the point of view of comprehension and reading rate.

pdf bib
A Large Scale Quantitative Exploration of Modeling Strategies for Content Scoring
Nitin Madnani | Anastassia Loukina | Aoife Cahill
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications

We explore various supervised learning strategies for automated scoring of content knowledge for a large corpus of 130 different content-based questions spanning four subject areas (Science, Math, English Language Arts, and Social Studies) and containing over 230,000 responses scored by human raters. Based on our analyses, we provide specific recommendations for content scoring. These are based on patterns observed across multiple questions and assessments and are, therefore, likely to generalize to other scenarios and prove useful to the community as automated content scoring becomes more popular in schools and classrooms.

2016

pdf bib
Textual complexity as a predictor of difficulty of listening items in language proficiency tests
Anastassia Loukina | Su-Youn Yoon | Jennifer Sakano | Youhua Wei | Kathy Sheehan
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

In this paper we explore to what extent the difficulty of listening items in an English language proficiency test can be predicted by the textual properties of the prompt. We show that a system based on multiple text complexity features can predict item difficulty for several different item types and for some items achieves higher accuracy than human estimates of item difficulty.

pdf bib
Automated scoring across different modalities
Anastassia Loukina | Aoife Cahill
Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications

2015

pdf bib
Feature selection for automated speech scoring
Anastassia Loukina | Klaus Zechner | Lei Chen | Michael Heilman
Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications

2014

pdf bib
Automatic evaluation of spoken summaries: the case of language assessment
Anastassia Loukina | Klaus Zechner | Lei Chen
Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications