Kevin P. Yancey


2024

BERT-IRT: Accelerating Item Piloting with BERT Embeddings and Explainable IRT Models
Kevin P. Yancey | Andrew Runge | Geoffrey LaFlair | Phoebe Mulcaire
Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)

Estimating item parameters (e.g., the difficulty of a question) is an important part of modern high-stakes tests. Conventional methods require lengthy pilots to collect response data from a representative population of test-takers. The need for these pilots limits item bank size and how often those item banks can be refreshed, impacting test security, while increasing the costs of supporting the test and taking up test-takers’ valuable time. Our paper presents a novel explanatory item response theory (IRT) model, BERT-IRT, that has been used on the Duolingo English Test (DET), a high-stakes test of English, to reduce the length of pilots by a factor of 10. Our evaluation shows how the model uses BERT embeddings and engineered NLP features to accelerate item piloting without sacrificing criterion validity or reliability.
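
For illustration only, here is a minimal sketch (not the paper's implementation) of how a BERT-IRT-style model might combine a per-test-taker ability parameter with item difficulties predicted from BERT embeddings and engineered NLP features; the class name, dimensions, and training setup are all assumptions.

import torch
import torch.nn as nn

# Hypothetical sketch: a Rasch-style IRT model whose item difficulty is
# explained by precomputed BERT embeddings plus engineered NLP features.
class BertIRT(nn.Module):
    def __init__(self, n_test_takers: int, emb_dim: int = 768, n_features: int = 20):
        super().__init__()
        self.theta = nn.Embedding(n_test_takers, 1)           # latent ability per test-taker
        self.difficulty = nn.Linear(emb_dim + n_features, 1)  # difficulty from item features

    def forward(self, taker_ids, item_emb, item_feats):
        b = self.difficulty(torch.cat([item_emb, item_feats], dim=-1))  # item difficulty
        theta = self.theta(taker_ids)                                   # test-taker ability
        return torch.sigmoid(theta - b)                                 # P(correct response)

# Such a model could be fit with binary cross-entropy (nn.BCELoss) on observed
# correct/incorrect responses from a short pilot, e.g. with 500 test-takers:
model = BertIRT(n_test_takers=500)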

2023

Rating Short L2 Essays on the CEFR Scale with GPT-4
Kevin P. Yancey | Geoffrey LaFlair | Anthony Verardi | Jill Burstein
Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023)

Essay scoring is a critical task used to evaluate second-language (L2) writing proficiency on high-stakes language assessments. While automated scoring approaches are mature and have been around for decades, human scoring is still considered the gold standard, despite its high costs and well-known issues such as human rater fatigue and bias. The recent introduction of large language models (LLMs) brings new opportunities for automated scoring. In this paper, we evaluate how well GPT-3.5 and GPT-4 can rate short essay responses written by L2 English learners on a high-stakes language assessment, computing inter-rater agreement with human ratings. Results show that when calibration examples are provided, GPT-4 can perform almost as well as modern Automatic Writing Evaluation (AWE) methods, but agreement with human ratings can vary depending on the test-taker’s first language (L1).
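
As a hedged illustration (not the prompts or code used in the paper), the sketch below shows one way an essay could be rated with GPT-4 given calibration examples, with agreement against human ratings measured by quadratic weighted kappa; the prompt wording and helper names are assumptions.

from openai import OpenAI
from sklearn.metrics import cohen_kappa_score

client = OpenAI()
CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]

def rate_essay(essay: str, calibration_examples: list[tuple[str, str]]) -> str:
    # Provide calibration examples as prior user/assistant turns, then ask for a rating.
    messages = [{"role": "system",
                 "content": "Rate the essay on the CEFR scale (A1-C2). Reply with the level only."}]
    for example_text, cefr_level in calibration_examples:
        messages.append({"role": "user", "content": example_text})
        messages.append({"role": "assistant", "content": cefr_level})
    messages.append({"role": "user", "content": essay})
    response = client.chat.completions.create(model="gpt-4", messages=messages, temperature=0)
    return response.choices[0].message.content.strip()

def agreement(human_ratings: list[str], model_ratings: list[str]) -> float:
    # Quadratic weighted kappa is a common agreement metric for ordinal scales like the CEFR.
    to_int = {level: i for i, level in enumerate(CEFR_LEVELS)}
    return cohen_kappa_score([to_int[r] for r in human_ratings],
                             [to_int[r] for r in model_ratings],
                             weights="quadratic")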

2022

FABRA: French Aggregator-Based Readability Assessment toolkit
Rodrigo Wilkens | David Alfter | Xiaoou Wang | Alice Pintard | Anaïs Tack | Kevin P. Yancey | Thomas François
Proceedings of the Thirteenth Language Resources and Evaluation Conference

In this paper, we present FABRA, a readability toolkit based on the aggregation of a large number of readability predictor variables. The toolkit is implemented as a service-oriented architecture, which obviates the need for installation and simplifies its integration into other projects. We also perform a set of experiments to show which features are most predictive on two different corpora, and how the use of aggregators improves performance over standard feature-based readability prediction. Our experiments show that, for the explored corpora, the most important predictors for native texts are measures of lexical diversity, dependency counts, and text coherence, while the most important predictors for foreign texts are syntactic variables illustrating language development, as well as features linked to lexical sophistication. FABRA has the potential to support new research on readability assessment for French.
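
Purely to illustrate the aggregator idea (this is not FABRA's actual code, feature set, or API), the sketch below turns word-level predictor values into text-level readability features; the function names and token-level predictors are assumptions.

from statistics import mean, stdev

def aggregate(values, prefix):
    # Aggregators summarize word-level predictor values into text-level features.
    return {f"{prefix}_mean": mean(values),
            f"{prefix}_sd": stdev(values) if len(values) > 1 else 0.0,
            f"{prefix}_max": max(values)}

def extract_features(text: str) -> dict:
    words = text.split()
    features = aggregate([len(w) for w in words], "word_length")  # rough lexical sophistication proxy
    features["type_token_ratio"] = len({w.lower() for w in words}) / len(words)  # lexical diversity
    return features

print(extract_features("Le chat dort paisiblement sur le canapé du salon."))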

2021

Jump-Starting Item Parameters for Adaptive Language Tests
Arya D. McCarthy | Kevin P. Yancey | Geoffrey T. LaFlair | Jesse Egbert | Manqian Liao | Burr Settles
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

A challenge in designing high-stakes language assessments is calibrating the test item difficulties, either a priori or from limited pilot test data. While prior work has addressed ‘cold start’ estimation of item difficulties without piloting, we devise a multi-task generalized linear model with BERT features to jump-start these estimates, rapidly improving their quality with as few as 500 test-takers and a small sample of item exposures (≈6 each) from a large item bank (≈4,000 items). Our joint model provides a principled way to compare test-taker proficiency, item difficulty, and language proficiency frameworks like the Common European Framework of Reference (CEFR). It also enables estimating the difficulty of new items without piloting them first, which in turn limits item exposure and thus enhances test item security. Finally, using operational data from the Duolingo English Test, a high-stakes English proficiency test, we find that the difficulty estimates derived with this method correlate strongly with lexico-grammatical features associated with reading complexity.
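
As a rough, hypothetical sketch of the jump-starting idea (not the paper's model specification, and omitting its multi-task structure), the code below fits a logistic-regression GLM in which each test-taker gets an ability coefficient and item difficulty is a linear function of the item's BERT embedding; the synthetic data, dimensions, and regularization are assumptions.

import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.linear_model import LogisticRegression

def build_design_matrix(taker_ids, item_embs, n_takers):
    # One indicator column per test-taker (ability) plus the item's BERT features (difficulty).
    n = len(taker_ids)
    taker_onehot = csr_matrix((np.ones(n), (np.arange(n), taker_ids)), shape=(n, n_takers))
    return hstack([taker_onehot, csr_matrix(item_embs)])

# Synthetic stand-in for sparse pilot data: 500 test-takers, ~6 responses each.
rng = np.random.default_rng(0)
taker_ids = rng.integers(0, 500, size=3000)
item_embs = rng.normal(size=(3000, 768))    # BERT embeddings of the answered items
responses = rng.integers(0, 2, size=3000)   # 1 = correct, 0 = incorrect

X = build_design_matrix(taker_ids, item_embs, n_takers=500)
glm = LogisticRegression(max_iter=1000).fit(X, responses)

# The difficulty of an unpiloted item can then be estimated from its embedding alone.
new_item_emb = rng.normal(size=(1, 768))
estimated_difficulty = -(new_item_emb @ glm.coef_[0, 500:])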