LLMs in Short Answer Scoring: Limitations and Promise of Zero-Shot and Few-Shot Approaches

Imran Chamieh, Torsten Zesch, Klaus Giebermann


Abstract
In this work, we investigate the potential of Large Language Models (LLMs) for automated short answer scoring. We test zero-shot and few-shot settings, and compare with fine-tuned models and a supervised upper bound, across three diverse datasets. Our results show that LLMs perform poorly in zero-shot and few-shot settings: they have difficulty with tasks that require complex reasoning or domain-specific knowledge, while showing promise on general knowledge tasks. The fine-tuned models come close to the supervised results but are still not feasible for application, highlighting potential overfitting issues. Overall, our study highlights the challenges and limitations of LLMs in short answer scoring and indicates that there currently seems to be no basis for applying LLMs for short answer scoring.
Anthology ID:
2024.bea-1.25
Volume:
Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Ekaterina Kochmar, Marie Bexte, Jill Burstein, Andrea Horbach, Ronja Laarmann-Quante, Anaïs Tack, Victoria Yaneva, Zheng Yuan
Venue:
BEA
SIG:
SIGEDU
Publisher:
Association for Computational Linguistics
Pages:
309–315
URL:
https://aclanthology.org/2024.bea-1.25
Cite (ACL):
Imran Chamieh, Torsten Zesch, and Klaus Giebermann. 2024. LLMs in Short Answer Scoring: Limitations and Promise of Zero-Shot and Few-Shot Approaches. In Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024), pages 309–315, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
LLMs in Short Answer Scoring: Limitations and Promise of Zero-Shot and Few-Shot Approaches (Chamieh et al., BEA 2024)
PDF:
https://aclanthology.org/2024.bea-1.25.pdf