Yuning Ding

2025

Increasing the Generalizability of Similarity-Based Essay Scoring Through Cross-Prompt Training
Marie Bexte | Yuning Ding | Andrea Horbach
Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)

In this paper, we address generic essay scoring, i.e., the use of training data from one writing task to score data from a different task. We approach this by generalizing a similarity-based essay scoring method (Xie et al., 2022) to learning from texts that are written in response to a mixture of different prompts. In our experiments, we compare within-prompt and cross-prompt performance on two large datasets (ASAP and PERSUADE). We combine different numbers of prompts in the training data and show that our generalized method substantially improves cross-prompt performance, especially when an increasing number of prompts is used to form the training data. In the most extreme case, this more than doubles performance, increasing QWK from .26 to .55.

Don’t Score too Early! Evaluating Argument Mining Models on Incomplete Essays
Nils-Jonathan Schaller | Yuning Ding | Thorben Jansen | Andrea Horbach
Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)

Students’ argumentative writing benefits from receiving automated feedback, particularly throughout the writing process. While Argument Mining (AM) technology shows promise for delivering automated feedback on argumentative structures, existing systems are frequently trained on completed essays, which provide rich context information. This raises concerns about their usefulness for offering writing support on incomplete texts during the writing process. This study evaluates the robustness of AM algorithms on artificially fragmented learner texts from two large-scale corpora of secondary school essays: the German DARIUS corpus and the English PERSUADE corpus. Our analysis reveals that token-level sequence-tagging methods, while highly effective on complete essays, suffer significantly when context is limited or misleading. Conversely, sentence-level classifiers maintain relative stability under such conditions. We show that deliberately training AM models on fragmented input substantially mitigates these context-related weaknesses, enabling AM systems to better support dynamic educational writing scenarios.

FEAT-writing: An Interactive Training System for Argumentative Writing
Yuning Ding | Franziska Wehrhahn | Andrea Horbach
Proceedings of the 31st International Conference on Computational Linguistics: System Demonstrations

Recent developments in Natural Language Processing (NLP) for argument mining offer new opportunities to analyze the argumentative units (AUs) in student essays. These advancements can be leveraged to provide automatically generated feedback and exercises for students engaging in online argumentative essay writing practice. Writing standards for both native English speakers (L1) and English-as-a-foreign-language (L2) learners require students to understand formal essay structures and different AUs. To address this need, we developed FEAT-writing (Feedback and Exercises for Argumentative Training in writing), an interactive system that provides students with automatically generated exercises and distinct feedback on their argumentative writing. In a preliminary evaluation involving 346 students, we assessed the impact of six different automated feedback types on essay quality, with results showing general improvements in writing after receiving feedback from the system.

2024

Fairness in Automated Essay Scoring: A Comparative Analysis of Algorithms on German Learner Essays from Secondary Education
Nils-Jonathan Schaller | Yuning Ding | Andrea Horbach | Jennifer Meyer | Thorben Jansen
Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)

Pursuing educational equity, particularly in writing instruction, requires that all students receive fair (i.e., accurate and unbiased) assessment and feedback on their texts. Automated Essay Scoring (AES) algorithms have so far focused on optimizing the mean accuracy of their scores and paid less attention to fair scores for all subgroups, although research shows that students receive unfair scores on their essays in relation to demographic variables, which in turn are related to their writing competence. We add to the literature arguing that AES should also optimize for fairness by presenting insights on the fairness of scoring algorithms on a corpus of learner texts in the German language, and we introduce the novelty of examining fairness with respect to psychological differences in addition to demographic differences. We compare shallow learning, deep learning, and large language models with full and skewed subsets of training data to investigate what is needed for fair scoring. The results show that training on a skewed subset of higher and lower cognitive ability students shows no bias but very low accuracy for students outside the training set. Our results highlight the need for specific training data on all relevant user groups, not only for demographic background variables but also for cognitive abilities as psychological student characteristics.

Transfer Learning of Argument Mining in Student Essays
Yuning Ding | Julian Lohmann | Nils-Jonathan Schaller | Thorben Jansen | Andrea Horbach
Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)

This paper explores the transferability of a cross-prompt argument mining model trained on argumentative essays authored by native English-speaking learners (EN-L1) across educational contexts and languages. Specifically, the adaptability of a multilingual transformer model is assessed through its application to comparable argumentative essays authored by English-as-a-foreign-language learners (EN-L2) for context transfer, and a dataset composed of essays written by native German learners (DE) for both language and task transfer. To separate language effects from educational context effects, we also perform experiments on a machine-translated version of the German dataset (DE-MT). Our findings demonstrate that, even under zero-shot conditions, a model trained on native English speakers exhibits satisfactory performance on the EN-L2/DE datasets. Machine translation does not substantially enhance this performance, suggesting that distinct writing styles across educational contexts impact performance more than language differences.

DARIUS: A Comprehensive Learner Corpus for Argument Mining in German-Language Essays
Nils-Jonathan Schaller | Andrea Horbach | Lars Ingver Höft | Yuning Ding | Jan Luca Bahr | Jennifer Meyer | Thorben Jansen
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

In this paper, we present the DARIUS (Digital Argumentation Instruction for Science) corpus for argumentation quality on 4589 essays written by 1839 German secondary school students. The corpus is annotated according to a fine-grained annotation scheme, ranging from a broader perspective like content zones, to more granular features like argumentation coverage/reach and argumentative discourse units like claims and warrants. The features have inter-annotator agreements up to 0.83 Krippendorff’s α. The corpus and dataset are publicly available for further research in argument mining.

When Argumentation Meets Cohesion: Enhancing Automatic Feedback in Student Writing
Yuning Ding | Omid Kashefi | Swapna Somasundaran | Andrea Horbach
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

In this paper, we investigate the role of arguments in the automatic scoring of cohesion in argumentative essays. The feature analysis reveals that in argumentative essays, the lexical cohesion between claims is more important to the overall cohesion, while the evidence is expected to be diverse and divergent. Our results show that combining features related to argument segments and cohesion features improves the performance of the automatic cohesion scoring model trained on a transformer. The cohesion score is also learned more accurately in a multi-task learning process by adding the automatic segmentation of argumentative elements as an auxiliary task. Our findings contribute to both the understanding of cohesion in argumentative writing and the development of automatic feedback.

2023

CATALPA_EduNLP at PragTag-2023
Yuning Ding | Marie Bexte | Andrea Horbach
Proceedings of the 10th Workshop on Argument Mining

This paper describes our contribution to the PragTag-2023 Shared Task. We describe and compare different approaches based on sentence classification, sentence similarity, and sequence tagging. We find that a BERT-based sentence labeling approach integrating positional information outperforms both sequence tagging and SBERT-based sentence classification. We further provide analyses highlighting the potential of combining different approaches.

Score It All Together: A Multi-Task Learning Study on Automatic Scoring of Argumentative Essays
Yuning Ding | Marie Bexte | Andrea Horbach
Findings of the Association for Computational Linguistics: ACL 2023

When scoring argumentative essays in an educational context, not only the presence or absence of certain argumentative elements but also their quality is important. On the recently published student essay dataset PERSUADE, we first show that the automatic scoring of argument quality benefits from additional information about context, writing prompt and argument type. We then explore the different combinations of three tasks: automated span detection, type and quality prediction. Results show that a multi-task learning approach combining the three tasks outperforms sequential approaches that first learn to segment and then predict the quality/type of a segment.

Sequence Tagging in EFL Email Texts as Feedback for Language Learners
Yuning Ding | Ruth Trüb | Johanna Fleckenstein | Stefan Keller | Andrea Horbach
Proceedings of the 12th Workshop on NLP for Computer Assisted Language Learning

2022

Don’t Drop the Topic - The Role of the Prompt in Argument Identification in Student Writing
Yuning Ding | Marie Bexte | Andrea Horbach
Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022)

In this paper, we explore the role of topic information in student essays from an argument mining perspective. We cluster a recently released corpus through topic modeling into prompts and train argument identification models on different data settings. Results show that, given the same amount of training data, prompt-specific training performs better than cross-prompt training. However, the advantage can be overcome by introducing large amounts of cross-prompt training data.

2020

Chinese Content Scoring: Open-Access Datasets and Features on Different Segmentation Levels
Yuning Ding | Andrea Horbach | Torsten Zesch
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing

In this paper, we analyse the challenges of Chinese content scoring in comparison to English. As a review of prior work for Chinese content scoring shows a lack of open-access data in the field, we present two short-answer data sets for Chinese. The Chinese Educational Short Answers data set (CESA) contains 1800 student answers for five science-related questions. As a second data set, we collected ASAP-ZH with 942 answers by re-using three existing prompts from the ASAP data set. We adapt a state-of-the-art content scoring system for Chinese and evaluate it in several settings on these data sets. Results show that features on lower segmentation levels such as character n-grams tend to have better performance than features on token level.

Don’t take “nswvtnvakgxpm” for an answer – The surprising vulnerability of automatic content scoring systems to adversarial input
Yuning Ding | Brian Riordan | Andrea Horbach | Aoife Cahill | Torsten Zesch
Proceedings of the 28th International Conference on Computational Linguistics

Automatic content scoring systems are widely used on short answer tasks to save human effort. However, the use of these systems can invite cheating strategies, such as students writing irrelevant answers in the hopes of gaining at least partial credit. We generate adversarial answers for benchmark content scoring datasets based on different methods of increasing sophistication and show that even simple methods lead to a surprising decrease in content scoring performance. As an extreme example, up to 60% of adversarial answers generated from random shuffling of words in real answers are accepted by a state-of-the-art scoring system. In addition to analyzing the vulnerabilities of content scoring systems, we examine countermeasures such as adversarial training and show that these measures improve system robustness against adversarial answers considerably but do not suffice to completely solve the problem.

2017

Fine-grained essay scoring of a complex writing task for native speakers
Andrea Horbach | Dirk Scholten-Akoun | Yuning Ding | Torsten Zesch
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications

Automatic essay scoring is nowadays successfully used even in high-stakes tests, but this is mainly limited to holistic scoring of learner essays. We present a new dataset of essays written by highly proficient German native speakers that is scored using a fine-grained rubric with the goal of providing detailed feedback. Our experiments with two state-of-the-art scoring systems (a neural and an SVM-based one) show a large drop in performance compared to existing datasets. This demonstrates the need for such datasets to guide research on more elaborate essay scoring methods.

The Influence of Spelling Errors on Content Scoring Performance
Andrea Horbach | Yuning Ding | Torsten Zesch
Proceedings of the 4th Workshop on Natural Language Processing Techniques for Educational Applications (NLPTEA 2017)

Spelling errors occur frequently in educational settings, but their influence on automatic scoring is largely unknown. We therefore investigate the influence of spelling errors on content scoring performance using the example of the ASAP corpus. We conduct an annotation study on the nature of spelling errors in the ASAP dataset and utilize these findings in machine learning experiments that measure the influence of spelling errors on automatic content scoring. Our main finding is that scoring methods using both token and character n-gram features are robust against spelling errors up to the error frequency in ASAP.