Maja Stahl

2025

ArgInstruct: Specialized Instruction Fine-Tuning for Computational Argumentation
Maja Stahl | Timon Ziegenbein | Joonsuk Park | Henning Wachsmuth
Findings of the Association for Computational Linguistics: ACL 2025

Training large language models (LLMs) to follow instructions has significantly enhanced their ability to tackle unseen tasks. However, despite their strong generalization capabilities, instruction-following LLMs encounter difficulties when dealing with tasks that require domain knowledge. This work introduces a specialized instruction fine-tuning for the domain of computational argumentation (CA). The goal is to enable an LLM to effectively tackle any unseen CA task while preserving its generalization capabilities. Reviewing existing CA research, we crafted natural language instructions for 105 CA tasks to this end. On this basis, we developed a CA-specific benchmark for LLMs that allows for a comprehensive evaluation of LLMs’ capabilities in solving various CA tasks. We synthesized 52k CA-related instructions, adapting the self-instruct process to train a CA-specialized instruction-following LLM. Our experiments suggest that CA-specialized instruction fine-tuning significantly enhances the LLM on both seen and unseen CA tasks. At the same time, performance on the general NLP tasks of the SuperNI benchmark remains stable.

2024

Exploring LLM Prompting Strategies for Joint Essay Scoring and Feedback Generation
Maja Stahl | Leon Biermann | Andreas Nehring | Henning Wachsmuth
Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)

Individual feedback can help students improve their essay writing skills. However, the manual effort required to provide such feedback limits individualization in practice. Automatically generated essay feedback may serve as an alternative to guide students at their own pace, convenience, and desired frequency. Large language models (LLMs) have demonstrated strong performance in generating coherent and contextually relevant text. Yet, their ability to provide helpful essay feedback is unclear. This work explores several prompting strategies for LLM-based zero-shot and few-shot generation of essay feedback. Inspired by Chain-of-Thought prompting, we study how and to what extent automated essay scoring (AES) can benefit the quality of generated feedback. We evaluate both the AES performance that LLMs can achieve with prompting only and the helpfulness of the generated essay feedback. Our results suggest that tackling AES and feedback generation jointly improves AES performance. However, while our manual evaluation emphasizes the quality of the generated essay feedback, the impact of essay scoring on the generated feedback ultimately remains low.

Reference-guided Style-Consistent Content Transfer
Wei-Fan Chen | Milad Alshomary | Maja Stahl | Khalid Al-Khatib | Benno Stein | Henning Wachsmuth
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

In this paper, we introduce the task of style-consistent content transfer, which concerns modifying a text’s content based on a provided reference statement while preserving its original style. We approach the task by employing multi-task learning to ensure that the modified text meets three important conditions: reference faithfulness, style adherence, and coherence. In particular, we train three independent classifiers, one for each condition. During inference, these classifiers are used to determine the best modified text variant. Our evaluation, conducted on hotel reviews and news articles, compares our approach with sequence-to-sequence and error correction baselines. The results demonstrate that our approach reasonably generates text satisfying all three conditions. In subsequent analyses, we highlight the strengths and limitations of our approach, providing valuable insights for future research directions.

A School Student Essay Corpus for Analyzing Interactions of Argumentative Structure and Quality
Maja Stahl | Nadine Michel | Sebastian Kilsbach | Julian Schmidtke | Sara Rezat | Henning Wachsmuth
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Learning argumentative writing is challenging. Besides writing fundamentals such as syntax and grammar, learners must select and arrange argument components meaningfully to create high-quality essays. To support argumentative writing computationally, one step is to mine the argumentative structure. When combined with automatic essay scoring, interactions of the argumentative structure and quality scores can be exploited for comprehensive writing support. Although studies have shown the usefulness of using information about the argumentative structure for essay scoring, no argument mining corpus with ground-truth essay quality annotations has been published yet. Moreover, none of the existing corpora contain essays written by school students specifically. To fill this research gap, we present a German corpus of 1,320 essays from school students of two age groups. Each essay has been manually annotated for argumentative structure and quality on multiple levels of granularity. We propose baseline approaches to argument mining and essay scoring, and we analyze interactions between both tasks, thereby laying the ground for quality-oriented argumentative writing support.

2023

Identifying Feedback Types to Augment Feedback Comment Generation
Maja Stahl | Henning Wachsmuth
Proceedings of the 16th International Natural Language Generation Conference: Generation Challenges

In the context of language learning, feedback comment generation is the task of generating hints or explanatory notes for learner texts that help learners understand why a part of the text is erroneous. This paper presents our approach to the Feedback Comment Generation Shared Task, collocated with the 16th International Natural Language Generation Conference (INLG 2023). The approach augments the generation of feedback comments by a self-supervised identification of feedback types in a multitask learning setting. Within the shared task, other approaches performed more effectively, yet the combined modeling of feedback type classification and feedback comment generation is superior to performing feedback generation only.

Mind the Gap: Automated Corpus Creation for Enthymeme Detection and Reconstruction in Learner Arguments
Maja Stahl | Nick Düsterhus | Mei-Hua Chen | Henning Wachsmuth
Findings of the Association for Computational Linguistics: EMNLP 2023

Writing strong arguments can be challenging for learners. It requires selecting and arranging multiple argumentative discourse units (ADUs) in a logical and coherent way as well as deciding which ADUs to leave implicit, so-called enthymemes. However, when important ADUs are missing, readers might not be able to follow the reasoning or understand the argument’s main point. This paper introduces two new tasks for learner arguments: to identify gaps in arguments (enthymeme detection) and to fill such gaps (enthymeme reconstruction). Approaches to both tasks may help learners improve their argument quality. We study how corpora for these tasks can be created automatically by deleting ADUs from an argumentative text that are central to the argument and its quality, while maintaining the text’s naturalness. Based on the ICLEv3 corpus of argumentative learner essays, we create 40,089 argument instances for enthymeme detection and reconstruction. Through manual studies, we provide evidence that the proposed corpus creation process leads to the desired quality reduction and results in arguments that are similarly natural to those written by learners. Finally, first baseline approaches to enthymeme detection and reconstruction demonstrate the corpus’ usefulness.

2022

Argument Novelty and Validity Assessment via Multitask and Transfer Learning
Milad Alshomary | Maja Stahl
Proceedings of the 9th Workshop on Argument Mining

An argument is a constellation of premises reasoning towards a certain conclusion. The automatic generation of conclusions is becoming a very prominent task, raising the need for automatic measures to assess the quality of these generated conclusions. The shared task at the 9th Workshop on Argument Mining proposes a new task to assess the novelty and validity of a conclusion given a set of premises. In this paper, we present a multitask learning approach that transfers the knowledge learned from the natural language inference task to the tasks at hand. Evaluation results indicate the importance of both knowledge transfer and joint learning, placing our approach in fifth place with strong results compared to baselines.

To Prefer or to Choose? Generating Agency and Power Counterfactuals Jointly for Gender Bias Mitigation
Maja Stahl | Maximilian Spliethöver | Henning Wachsmuth
Proceedings of the Fifth Workshop on Natural Language Processing and Computational Social Science (NLP+CSS)

Gender bias may emerge from an unequal representation of agency and power, for example, by portraying women frequently as passive and powerless (“She accepted her future”) and men as proactive and powerful (“He chose his future”). When language models learn from respective texts, they may reproduce or even amplify the bias. An effective way to mitigate bias is to generate counterfactual sentences with agency and power opposite to those in the training data. Recent work targeted agency-specific verbs from a lexicon to this end. We argue that this is insufficient, due to the interaction of agency and power and their dependence on context. In this paper, we thus develop a new rewriting model that identifies verbs with the desired agency and power in the context of the given sentence. The verbs’ probability is then boosted to encourage the model to rewrite both connotations jointly. According to automatic metrics, our model effectively controls for power while being competitive in agency to the state of the art. In our main evaluation, human annotators favored its counterfactuals in terms of both connotations, also deeming its meaning preservation better.