Wenzhi Wang

2025

Counter-arguments (CAs) are a good means to improve the critical-thinking skills of learners, especially given that one has to thoroughly consider the logic of the initial argument (IA) when composing a CA. Although several tasks have been created for identifying the logical structure of CAs, no prior work has focused on capturing multiple interpretations of logical structures due to their complexity. In this work, we create CALSA+, a dataset consisting of 134 CAs annotated with 13 logical predicate questions. CALSA+ contains 1,742 instances annotated by 3 expert annotators (5,226 total annotations) with good agreement (Krippendorff's 𝛼 = 0.46). Using CALSA+, we train a model with Reinforcement Learning with Verifiable Rewards (RLVR) to identify multiple logical interpretations and show that models trained with RLVR can perform on par with much bigger proprietary models. Our work is the first to attempt to annotate all the interpretations of logical structure on top of CAs. We publicly release our dataset to facilitate research in CA logical structure identification.

FOCUS: A Benchmark for Targeted Socratic Question Generation via Source-Span Grounding
Surawat Pothong | Machi Shimmei | Naoya Inoue | Paul Reisert | Ana Brassard | Wenzhi Wang | Shoichi Naito | Jungmin Choi | Kentaro Inui
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

We present FOCUS, a benchmark and task setting for Socratic question generation that delivers more informative and targeted feedback to learners. Unlike prior datasets, which rely on broad typologies and lack grounding in the source text, FOCUS introduces a new formulation: each Socratic question is paired with a fine-grained, 11-type typology and an explicit source span from the argument it targets. This design supports clearer, more actionable feedback and facilitates interpretable model evaluation. FOCUS includes 440 annotated instances with moderate partial-match agreement, establishing it as a reliable benchmark. Baseline experiments with representative state-of-the-art models reveal, through detailed error analysis, that even strong models struggle with span selection and context-sensitive categories. An extension study on the LogicClimate dataset further confirms the generalizability of the task and annotation framework. FOCUS sets a new standard for pedagogically grounded and informative Socratic question generation.

2024

Prior research in computational argumentation has mainly focused on scoring the quality of arguments, with less attention on explicating logical errors. In this work, we introduce four sets of explainable templates for common informal logical fallacies designed to explicate a fallacy’s implicit logic. Using our templates, we conduct an annotation study on top of 400 fallacious arguments taken from the LOGIC dataset and achieve a high agreement score (Krippendorff’s 𝛼 of 0.54) and reasonable coverage (83%). Finally, we conduct an experiment for detecting the structure of fallacies and discover that state-of-the-art language models struggle with detecting fallacy templates (0.47 accuracy). To facilitate research on fallacies, we make our dataset and guidelines publicly available.

2023

The use of argumentation in education has been shown to improve students’ critical thinking skills, and computational models for argumentation have been developed to further assist this process. Although these models are useful for evaluating the quality of an argument, they often cannot explain why a particular argument score was predicted, i.e., why the argument is good or bad, which makes it difficult to provide constructive feedback to users, e.g., students, so that they can strengthen their critical thinking skills. In this survey, we explore current NLP feedback systems by categorizing each along four important dimensions of feedback (Richness, Visualization, Interactivity and Personalization). We discuss limitations for each dimension and provide suggestions to enhance the power of feedback and explanations to ultimately improve users' critical thinking skills.

Co-authors

Farjana Sultana Mim 1

Keshav Singh 1

Kenshi Yamaguchi 1

Venues

WS 1
