Shoichi Naito

2025

LLM DEBATE OPPONENT : Counter-argument Generation focusing on Implicit and Critical Premises
Taisei Ozaki | Chihiro Nakagawa | Naoya Inoue | Shoichi Naito | Kenshi Yamaguchi
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)

Debate education fosters critical thinking skills but often incurs high human costs. Recent advancements in Large Language Models (LLMs) show promise in automating counter-argument generation. However, it remains unclear how best to guide LLMs to target both implicit and critical premises. In this study, we systematically compare multi-step and one-step generation methods for counter-arguments across 100 debate topics. Our findings reveal that one-step approaches consistently outperform multi-step pipelines, owing to their better grasp of the “motion spirit,” minimized propagation of hallucinations, and avoidance of challenging intermediate tasks. Among premise-targeting methods, a one-step strategy that accounts for both implicit and explicit premises—Generated and Targeted Premise Attack (GTG)—emerges as the strongest performer in expert and automated evaluations. These results highlight the value of direct, integrated prompts for leveraging LLMs in complex argumentation tasks and offer insights for developing more effective automated debate agents.

pdf bib abs

Counter-arguments (CAs) are a good means to improve the critical-thinking skills of learners, especially given that one has to thoroughly consider the logic of initial arguments (IA) when composing their CA. Although several tasks have been created for identifying the logical structure of CAs, no prior work has focused on capturing multiple interpretations of logical structures due to their complexity. In this work, we create CALSA+, a dataset consisting of 134 CAs annotated with 13 logical predicate questions. CALSA+ contains 1,742 instances annotated by 3 expert annotators (5,226 total annotations) with good agreement (Krippendorff 𝛼=0.46). Using CALSA+, we train a model with Reinforcement Learning with Verifiable Rewards (RLVR) to identify multiple logical interpretations and show that models trained with RLVR can perform on par with much bigger proprietary models. Our work is the first to attempt to annotate all the interpretations of logical structure on top of CAs. We publicly release our dataset to facilitate research in CA logical structure identification.

pdf bib abs

FOCUS: A Benchmark for Targeted Socratic Question Generation via Source-Span Grounding
Surawat Pothong | Machi Shimmei | Naoya Inoue | Paul Reisert | Ana Brassard | Wenzhi Wang | Shoichi Naito | Jungmin Choi | Kentaro Inui
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

We present FOCUS, a benchmark and task setting for Socratic question generation that delivers more informative and targeted feedback to learners. Unlike prior datasets, which rely on broad typologies and lack grounding in the source text, FOCUS introduces a new formulation: each Socratic question is paired with a fine-grained, 11-type typology and an explicit source span from the argument it targets. This design supports clearer, more actionable feedback and facilitates interpretable model evaluation. FOCUS includes 440 annotated instances with moderate partial-match agreement, establishing it as a reliable benchmark. Baseline experiments with representative state-of-the-art models reveal, through detailed error analysis, that even strong models struggle with span selection and context-sensitive categories. An extension study on the LogicClimate dataset further confirms the generalizability of the task and annotation framework. FOCUS sets a new standard for pedagogically grounded and informative Socratic question generation.

2024

pdf bib

pdf bib abs

Prior research in computational argumentation has mainly focused on scoring the quality of arguments, with less attention on explicating logical errors. In this work, we introduce four sets of explainable templates for common informal logical fallacies designed to explicate a fallacy’s implicit logic. Using our templates, we conduct an annotation study on top of 400 fallacious arguments taken from LOGIC dataset and achieve a high agreement score (Krippendorf’s 𝛼 of 0.54) and reasonable coverage 83%. Finally, we conduct an experiment for detecting the structure of fallacies and discover that state-of-the-art language models struggle with detecting fallacy templates (0.47 accuracy). To facilitate research on fallacies, we make our dataset and guidelines publicly available.

2023

pdf bib abs

The use of argumentation in education has shown improvement in students’ critical thinking skills, and computational models for argumentation have been developed to further assist this process. Although these models are useful for evaluating the quality of an argument, they often cannot explain why a particular argument score was predicted, i.e., why the argument is good or bad, which makes it difficult to provide constructive feedback to users, e.g., students, so that they can strengthen their critical thinking skills. In this survey, we explore current NLP feedback systems by categorizing each into four important dimensions of feedback (Richness, Visualization, Interactivity and Personalization). We discuss limitations for each dimension and provide suggestions to enhance the power of feedback and explanations to ultimately improve user critical thinking skills.

2022

pdf bib abs

LPAttack: A Feasible Annotation Scheme for Capturing Logic Pattern of Attacks in Arguments
Farjana Sultana Mim | Naoya Inoue | Shoichi Naito | Keshav Singh | Kentaro Inui
Proceedings of the Thirteenth Language Resources and Evaluation Conference

In argumentative discourse, persuasion is often achieved by refuting or attacking others’ arguments. Attacking an argument is not always straightforward and often consists of complex rhetorical moves in which arguers may agree with a logic of an argument while attacking another logic. Furthermore, an arguer may neither deny nor agree with any logics of an argument, instead ignore them and attack the main stance of the argument by providing new logics and presupposing that the new logics have more value or importance than the logics presented in the attacked argument. However, there are no studies in computational argumentation that capture such complex rhetorical moves in attacks or the presuppositions or value judgments in them. To address this gap, we introduce LPAttack, a novel annotation scheme that captures the common modes and complex rhetorical moves in attacks along with the implicit presuppositions and value judgments. Our annotation study shows moderate inter-annotator agreement, indicating that human annotation for the proposed scheme is feasible. We publicly release our annotated corpus and the annotation guidelines.

pdf bib abs

Providing feedback on the argumentation of the learner is essential for developing critical thinking skills, however, it requires a lot of time and effort. To mitigate the overload on teachers, we aim to automate a process of providing feedback, especially giving diagnostic comments which point out the weaknesses inherent in the argumentation. It is recommended to give specific diagnostic comments so that learners can recognize the diagnosis without misinterpretation. However, it is not obvious how the task of providing specific diagnostic comments should be formulated. We present a formulation of the task as template selection and slot filling to make an automatic evaluation easier and the behavior of the model more tractable. The key to the formulation is the possibility of creating a template set that is sufficient for practical use. In this paper, we define three criteria that a template set should satisfy: expressiveness, informativeness, and uniqueness, and verify the feasibility of creating a template set that satisfies these criteria as a first trial. We will show that it is feasible through an annotation study that converts diagnostic comments given in a text to a template format. The corpus used in the annotation study is publicly available.

pdf bib abs

IRAC: A Domain-Specific Annotated Corpus of Implicit Reasoning in Arguments
Keshav Singh | Naoya Inoue | Farjana Sultana Mim | Shoichi Naito | Kentaro Inui
Proceedings of the Thirteenth Language Resources and Evaluation Conference

The task of implicit reasoning generation aims to help machines understand arguments by inferring plausible reasonings (usually implicit) between argumentative texts. While this task is easy for humans, machines still struggle to make such inferences and deduce the underlying reasoning. To solve this problem, we hypothesize that as human reasoning is guided by innate collection of domain-specific knowledge, it might be beneficial to create such a domain-specific corpus for machines. As a starting point, we create the first domain-specific resource of implicit reasonings annotated for a wide range of arguments, which can be leveraged to empower machines with better implicit reasoning generation ability. We carefully design an annotation framework to collect them on a large scale through crowdsourcing and show the feasibility of creating a such a corpus at a reasonable cost and high-quality. Our experiments indicate that models trained with domain-specific implicit reasonings significantly outperform domain-general models in both automatic and human evaluations. To facilitate further research towards implicit reasoning generation in arguments, we present an in-depth analysis of our corpus and crowdsourcing methodology, and release our materials (i.e., crowdsourcing guidelines and domain-specific resource of implicit reasonings).

2021

pdf bib abs

Exploring Methodologies for Collecting High-Quality Implicit Reasoning in Arguments
Keshav Singh | Farjana Sultana Mim | Naoya Inoue | Shoichi Naito | Kentaro Inui
Proceedings of the 8th Workshop on Argument Mining

Annotation of implicit reasoning (i.e., warrant) in arguments is a critical resource to train models in gaining deeper understanding and correct interpretation of arguments. However, warrants are usually annotated in unstructured form, having no restriction on their lexical structure which sometimes makes it difficult to interpret how warrants relate to any of the information given in claim and premise. Moreover, assessing and determining better warrants from the large variety of reasoning patterns of unstructured warrants becomes a formidable task. Therefore, in order to annotate warrants in a more interpretative and restrictive way, we propose two methodologies to annotate warrants in a semi-structured form. To the best of our knowledge, we are the first to show how such semi-structured warrants can be annotated on a large scale via crowdsourcing. We demonstrate through extensive quality evaluation that our methodologies enable collecting better quality warrants in comparison to unstructured annotations. To further facilitate research towards the task of explicating warrants in arguments, we release our materials publicly (i.e., crowdsourcing guidelines and collected warrants).