An-Zi Yen
2026
ESG-KG: A Multi-modal Knowledge Graph System for Automated Compliance Assessment
Li-Yang Chang | Chih-Ming Chen | Hen-Hsen Huang | Ming-Feng Tsai | An-Zi Yen | Chuan-Ju Wang
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 3: System Demonstrations)
Our system is built upon a multi-modal information extraction pipeline designed to process and interpret corporate sustainability reports. This integrated framework systematically handles diverse data formats—including text, tables, figures, and infographics—to extract, structure, and evaluate ESG-related content. The extracted multi-modal data is subsequently formalized into a structured knowledge graph (KG), which serves both as a semantic framework for representing entities, relationships, and metrics relevant to ESG domains, and as the foundational infrastructure for the automated compliance system. This KG enables high-precision retrieval of information across multiple source formats and reporting modalities. The trustworthy, context-rich representations provided by the knowledge graph establish a verifiable evidence base, creating a critical foundation for reliable retrieval-augmented generation (RAG) and for the subsequent LLM-based scoring and analysis in the automated ESG compliance system.
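As an illustrative sketch (not the authors' implementation), retrieval over such a knowledge graph can be reduced to pattern matching over (subject, relation, object) triples; every entity, relation, and value below is hypothetical:

```python
# Minimal triple-store sketch; entity/relation names are hypothetical,
# not taken from the ESG-KG system.
KG = {
    ("AcmeCorp", "reports_metric", "Scope1Emissions"),
    ("Scope1Emissions", "value_2025", "12000 tCO2e"),
    ("AcmeCorp", "reports_metric", "BoardDiversity"),
}

def query(subject=None, relation=None, obj=None):
    """Return all triples matching the given pattern; None is a wildcard."""
    return sorted(
        t for t in KG
        if (subject is None or t[0] == subject)
        and (relation is None or t[1] == relation)
        and (obj is None or t[2] == obj)
    )

print(query(subject="AcmeCorp", relation="reports_metric"))
```

In a RAG setting, the matched triples (rather than raw report pages) would be passed to the LLM as the evidence base for scoring.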
Confidence-Driven Multi-Scale Model Selection for Cost-Efficient Inference
Bo-Wei Chen | Chung-Chi Chen | An-Zi Yen
Findings of the Association for Computational Linguistics: EACL 2026
Large Language Models (LLMs) have revolutionized inference across diverse natural language tasks, with larger models performing better but at higher computational cost. We propose a confidence-driven strategy that dynamically selects the most suitable model based on confidence estimates. By assessing a model’s confidence in handling a task and the accuracy of its response, tasks that are likely to be solved correctly are retained by the smaller model, while more uncertain or complex cases are delegated to a larger model, ensuring reliability while minimizing computation. Specifically, we evaluate a model’s likelihood of knowing the correct answer and the probability that its response is accurate. Experiments on the Massive Multitask Language Understanding (MMLU) benchmark show that our approach achieves accuracy comparable to the largest model while reducing computational costs by 20% to 40%. When applied to GPT-4o API calls, it reduces token usage by approximately 60%, further improving cost efficiency. These findings indicate the potential of confidence-based model selection to enhance real-world LLM deployment, particularly in resource-constrained settings such as edge devices and commercial API applications.
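The routing idea above can be sketched as a simple two-stage cascade. This is a hedged illustration, not the paper's actual confidence estimators: the threshold, the stub models, and their confidence values are all assumptions.

```python
def cascade_answer(question, small_model, large_model, threshold=0.8):
    """Answer with the small model when it is confident; otherwise
    escalate the query to the larger model."""
    answer, confidence = small_model(question)
    if confidence >= threshold:
        return answer, "small"   # confident: keep the cheap answer
    answer, _ = large_model(question)
    return answer, "large"       # uncertain: delegate to the big model

# Stub models standing in for LLM calls; each returns (answer, confidence).
small = lambda q: ("A", 0.95) if "easy" in q else ("B", 0.40)
large = lambda q: ("C", 0.99)

print(cascade_answer("an easy question", small, large))  # -> ('A', 'small')
print(cascade_answer("a hard question", small, large))   # -> ('C', 'large')
```

The cost saving comes from the fraction of queries the small model keeps; the threshold trades that fraction against the risk of keeping a wrong answer.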
MathEDU: Feedback Generation on Problem-Solving Processes for Mathematical Learning Support
Wei-Ling Hsu | Yu-Chien Tang | An-Zi Yen
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
The increasing reliance on Large Language Models (LLMs) across various domains extends to education, where students increasingly use generative AI as a tool for learning. While prior work has examined LLMs’ mathematical ability, their reliability in grading authentic student problem-solving processes and delivering effective feedback remains underexplored. This study introduces MathEDU, a dataset consisting of student problem-solving processes in mathematics and corresponding teacher-written feedback. We systematically evaluate the reliability of various models across three hierarchical tasks: answer correctness classification, error identification, and feedback generation. Experimental results show that fine-tuning strategies effectively improve performance in classifying correctness and locating erroneous steps. However, the feedback generated by these models shows a considerable gap from teacher-written feedback. Critically, the generated feedback is often verbose and fails to provide targeted explanations for the student’s underlying misconceptions. This emphasizes the urgent need for trustworthy and pedagogy-aware AI feedback in education.
2024
MAGIC: Multi-Argument Generation with Self-Refinement for Domain Generalization in Automatic Fact-Checking
Wei-Yu Kao | An-Zi Yen
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Numerous studies have been conducted on automatic fact-checking, driven by its importance in real-world applications. However, two challenges persist: (1) extracting pivotal evidence from extensive documents, and (2) verifying claims across diverse domains. On one hand, current retrieval methods are limited in their ability to concisely retrieve evidence, which results in poor performance. On the other hand, retrieved evidence derived from different sources strains the generalization capabilities of classifiers. This paper explores the task of cross-domain fact-checking and presents the XClaimCheck dataset, which consists of claims from multiple domains. We propose a framework featuring a multi-argument generation technique. We leverage multi-argument generation to reconstruct concise evidence from large amounts of evidence retrieved from different sources. In addition, a self-refinement mechanism is introduced to confirm that the generated arguments are consistent with the content of the evidence. Experimental results show that our proposed framework is effective in identifying the veracity of out-of-domain claims, particularly those that are partially true or false.
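A minimal sketch of a self-refinement loop of the kind described above, with stub functions standing in for the paper's LLM-based argument generator and consistency check (the function names and stub behaviour are assumptions for illustration):

```python
def refine_argument(claim, evidence, generate, is_consistent, max_rounds=3):
    """Generate an argument for a claim, then iteratively regenerate it
    until it passes a consistency check against the evidence."""
    argument = generate(claim, evidence)
    for _ in range(max_rounds):
        if is_consistent(argument, evidence):
            break                                  # argument is grounded
        argument = generate(claim, evidence, feedback=argument)
    return argument

# Stubs in place of LLM calls: the first draft is flawed, and the
# revision produced after feedback passes the consistency check.
def generate(claim, evidence, feedback=None):
    return "revised argument" if feedback else "draft argument"

def is_consistent(argument, evidence):
    return argument == "revised argument"

print(refine_argument("claim", "evidence", generate, is_consistent))
```

Bounding the loop with `max_rounds` keeps the cost predictable when the checker never accepts an argument.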
2023
ZARA: Improving Few-Shot Self-Rationalization for Small Language Models
Wei-Lin Chen | An-Zi Yen | Cheng-Kuang Wu | Hen-Hsen Huang | Hsin-Hsi Chen
Findings of the Association for Computational Linguistics: EMNLP 2023
Language models (LMs) that jointly generate end-task answers as well as free-text rationales are known as self-rationalization models. Recent works demonstrate great performance gains for self-rationalization by few-shot prompting LMs with rationale-augmented exemplars. However, the ability to benefit from explanations only emerges with large-scale LMs, which have poor accessibility. In this work, we explore the less-studied setting of leveraging explanations for small LMs to improve few-shot self-rationalization. We first revisit the relationship between rationales and answers. Inspired by the implicit mental process of how human beings assess explanations, we present a novel approach, Zero-shot Augmentation of Rationale-Answer pairs (ZARA), to automatically construct pseudo-parallel data for self-training by reducing the problem of plausibility judgement to natural language inference. Experimental results show ZARA achieves SOTA performance on the FEB benchmark for both task accuracy and the explanation metric. In addition, we conduct human and quantitative evaluations validating ZARA’s ability to automatically identify plausible and accurate rationale-answer pairs.
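The reduction of plausibility judgement to NLI can be sketched as filtering candidate rationale-answer pairs by an entailment score before self-training. The scorer below is a keyword stub standing in for an actual NLI model, and the threshold is an assumed value:

```python
def filter_pseudo_pairs(pairs, entail_prob, threshold=0.9):
    """Keep rationale-answer pairs whose rationale (premise) entails
    the answer (hypothesis) according to an NLI scorer."""
    return [(r, a) for r, a in pairs if entail_prob(r, a) >= threshold]

# Stub NLI scorer: returns an entailment probability for
# (premise, hypothesis). Illustrative only.
def entail_prob(premise, hypothesis):
    return 0.95 if hypothesis in premise else 0.10

pairs = [
    ("the sky scatters blue light, so the answer is blue", "blue"),
    ("an unrelated rationale", "red"),
]
print(filter_pseudo_pairs(pairs, entail_prob))
```

Only the pairs that survive the filter would be used as pseudo-parallel training data for the small LM.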
RSVP: Customer Intent Detection via Agent Response Contrastive and Generative Pre-Training
Yu-Chien Tang | Wei-Yao Wang | An-Zi Yen | Wen-Chih Peng
Findings of the Association for Computational Linguistics: EMNLP 2023
Dialogue systems for customer service have been developed with neural models to provide users with precise answers and round-the-clock support in task-oriented conversations by detecting customer intents from their utterances. Existing intent detection approaches rely heavily on adaptively pre-training language models with large-scale datasets, yet the substantial cost of data collection may limit their practicality. In addition, they neglect the information within the agents’ conversational responses, which are cheaper to collect yet highly indicative of customer intent, since agents must tailor their replies to it. In this paper, we propose RSVP, a self-supervised framework dedicated to task-oriented dialogues, which utilizes agent responses for pre-training in a two-stage manner. Specifically, we introduce two pre-training tasks to incorporate the relations of utterance-response pairs: 1) Response Retrieval, which selects the correct response from a batch of candidates, and 2) Response Generation, which mimics agents by generating the response to a given utterance. Our benchmark results on two real-world customer service datasets show that RSVP significantly outperforms state-of-the-art baselines by an average of 4.95% in accuracy, 3.4% in MRR@3, and 2.75% in MRR@5. Extensive case studies demonstrate the validity of incorporating agent responses into the pre-training stage.
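The Response Retrieval pre-training task resembles an in-batch contrastive objective: each utterance's positive is its paired agent response, and the other responses in the batch serve as negatives. A minimal InfoNCE-style sketch follows; this is an assumed formulation with an assumed temperature, not the paper's exact loss:

```python
import math

def response_retrieval_loss(utt_emb, resp_emb, temperature=0.1):
    """In-batch contrastive loss over utterance/response embedding pairs:
    row i of resp_emb is the positive for row i of utt_emb."""
    batch = len(utt_emb)
    total = 0.0
    for i in range(batch):
        # Dot-product similarity of utterance i to every response in the batch.
        logits = [sum(u * r for u, r in zip(utt_emb[i], resp_emb[j])) / temperature
                  for j in range(batch)]
        m = max(logits)                                    # numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]   # -log softmax prob of the true pair
    return total / batch

# Perfectly aligned pairs give a near-zero loss.
print(response_retrieval_loss([[1.0, 0.0], [0.0, 1.0]],
                              [[1.0, 0.0], [0.0, 1.0]]))
```

Swapping the response rows (so every positive is mismatched) drives the loss up, which is what pushes paired utterances and responses together during pre-training.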
LED: A Dataset for Life Event Extraction from Dialogs
Yi-Pei Chen | An-Zi Yen | Hen-Hsen Huang | Hideki Nakayama | Hsin-Hsi Chen
Findings of the Association for Computational Linguistics: EACL 2023
Lifelogging has gained increasing attention due to its wide applications, such as personalized recommendations and memory assistance, raising the issues of collecting and extracting personal life events. People often share their life experiences with others through conversations; however, extracting life events from conversations is rarely explored. In this paper, we present Life Event Dialog, a dataset containing fine-grained life event annotations on conversational data. In addition, we introduce a novel Conversational Life Event Extraction task and distinguish it from public event extraction and from life event extraction on other sources such as microblogs. We explore three information extraction (IE) frameworks to address the task: OpenIE, relation extraction, and event extraction, and establish a comprehensive empirical analysis of the three baselines. The results suggest that current event extraction models still struggle to extract life events from everyday human conversations. Our proposed Life Event Dialog dataset and in-depth analysis of IE frameworks will facilitate future research on life event extraction from conversations.
Three Questions Concerning the Use of Large Language Models to Facilitate Mathematics Learning
An-Zi Yen | Wei-Ling Hsu
Findings of the Association for Computational Linguistics: EMNLP 2023
Due to the remarkable language understanding and generation abilities of large language models (LLMs), their use in educational applications has been explored. However, little work has investigated the pedagogical ability of LLMs in helping students learn mathematics. In this position paper, we discuss the challenges of employing LLMs to enhance students’ mathematical problem-solving skills by providing adaptive feedback. Apart from generating incorrect reasoning processes, LLMs can misinterpret the meaning of a question, and they also have difficulty understanding a given question’s rationale when attempting to correct students’ answers. We formulate three research questions.
2022
Learning to Generate Explanation from e-Hospital Services for Medical Suggestion
Wei-Lin Chen | An-Zi Yen | Hen-Hsen Huang | Hsin-Hsi Chen
Proceedings of the 29th International Conference on Computational Linguistics
Explaining the reasoning of neural models has attracted attention in recent years. Providing highly accessible and comprehensible explanations in natural language helps humans understand a model’s prediction results. In this work, we present a pilot study investigating explanation generation with a narrative and causal structure for the scenario of health consulting. Our model generates a medical suggestion regarding the patient’s concern and provides an explanation as an outline of the reasoning. To align the generated explanation with the suggestion, we propose a novel discourse-aware mechanism with multi-task learning. Experimental results show that our model achieves promising performance in both quantitative and human evaluation.
SEEN: Structured Event Enhancement Network for Explainable Need Detection of Information Recall Assistance
You-En Lin | An-Zi Yen | Hen-Hsen Huang | Hsin-Hsi Chen
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
When recalling life experiences, people often forget or confuse life events, which necessitates information recall services. Previous work on information recall focuses on providing such assistance reactively, i.e., by retrieving the life event for a given query. Proactively detecting the need for information recall services is rarely discussed. In this paper, we use a human-annotated life experience retelling dataset to detect the right time to trigger the information recall service. We propose a pilot model—structured event enhancement network (SEEN)—that detects life event inconsistency, additional information in life events, and forgotten events. A fusing mechanism is also proposed to incorporate event graphs of stories and enhance the textual representations. To explain the need detection results, SEEN simultaneously provides supporting evidence by selecting the related nodes from the event graph. Experimental results show that SEEN achieves promising performance in detecting information needs. In addition, the extracted evidence can serve as complementary information to remind users of events they may want to recall.