Hyuk Namgoong


2026

Domain-specific Named Entity Recognition (NER) often requires data augmentation due to the scarcity of annotated corpora. Guidance Data Augmentation (GDA), a method that uses Large Language Models (LLMs) to decompose sentences into abstract components, can lead to over-abstraction, resulting in undefined entity tags and sentences lacking domain-specific vocabulary. In this work, we propose Reflective GDA (R-GDA), a framework that introduces a multi-agent feedback loop to enhance augmentation quality. R-GDA incorporates two distinct agents: a **Guidance Refiner (GR)**, which assesses the initial abstraction to prevent over-generalization, and an **Augmentation Calibrator (AC)**, which validates the final generated sample for domain fidelity and tag integrity. On the SciERC and NCBI-disease datasets, R-GDA improves F1 score, validating its effectiveness; it also achieves a lower BERTScore in most cases, indicating greater sentence diversity. On the FIN dataset, it performs comparably to the GDA baseline. R-GDA consistently prevents errors involving domain-specific tags, demonstrating that the reflective feedback mechanism enhances data fidelity by mitigating critical generation errors.
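A minimal sketch of how such a reflective loop might be wired together, assuming a generic `call_llm` stub in place of a real LLM client; the prompts and control flow are illustrative, not the paper's exact implementation:

```python
# Minimal sketch of the R-GDA feedback loop. `call_llm` is a stub for
# any chat-completion client; prompts and control flow are illustrative.

def call_llm(prompt: str) -> str:
    """Placeholder LLM call; replace with a real API client."""
    return f"<llm output for: {prompt[:40]}...>"

def guidance_refiner(abstraction: str, entity_tags: list[str]) -> str:
    """GR agent: checks the abstraction for over-generalization and
    restricts entity tags to the dataset's defined tag set."""
    return call_llm(
        "Review this abstraction for over-abstraction. "
        f"Use only the tags {entity_tags}.\n{abstraction}"
    )

def augmentation_calibrator(sample: str, domain: str) -> bool:
    """AC agent: accepts a sample only if it keeps domain-specific
    vocabulary and valid tags."""
    verdict = call_llm(
        f"Does this {domain} NER sample use valid tags and in-domain "
        f"vocabulary? Answer yes or no.\n{sample}"
    )
    return "yes" in verdict.lower()

def r_gda(sentence: str, entity_tags: list[str], domain: str,
          max_rounds: int = 3) -> str:
    """Generate one augmented sample, retrying until the AC accepts it."""
    abstraction = call_llm(f"Decompose into abstract components:\n{sentence}")
    sample = sentence
    for _ in range(max_rounds):
        guidance = guidance_refiner(abstraction, entity_tags)
        sample = call_llm(f"Generate a tagged sentence from:\n{guidance}")
        if augmentation_calibrator(sample, domain):
            break
    return sample
```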
Large Language Models (LLMs) face challenges in integrating linguistic and spatial reasoning, which limits their performance on geometry problems. While prior work has attempted to bridge this gap using diagram parsers with multimodal models, a systematic comparison of how various auxiliary modalities and their combinations affect performance has been lacking. To address this, we present a systematic study of four auxiliary modalities—formal diagram facts (CDL), natural language representations (TCDL), diagram descriptions (DES), and image augmentations (IMG)—across a range of open- and closed-source multimodal LLMs. Our analysis reveals a compelling dichotomy in the effectiveness of these modalities. While formal representations like CDL and TCDL offer a modest performance lift, diagram descriptions (DES) cause a dramatic split: they significantly boost the accuracy of open-source LLMs, which often struggle with visual parsing, while frequently misleading more capable closed-source models and causing a performance drop. This highlights a critical trade-off between augmenting input with helpful information and introducing misleading noise, demonstrating that the efficacy of auxiliary modalities depends heavily on the inherent capabilities of the underlying model.
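A hedged sketch of how the modality combinations could be enumerated for such a study; the field names and example content below are assumptions, not the benchmark's actual schema:

```python
from itertools import combinations

# Hypothetical prompt assembly for the four auxiliary modalities
# (CDL, TCDL, DES, IMG); field names and example content are assumptions.

MODALITIES = ("CDL", "TCDL", "DES", "IMG")

def build_prompt(problem: str, extras: dict[str, str],
                 selected: tuple[str, ...]) -> str:
    """Append the selected text modalities to the base problem statement."""
    parts = [f"Problem: {problem}"]
    for name in selected:
        if name in extras:  # IMG is attached as an image, not as text
            parts.append(f"[{name}]\n{extras[name]}")
    return "\n\n".join(parts)

problem = "In triangle ABC, AB = AC and angle A = 40 degrees. Find angle B."
extras = {
    "CDL": "Equal(LengthOf(Line(A,B)), LengthOf(Line(A,C)))",
    "TCDL": "Triangle ABC is isosceles, with side AB equal to side AC.",
    "DES": "A triangle with apex A at the top and base BC at the bottom.",
}

# Enumerate every modality subset to compare its effect per model.
for k in range(len(MODALITIES) + 1):
    for subset in combinations(MODALITIES, k):
        prompt = build_prompt(problem, extras, subset)
        # send `prompt` (plus the diagram image when "IMG" is in subset)
        # to each open- and closed-source model under test
```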

2025

Many statistical facts are conveyed through charts. While various methods have emerged for chart understanding, chart generation typically requires users to manually supply code, intent, and other parameters to obtain charts in the desired format from chart-generation tools. Recently, the advent of image-generating Large Language Models has facilitated chart generation; however, even this process often requires users to provide numerous constraints to obtain accurate results. In this paper, we propose a loop-based framework for automatically evolving charts in a multi-agent environment. Within this framework, three distinct agents—Chart Code Generator, Chart Replier, and Chart Quality Evaluator—collaborate for iterative, user-tailored chart generation using large language models. Our approach improves performance by up to 29.97% compared to the first generation, while also reducing generation time by up to 86.9% compared to manual prompt-based methods, showcasing the effectiveness of this multi-agent collaboration in enhancing the quality and efficiency of chart generation.
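One plausible shape for such a loop, with stubbed agents; the prompts, 0-10 scoring scale, and stopping rule below are assumptions rather than the framework's actual interface:

```python
# Stubbed three-agent chart-evolution loop; prompts, scoring scale, and
# stopping rule are assumptions.

def call_llm(prompt: str) -> str:
    """Placeholder LLM call; replace with a real API client."""
    return "..."

def chart_code_generator(intent: str, feedback: str = "") -> str:
    """Produce or revise plotting code for the user's intent."""
    return call_llm(f"Write chart code for: {intent}\nFeedback: {feedback}")

def chart_replier(code: str) -> str:
    """Describe what the rendered chart actually shows."""
    return call_llm(f"Describe the chart produced by this code:\n{code}")

def chart_quality_evaluator(intent: str, description: str) -> tuple[float, str]:
    """Score the chart against the intent and return a critique."""
    critique = call_llm(
        f"Score 0-10 and critique.\nIntent: {intent}\nChart: {description}"
    )
    score = 0.0  # a real system would parse the numeric score from `critique`
    return score, critique

def evolve_chart(intent: str, rounds: int = 5, target: float = 8.0) -> str:
    """Iterate generate -> describe -> evaluate until the chart is good enough."""
    code, feedback = "", ""
    for _ in range(rounds):
        code = chart_code_generator(intent, feedback)
        score, feedback = chart_quality_evaluator(intent, chart_replier(code))
        if score >= target:
            break
    return code
```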
Recent advances in large language models (LLMs) have significantly improved mathematical problem-solving. Among various techniques, paraphrasing problem statements has emerged as a promising strategy to enhance model understanding and accuracy. We define twelve paraphrasing types grounded in mathematics education theory and analyze their impact on LLM performance across different configurations. To automate selection, we propose a Paraphrase Type Selector that predicts effective paraphrases for each problem. Experiments on MATH-500, SVAMP, and AIME show consistent performance gains from paraphrased problems. On MATH-500 with LLaMA 3.1-8B, combining the original with the best five paraphrased problems improves accuracy by +8.4%, and the selector achieves an additional +1.33% gain.
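A minimal sketch of how the selector's predictions might be combined with the original problem; majority voting over answers is one plausible aggregation rule, and the type names, scorer, and solver below are placeholders:

```python
from collections import Counter

# Placeholder type names, scorer, and solver; majority voting over the
# original plus top-k paraphrases is one plausible aggregation rule.

PARAPHRASE_TYPES = [
    "restate_given", "simplify_vocabulary", "decompose_steps",
    # ... the paper defines twelve types in total
]

def selector_scores(problem: str) -> dict[str, float]:
    """Stand-in for the Paraphrase Type Selector's predicted gains."""
    return {t: 0.0 for t in PARAPHRASE_TYPES}  # replace with a trained model

def paraphrase(problem: str, ptype: str) -> str:
    return f"[{ptype}] {problem}"  # replace with an LLM rewrite

def solve(problem: str) -> str:
    return "42"  # replace with the solver LLM's final answer

def answer_with_paraphrases(problem: str, k: int = 5) -> str:
    """Solve the original plus the k best paraphrases, then majority-vote."""
    scores = selector_scores(problem)
    top_types = sorted(scores, key=scores.get, reverse=True)[:k]
    variants = [problem] + [paraphrase(problem, t) for t in top_types]
    answers = [solve(v) for v in variants]
    return Counter(answers).most_common(1)[0][0]
```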

2024

Recent advancements in large language models have relied heavily on large reward models trained via reinforcement learning from human feedback for fine-tuning. However, a single reward model used across various domains may not always be optimal, and it often requires retraining from scratch when new domain data is introduced. To address these challenges, we explore small language models operating in a domain-specific manner based on router mechanisms. Our three approaches are: 1) using a mixture of experts to form a single reward model by modularizing an internal router and experts, 2) employing an external router to select the appropriate reward model from multiple domain-specific models, and 3) reducing parameter count by loading reward-model and router adapters onto a single small language model. Experimental validation underscores the effectiveness of our approach, demonstrating performance comparable to baseline methods while also reducing the total parameter size.
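A sketch of the second approach (an external router choosing among domain-specific reward models), under the assumption of embedding-similarity routing; the domains, encoder, and reward models below are placeholders:

```python
# Sketch of the external-router approach: route the prompt to the
# closest domain, then score with that domain's reward model.
# The domains, encoder, and reward models are placeholders.

def embed(text: str) -> list[float]:
    """Stand-in encoder; replace with a real sentence embedder."""
    return [float(ord(c)) for c in text[:8]]

def cosine(a: list[float], b: list[float]) -> float:
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return num / den if den else 0.0

DOMAIN_CENTROIDS = {
    "medical": embed("clinical trial patient dosage"),
    "finance": embed("equity earnings portfolio risk"),
}

REWARD_MODELS = {
    "medical": lambda prompt, response: 0.0,  # replace with RM inference
    "finance": lambda prompt, response: 0.0,
}

def route_and_score(prompt: str, response: str) -> float:
    """Pick the closest domain, then score with that domain's reward model."""
    query = embed(prompt)
    domain = max(DOMAIN_CENTROIDS,
                 key=lambda d: cosine(query, DOMAIN_CENTROIDS[d]))
    return REWARD_MODELS[domain](prompt, response)
```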