Joonghyuk Hahn


2026

Language models (LMs) have shown impressive reasoning capabilities across various domains. A fundamental question is the extent of their reasoning power. While recent studies show that LMs can solve NP-complete problems, their ability to handle PSPACE-complete problems remains underexplored. We investigate regex minimization as a PSPACE-complete challenge for LMs to address this issue. Regexes, formal expressions for regular languages widely used in NLP, software engineering (SE), and programming language (PL), are supported by several efficient tools for their manipulation grounded in theoretical backgrounds. Inspired by this, we introduce the first benchmark for regex minimization containing over a million regexes paired with their minimal equivalents. Through extensive experiments with two LMs trained on our dataset and five open-source large language models (LLMs), we analyze how well LMs perform on PSPACE-complete problems, highlighting their capabilities of generalization and limitations in reasoning. To the best of our knowledge, this is the first study to systematically evaluate LM reasoning in regex minimization and establish a foundation for solving PSPACE-complete problems with LMs. Our code is available at https://github.com/hyundong98/RegexPSPACE.
Regular expressions (regexes) are foundational to modern computing for criticaltasks like input validation and data parsing, yet their ubiquity exposes systemsto regular expression denial of service (ReDoS), a vulnerability requiringautomated repair methods. Current approaches, however, are hampered by atrade-off. Symbolic, rule-based systems are precise but fail to repair unseen orcomplex vulnerability patterns. Conversely, large language models (LLMs) possessthe necessary generalizability but are unreliable for tasks demanding strictsyntactic and semantic correctness. We resolve this impasse by introducing ahybrid framework, localized regex repair (LRR), designed to harness LLMgeneralization while enforcing reliability. Our core insight is to decoupleproblem identification from the repair process. First, a deterministic, symbolicmodule localizes the precise vulnerable subpattern, creating a constrained andtractable problem space. Then, the LLM is invoked to generate a semanticallyequivalent fix for this isolated segment. This combined architecturesuccessfully resolves complex repair cases intractable for rule-based repairwhile avoiding the semantic errors of LLM-only approaches. Our work provides avalidated methodology for solving such problems in automated repair, improvingthe repair rate by 15.4%p over the state-of-the-art.
While large language models (LLMs) excel at generating structured data, such as code, their ability to precisely manipulate it based on instructions remains relatively under-explored. Regular expressions (regexes), critical in practice, are challenging to manipulate. Crucially, the correctness of transformations can be mathematically verified, making them exceptionally well-suited for measuring the symbolic reasoning of LLMs. We introduce Query4Regex, a new benchmark for evaluating verifiable transformations on regexes. Our benchmark tests two query formats: natural language instructions and a program-like domain-specific language (DSL) that specifies the sequence of operations. We evaluate a range of LLMs, verifying semantic correctness through rigorous deterministic finite automata (DFA) equivalence testing. Our empirical studies reveal: 1) the formal DSL significantly outperforms natural language, achieving up to 6.74%p accuracy gains on average. 2) Performance for both formats degrades sharply as compositional complexity increases, highlighting a core challenge in multi-step reasoning. 3) Models often generate plausible but unparsable outputs. Even among parsable outputs, semantic errors remain common, making failures difficult to detect without formal verification. Query4Regex provides a robust framework for analyzing the gap between LLMs’ linguistic fluency and their symbolic reasoning, paving the way for more reliable and verifiable manipulation of formal languages. Our code is available at https://github.com/peer0/Query4Regex.

2025

Implicit hate speech detection is challenging due to its subtlety and reliance on contextual interpretation rather than explicit offensive words. Current approaches rely on contrastive learning, which are shown to be effective on distinguishing hate and non-hate sentences. Humans, however, detect implicit hate speech by first identifying specific targets within the text and subsequently interpreting how these target relate to their surrounding context. Motivated by this reasoning process, we propose AmpleHate, a novel approach designed to mirror human inference for implicit hate detection. AmpleHate identifies explicit target using a pretrained Named Entity Recognition model and capture implicit target information via [CLS] tokens. It computes attention-based relationships between explicit, implicit targets and sentence context and then, directly injects these relational vectors into the final sentence representation. This amplifies the critical signals of target-context relations for determining implicit hate. Experiments demonstrate that AmpleHate achieves state-of-the-art performance, outperforming contrastive learning baselines by an average of 82.14% and achieve faster convergence. Qualitative analyses further reveal that attention patterns produced by AmpleHate closely align with human judgement, underscoring its interpretability and robustness.
Reasoning ability of large language models (LLMs) is a crucial ability,especially in complex decision-making tasks. One significant task to show LLMs’reasoning capability is code time complexity prediction, which involves variousintricate factors such as the input range of variables and conditional loops.Current benchmarks fall short of providing a rigorous assessment due to limiteddata, language constraints, and insufficient labeling. They do not consider timecomplexity based on input representation and merely evaluate whether predictionsfall into the same class, lacking a measure of how close incorrect predictionsare to the correct ones.To address these dependencies, we introduce CodeComplex, the first robust andextensive dataset designed to evaluate LLMs’ reasoning abilities in predictingcode time complexity. CodeComplex comprises 4,900 Java codes and an equivalentnumber of Python codes, overcoming language and labeling constraints, carefullyannotated with complexity labels based on input characteristics by a panel ofalgorithmic experts. Additionally, we propose specialized evaluation metrics forthe reasoning of complexity prediction tasks, offering a more precise andreliable assessment of LLMs’ reasoning capabilities. We release our dataset andbaseline models publicly to encourage the relevant (NLP, SE, and PL) communitiesto utilize and participate in this research. Our code and data are available athttps://github.com/sybaik1/CodeComplex.

2024

In few-shot text classification, self-training is a popular tool in semi-supervised learning (SSL). It relies on pseudo-labels to expand data, which has demonstrated success. However, these pseudo-labels contain potential noise and provoke a risk of underfitting the decision boundary. While the pseudo-labeled data can indeed be noisy, fully acquiring this flawed data can result in the accumulation of further noise and eventually impacting the model performance. Consequently, self-training presents a challenge: mitigating the accumulation of noise in the pseudo-labels. Confronting this challenge, we introduce superficial learning, inspired by pedagogy’s focus on essential knowledge. Superficial learning in pedagogy is a learning scheme that only learns the material ‘at some extent’, not fully understanding the material. This approach is usually avoided in education but counter-intuitively in our context, we employ superficial learning to acquire only the necessary context from noisy data, effectively avoiding the noise. This concept serves as the foundation for SuperST, our self-training framework. SuperST applies superficial learning to the noisy data and fine-tuning to the less noisy data, creating an efficient learning cycle that prevents overfitting to the noise and spans the decision boundary effectively. Notably, SuperST improves the classifier accuracy for few-shot text classification by 18.5% at most and 8% in average, compared with the state-of-the-art SSL baselines. We substantiate our claim through empirical experiments and decision boundary analysis.

2023

Solving math word problems depends on how to articulate the problems, the lens through which models view human linguistic expressions. Real-world settings count on such a method even more due to the diverse practices of the same mathematical operations. Earlier works constrain available thinking processes by limited prediction strategies without considering their significance in acquiring mathematical knowledge. We introduce Attention-based THought Expansion Network Architecture (ATHENA) to tackle the challenges of real-world practices by mimicking human thought expansion mechanisms in the form of neural network propagation. A thought expansion recurrently generates the candidates carrying the thoughts of possible math expressions driven from the previous step and yields reasonable thoughts by selecting the valid pathways to the goal. Our experiments show that ATHENA achieves a new state-of-the-art stage toward the ideal model that is compelling in variant questions even when the informativeness in training examples is restricted.
Recent studies propose various data augmentation approaches to resolve the low-resource problem in natural language processing tasks. Data augmentation is a successful solution to this problem and recent strategies give variation on sentence structures to boost performance. However, these approaches can potentially lead to semantic errors and produce semantically noisy data due to the unregulated variation of sentence structures. In an effort to combat these semantic errors, we leverage slot information, the representation of the context of keywords from a sentence, and form a data augmentation strategy which we propose, called GDA. Our strategy employs algorithms that construct and manipulate rules of context-aware grammar, utilizing this slot information. The algorithms extract recurrent patterns by distinguishing words with slots and form the “rules of grammar”—a set of injective relations between a sentence’s semantics and its syntactical structure—to augment the dataset. The augmentation is done in an automated manner with the constructed rules and thus, GDA is explainable and reliable without any human intervention. We evaluate GDA with state-of-the-art data augmentation techniques, including those using pre-trained language models, and the result illustrates that GDA outperforms all other data augmentation methods by 19.38%. Extensive experiments show that GDA is an effective data augmentation strategy that incorporates word semantics for more accurate and diverse data.

2022

Recent research on code summarization relies on the structural information from the abstract syntax tree (AST) of source codes. It is, however, questionable whether it is the most effective to use AST for expressing the structural information. We find that a program dependency graph (PDG) can represent the structure of a code more effectively. We propose PDG Boosting Module (PBM) that encodes PDG into graph embedding and the framework to implement the proposed PBM with the existing models. PBM achieves improvements of 6.67% (BLEU) and 7.47% (ROUGE) on average. We then analyze the experimental results, and examine how PBM helps the training of baseline models and its performance robustness. For the validation of robustness, we measure the performance of an out-of-domain benchmark dataset, and confirm its robustness. In addition, we apply a new evaluation measure, SBERT score, to evaluate the semantic performance. The models implemented with PBM improve the performance of SBERT score. This implies that they generate summaries that are semantically more similar to the reference summary.

2021

We tackle the problem of self-training networks for NLU in low-resource environment—few labeled data and lots of unlabeled data. The effectiveness of self-training is a result of increasing the amount of training data while training. Yet it becomes less effective in low-resource settings due to unreliable labels predicted by the teacher model on unlabeled data. Rules of grammar, which describe the grammatical structure of data, have been used in NLU for better explainability. We propose to use rules of grammar in self-training as a more reliable pseudo-labeling mechanism, especially when there are few labeled data. We design an effective algorithm that constructs and expands rules of grammar without human involvement. Then we integrate the constructed rules as a pseudo-labeling mechanism into self-training. There are two possible scenarios regarding data distribution: it is unknown or known in prior to training. We empirically demonstrate that our approach substantially outperforms the state-of-the-art methods in three benchmark datasets for both scenarios.