Jinu Lee


2025

Entailment-Preserving First-order Logic Representations in Natural Language Entailment
Jinu Lee | Qi Liu | Runzhi Ma | Vincent Han | Ziqi Wang | Heng Ji | Julia Hockenmaier
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

First-order logic (FOL) is often used to represent logical entailment, but determining natural language (NL) entailment using FOL remains a challenge. To address this, we propose the Entailment-Preserving FOL representations (EPF) task and introduce reference-free evaluation metrics for EPF (the Entailment-Preserving Rate (EPR) family). In EPF, the goal is to generate FOL representations from multi-premise NL entailment data (e.g., EntailmentBank) such that the automated prover’s results preserve the entailment labels. Furthermore, we propose a training method specialized for the task, iterative learning-to-rank, which trains an NL-to-FOL translator by using the natural language entailment labels as verifiable rewards. Our method achieves a 1.8–2.7% improvement in EPR and a 17.4–20.6% increase in EPR@16 compared to diverse baselines on three datasets. Further analyses reveal that iterative learning-to-rank effectively suppresses the arbitrariness of FOL representations by reducing the diversity of predicate signatures, and maintains strong performance across diverse inference types and out-of-domain data.
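
The abstract frames EPR as a reference-free metric: an example counts as entailment-preserving if an automated prover can derive the translated conclusion from the translated premises, and EPR@16 relaxes this to the best of 16 sampled translations. The following is a minimal sketch of that metric family, where translate(), sample_translations(), and prover_entails() are hypothetical wrappers around an NL-to-FOL translator and an off-the-shelf FOL prover, not the paper's actual interfaces.

```python
# Minimal sketch of the Entailment-Preserving Rate (EPR) metric family.
# translate(), sample_translations(), and prover_entails() are hypothetical
# helpers wrapping an NL-to-FOL translator and an automated FOL prover.

def epr(examples, translate, prover_entails):
    # EPR: fraction of examples whose translation lets the prover derive the
    # translated conclusion from the translated premises.
    preserved = 0
    for ex in examples:
        premises_fol = [translate(p) for p in ex["premises"]]
        conclusion_fol = translate(ex["conclusion"])
        if prover_entails(premises_fol, conclusion_fol):
            preserved += 1
    return preserved / len(examples)

def epr_at_k(examples, sample_translations, prover_entails, k=16):
    # EPR@k: an example counts as preserved if any of k sampled translations
    # of (premises, conclusion) yields a successful proof.
    preserved = 0
    for ex in examples:
        for premises_fol, conclusion_fol in sample_translations(ex, k):
            if prover_entails(premises_fol, conclusion_fol):
                preserved += 1
                break
    return preserved / len(examples)
```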

LegalSearchLM: Rethinking Legal Case Retrieval as Legal Elements Generation
Chaeeun Kim | Jinu Lee | Wonseok Hwang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Legal Case Retrieval (LCR), which retrieves cases relevant to a query case, is a fundamental task for legal professionals in research and decision-making. However, existing studies on LCR face two major limitations. First, they are evaluated on relatively small-scale retrieval corpora (e.g., 100–55K cases) and use a narrow range of criminal query types, which cannot sufficiently reflect the complexity of real-world legal retrieval scenarios. Second, their reliance on embedding-based or lexical matching methods often results in limited representations and legally irrelevant matches. To address these issues, we present: (1) LEGAR BENCH, the first large-scale Korean LCR benchmark, covering 411 diverse crime types in queries over 1.2M candidate cases; and (2) LegalSearchLM, a retrieval model that performs legal element reasoning over the query case and directly generates content containing those elements, grounded in the target cases through constrained decoding. Experimental results show that LegalSearchLM outperforms baselines by 6–20% on LEGAR BENCH, achieving state-of-the-art performance. It also demonstrates strong generalization to out-of-domain cases, outperforming naive generative models trained on in-domain data by 15%.
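
A common way to ground generation in a fixed candidate corpus, as the constrained decoding described above does, is to restrict each decoding step to tokens that keep the output a prefix of some candidate string, e.g., via a prefix trie as used in generative retrieval. The sketch below illustrates that general mechanism only; the specific constraint used by LegalSearchLM is not detailed in the abstract, and the function names are illustrative.

```python
# Sketch of corpus-grounded constrained decoding: generation may only continue
# with tokens that keep the output a prefix of some candidate-case string.
# A prefix trie over the tokenized candidates provides the constraint.

import math

def build_trie(token_sequences):
    trie = {}
    for seq in token_sequences:
        node = trie
        for tok in seq:
            node = node.setdefault(tok, {})
    return trie

def allowed_next_tokens(trie, prefix):
    node = trie
    for tok in prefix:
        node = node.get(tok)
        if node is None:
            return set()
    return set(node.keys())

def constrained_logits(logits, trie, prefix):
    # logits: per-token scores indexed by token id; disallowed tokens -> -inf.
    allowed = allowed_next_tokens(trie, prefix)
    return [s if tok_id in allowed else -math.inf
            for tok_id, s in enumerate(logits)]
```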

Evaluating Step-by-step Reasoning Traces: A Survey
Jinu Lee | Julia Hockenmaier
Findings of the Association for Computational Linguistics: EMNLP 2025

Step-by-step reasoning is widely used to enhance the reasoning ability of large language models (LLMs) in complex problems. Evaluating the quality of reasoning traces is crucial for understanding and improving LLM reasoning. However, existing evaluation practices are highly inconsistent, resulting in fragmented progress across evaluator design and benchmark development. To address this gap, this survey provides a comprehensive overview of step-by-step reasoning evaluation, proposing a taxonomy of evaluation criteria with four top-level categories (factuality, validity, coherence, and utility). Based on the taxonomy, we review different datasets, evaluator implementations, and recent findings, leading to promising directions for future research.

SymBa: Symbolic Backward Chaining for Structured Natural Language Reasoning
Jinu Lee | Wonseok Hwang
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

To improve the performance and explainability of LLM-based natural language reasoning, structured reasoning can be applied to generate explicitly structured proofs. Among different methods for structured reasoning, we specifically focus on backward chaining, where the proof goal is recursively decomposed into subgoals by searching and applying rules. We argue that current LLM-based backward chaining systems (e.g., Least-to-most prompting and LAMBADA) are incomplete, as they omit crucial algorithmic components identified from the classic backward chaining algorithm in computational logic (SLD resolution). To this end, we propose a novel backward chaining system, SymBa (Symbolic Backward Chaining), which integrates a symbolic solver and an LLM. In SymBa, the solver controls the proof process, and the LLM is only called when the solver requires new information to complete the proof. Owing to this completeness, SymBa achieves significant improvements on seven deductive, relational, and arithmetic reasoning benchmarks compared to the baselines.
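
A minimal propositional sketch of the solver-controlled loop the abstract describes: the backward chainer decomposes goals into subgoals and only consults the LLM when it gets stuck. Here ask_llm() is a hypothetical stand-in for a single LLM call that returns new facts and rules relevant to the current goal; unification and SymBa's actual prompting are omitted.

```python
# Propositional backward chaining with an LLM fallback (simplified sketch).
# facts: set of proven atoms; rules: list of (head, [body atoms]) pairs.
# ask_llm(goal) is a hypothetical call returning (new_facts, new_rules).

def backward_chain(goal, facts, rules, ask_llm, depth=8):
    if depth == 0:
        return False
    if goal in facts:
        return True
    # The symbolic solver drives the search: try every rule whose head
    # matches the goal and recursively prove its body.
    for head, body in rules:
        if head == goal and all(
                backward_chain(sub, facts, rules, ask_llm, depth - 1)
                for sub in body):
            return True
    # Solver is stuck on this goal: ask the LLM for missing knowledge.
    new_facts, new_rules = ask_llm(goal)
    if new_facts or new_rules:
        return backward_chain(goal, facts | set(new_facts),
                              rules + list(new_rules), ask_llm, depth - 1)
    return False

# e.g. backward_chain("q", {"p"}, [("q", ["p", "r"])],
#                     lambda g: ({"r"}, []) if g == "r" else (set(), []))
# returns True after one LLM call supplies the missing fact "r".
```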

2023

Learning to Rank Generation with Pairwise Partial Rewards
Youngwon Lee | Jinu Lee | Seung-won Hwang
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

This paper studies the use of reinforcement learning for conditional text generation, which overcomes the limitation of the prevalent supervised maximum likelihood estimation approach. However, it still suffers from challenges including the large action space and the delayed reward, as the reward can be computed only after an entire sequence is generated. To address these challenges, we propose a method that provides partial rewards for intermediate actions taken on partial sequences. This enables the model to promptly prioritize actions that lead to the generation of more desirable sequences. Our method’s key contribution lies in its focus on distinguishing relatively more desirable actions rather than striving to precisely estimate pointwise values for arbitrary partial sequences. Instead, our model learns to discern the relative desirability between pairs of actions, or rank actions in a pairwise manner, only when necessary and feasible. This is realized efficiently by leveraging a prefix tree constructed from the sampled sequences. Experimental results on paraphrase generation and constrained machine translation tasks showcase the effectiveness of our method.
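
A hedged sketch of how a prefix tree over sampled sequences can yield pairwise partial rewards: sibling next-tokens at each branching prefix are compared by the best final reward reachable through each branch. The function names are illustrative and the paper's actual ranking objective and credit assignment may differ.

```python
# Sketch: derive pairwise preferences over intermediate actions from a prefix
# tree built over sampled sequences and their final (delayed) rewards.

from collections import defaultdict

def best_reachable_reward(samples):
    # samples: list of (token_sequence, final_reward) pairs.
    best = defaultdict(lambda: float("-inf"))
    for seq, reward in samples:
        for i in range(len(seq) + 1):
            best[tuple(seq[:i])] = max(best[tuple(seq[:i])], reward)
    return best

def pairwise_preferences(samples):
    # At each branching prefix, rank sibling next-tokens by the best final
    # reward reachable through each branch and emit preference pairs.
    best = best_reachable_reward(samples)
    children = defaultdict(set)
    for seq, _ in samples:
        for i in range(len(seq)):
            children[tuple(seq[:i])].add(seq[i])
    prefs = []  # (prefix, preferred_next_token, dispreferred_next_token)
    for prefix, toks in children.items():
        ranked = sorted(toks, key=lambda t: best[prefix + (t,)], reverse=True)
        for better, worse in zip(ranked, ranked[1:]):
            if best[prefix + (better,)] > best[prefix + (worse,)]:
                prefs.append((prefix, better, worse))
    return prefs

# e.g. pairwise_preferences([(("a", "b"), 1.0), (("a", "c"), 0.2)])
# -> [(("a",), "b", "c")]
```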