Joon-Ho Lim
Also published as: Joon-ho Lim
2026
Beyond Accuracy: Alignment and Error Detection across Languages in the Bi-GSM8K Math-Teaching Benchmark
Jieun Park | KyungTae Lim | Joon-ho Lim
Findings of the Association for Computational Linguistics: EACL 2026
Recent advancements in LLMs have significantly improved mathematical problem-solving, with models like GPT-4 achieving human-level performance. However, proficiently solving mathematical problems differs fundamentally from effectively teaching mathematics. To bridge this gap, we introduce the Bi-GSM8K benchmark, a bilingual English-Korean dataset enriched with teacher solutions, student solutions, and annotations marking students’ initial errors. This dataset is designed to evaluate two core capabilities of LLMs: (1) measuring similarity between student and teacher solutions, and (2) identifying the initial error point in student solutions. Our method achieves high agreement with human judgments, with Pearson correlations of 0.89 and Spearman correlations of 0.88 on English, and 0.89 and 0.87, respectively, on Korean. It also offers significantly lower latency and resource usage than commercial APIs, demonstrating strong computational efficiency. In the error detection task, open-source models achieved approximately 86% accuracy, within 10 percentage points of commercial LLM APIs, suggesting strong practical potential. Our key contributions include the open-source release of Bi-GSM8K, novel evaluation metrics, and comparative analyses of LLM performance across languages.
2025
Can LLMs Truly Plan? A Comprehensive Evaluation of Planning Capabilities
Gayeon Jung | HyeonSeok Lim | Minjun Kim | Joon-ho Lim | KyungTae Lim | Hansaem Kim
Findings of the Association for Computational Linguistics: EMNLP 2025
Existing assessments of the planning capabilities of large language models (LLMs) remain largely limited to a single language or specific representation formats. To address this gap, we introduce the Multi-Plan benchmark, comprising 204 multilingual and multi-format travel planning scenarios. In experiments with state-of-the-art LLMs, the Multi-Plan benchmark effectively highlights performance disparities among models, notably showing superior results for reasoning-specialized models. Interestingly, language differences exhibited minimal impact, whereas mathematically structured representations significantly improved planning accuracy for most models, underscoring the crucial role of the input format. These findings enhance our understanding of the planning abilities of LLMs, offer valuable insights for future research, and emphasize the need for more sophisticated AI evaluation methods. The dataset is publicly available at http://huggingface.co/datasets/Bllossom/Multi-Plan.
2024
KULTURE Bench: A Benchmark for Assessing Language Model in Korean Cultural Context
Xiaonan Wang | Jinyoung Yeo | Joon-Ho Lim | Hansaem Kim
Proceedings of the 38th Pacific Asia Conference on Language, Information and Computation
2019
QE BERT: Bilingual BERT Using Multi-task Learning for Neural Quality Estimation
Hyun Kim | Joon-Ho Lim | Hyun-Ki Kim | Seung-Hoon Na
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)
For translation quality estimation at the word and sentence levels, this paper presents a novel approach based on BERT, which has recently achieved impressive results on various natural language processing tasks. Our proposed model re-purposes BERT for translation quality estimation and uses multi-task learning for the sentence-level task and word-level subtasks (i.e., source word, target word, and target gap). Experimental results on the WMT19 Quality Estimation shared task demonstrate that our systems achieve competitive results and provide significant improvements over the baseline.