Seung Jin Lee


2025

A Testset for Context-Aware LLM Translation in Korean-to-English Discourse Level Translation
Minjae Lee | Youngbin Noh | Seung Jin Lee
Proceedings of the 31st International Conference on Computational Linguistics

Large Language Models (LLMs) demonstrate remarkable performance in machine translation. Recent studies indicate that, for high-resource languages, LLMs surpass encoder-decoder neural machine translation (NMT) models. However, the evaluation datasets used in many LLM-based translation studies are often compromised by data leakage, and there is a shortage of demanding datasets that accurately gauge the potential and limitations of LLMs in human-like translation. This paper introduces a manually constructed Korean-English discourse-level corpus comprising 600 text instances featuring six linguistic phenomena: lexical ambiguity, zero anaphora, slang, idiom, figurative language, and implicature. Using this challenge test set, we investigated LLMs' Korean-to-English translation capability, particularly in cases requiring semantic inference based on inter-sentential context. The findings reveal that state-of-the-art LLMs, such as GPT-4o, still struggle with specific linguistic phenomena that are challenging for machine translation. Additionally, step-by-step prompting, such as Chain-of-Thought (CoT) prompting, significantly enhances the translation performance of LLMs compared to zero-shot prompting.
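The contrast between zero-shot and step-by-step prompting described above can be illustrated with a minimal sketch. This example assumes the OpenAI Python client and an OpenAI-style chat API; the model name and prompt wording are illustrative only and are not the exact prompts used in the paper.

```python
# Sketch of zero-shot vs. step-by-step (CoT-style) Korean-to-English translation prompts.
# Assumes the OpenAI Python client; prompts are illustrative, not the paper's.
from openai import OpenAI

client = OpenAI()


def translate_zero_shot(source_text: str) -> str:
    """Ask for a direct translation with no intermediate reasoning."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"Translate the following Korean text into English:\n\n{source_text}",
        }],
    )
    return resp.choices[0].message.content


def translate_cot(source_text: str) -> str:
    """Ask the model to resolve context-dependent phenomena before translating."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                "Translate the following Korean passage into English step by step.\n"
                "1. Identify ambiguous words, omitted (zero) subjects or objects, "
                "idioms, slang, figurative language, and implicatures.\n"
                "2. Resolve each of them using the surrounding sentences as context.\n"
                "3. Give the final English translation on the last line.\n\n"
                f"{source_text}"
            ),
        }],
    )
    return resp.choices[0].message.content
```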

LLM Agents at the Roundtable: A Multi-Perspective and Dialectical Reasoning Framework for Essay Scoring
Jinhee Jang | Ayoung Moon | Minkyoung Jung | YoungBin Kim | Seung Jin Lee
Findings of the Association for Computational Linguistics: EMNLP 2025

The emergence of large language models (LLMs) has brought a new paradigm to automated essay scoring (AES), a long-standing and practical application of natural language processing in education. However, achieving human-level multi-perspective understanding and judgment remains a challenge. In this work, we propose Roundtable Essay Scoring (RES), a multi-agent evaluation framework designed to perform precise and human-aligned scoring under a zero-shot setting. RES constructs evaluator agents based on LLMs, each tailored to a specific prompt and topic context. Each agent independently generates a trait-based rubric and conducts a multi-perspective evaluation. Then, by simulating a roundtable-style discussion, RES consolidates individual evaluations through a dialectical reasoning process to produce a final holistic score that more closely aligns with human evaluation. By enabling collaboration and consensus among agents with diverse evaluation perspectives, RES outperforms prior zero-shot AES approaches. Experiments on the ASAP dataset using ChatGPT and Claude show that RES achieves up to a 34.86% improvement in average QWK over straightforward prompting (Vanilla) methods.
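The roundtable procedure described above (independent trait-based evaluations followed by a dialectical consolidation step) can be sketched as follows. The function names, perspectives, and prompt wording are hypothetical and stand in for the paper's agent design; `call_llm` is a placeholder for any chat-completion backend such as ChatGPT or Claude.

```python
# Minimal sketch of a roundtable-style multi-agent scoring loop in the spirit of RES.
# Names, perspectives, and prompts are hypothetical, not the authors' implementation.

PERSPECTIVES = ["content and ideas", "organization", "language use"]


def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call (e.g., ChatGPT or Claude)."""
    raise NotImplementedError


def agent_evaluate(essay: str, topic: str, perspective: str) -> str:
    """One evaluator agent: derive a trait rubric, then score the essay against it."""
    rubric = call_llm(
        f"Write a short 1-6 scoring rubric for the trait '{perspective}' "
        f"of an essay responding to this prompt: {topic}"
    )
    return call_llm(
        f"Rubric:\n{rubric}\n\nEssay:\n{essay}\n\n"
        "Score the essay on a 1-6 scale and briefly justify the score."
    )


def roundtable_score(essay: str, topic: str) -> str:
    """Consolidate independent evaluations through a dialectical discussion step."""
    opinions = [agent_evaluate(essay, topic, p) for p in PERSPECTIVES]
    transcript = "\n\n".join(
        f"Evaluator {i + 1} ({p}):\n{o}"
        for i, (p, o) in enumerate(zip(PERSPECTIVES, opinions))
    )
    return call_llm(
        "The following evaluators scored the same essay from different perspectives.\n\n"
        f"{transcript}\n\n"
        "Discuss the points of disagreement, reconcile them, and output one final "
        "holistic score on a 1-6 scale."
    )
```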