Ahrii Kim

2026

Tokenization and Morphological Fidelity in Uralic NLP: A Cross-Lingual Evaluation
Nuo Xu | Ahrii Kim
Proceedings of the Second Workshop on Language Models for Low-Resource Languages (LoResLM 2026)

Subword tokenization critically affects Natural Language Processing (NLP) performance, yet its behavior in morphologically rich and low-resource language families remains under-explored. This study systematically compares three subword paradigms, Byte Pair Encoding (BPE), Overlap BPE (OBPE), and Unigram Language Model, across six Uralic languages with varying resource availability and typological diversity. Using part-of-speech (POS) tagging as a controlled downstream task, we show that OBPE consistently achieves stronger morphological alignment and higher tagging accuracy than conventional methods, particularly within the Latin-script group. These gains arise from reduced fragmentation in open-class categories and a better balance across the frequency spectrum. Transfer efficacy further depends on the downstream tagging architecture, interacting with both training volume and genealogical proximity. Taken together, these findings highlight that morphology-sensitive tokenization is not merely a preprocessing choice but a decisive factor in enabling effective cross-lingual transfer for agglutinative, low-resource languages.

2025

RUBRIC-MQM: Span-Level LLM-as-judge in Machine Translation For High-End Models
Ahrii Kim
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)

Referred to as LLM-as-judge, a generative large language model (LLM) has demonstrated considerable efficacy as an evaluator in various tasks, including Machine Translation (LAJ-MT), by predicting scores or identifying error types for individual sentences. However, its dependability in practical application has yet to be demonstrated, as there is only an approximated match due to the task’s open-ended nature. To address this problem, we introduce a straightforward and novel meta-evaluation strategy PromptCUE and evaluate cutting-edge LAJ-MT models such as GEMBA-MQM. We identify their fundamental deficits, including certain label biases and the inability to assess near-perfect translations. To improve reliability, we investigate more trustworthy and less biased models using multidimensional prompt engineering. Our findings indicate that the combination of span-level error quantification and a rubric-style prompt tailored to the characteristics of LLMs has efficiently addressed the majority of the challenges current LAJ-MT models face. Furthermore, it demonstrates a considerably enhanced alignment with human values. Accordingly, we present Rubric-MQM, the LAJ-MT for high-end models and an updated version of GEMBA-MQM.

Context Is Ubiquitous, but Rarely Changes Judgments: Revisiting Document-Level MT Evaluation
Ahrii Kim
Proceedings of the Tenth Conference on Machine Translation

As sentence-level performance in modern Machine Translation (MT) has plateaued, reliable document-level evaluation is increasingly needed. While the recent FALCON framework with pragmatic features offers a promising direction, its reliability and reproducibility are unclear. We address this gap through human evaluation, analyzing sources of low inter-annotator agreement and identifying key factors. Based on these findings, we introduce H-FALCON, a Human-centered refinement of FALCON. Our experiments show that, even with limited annotator consensus, FALCON achieves correlations comparable to or better than standard sentence-level protocols. Furthermore, we find that contextual information is inherent in all sentences, challenging the view that only some require it. This suggests that prior estimates such as “n% of sentences require context” may stem from methodological artifacts. At the same time, we show that while context is pervasive, not all of it directly influences human judgment.

Multi-agentMT: Deploying AI Agent in the WMT25 Shared Task
Ahrii Kim
Proceedings of the Tenth Conference on Machine Translation

We present Multi-agentMT, our system for the WMT25 General Shared Task. The model adopts Prompt Chaining, a multi-agent workflow combined with Rubric-MQM, an automatic MQM-based error annotation metric. Our primary submission follows a Translate–Postedit–Proofread pipeline, in which error positions are explicitly marked and iteratively refined. Results suggest that a semi-autonomous agent scheme for machine translation is feasible with a smaller, earlier-generation model in low-resource settings, achieving comparable quality at roughly half the cost of larger systems.

A Preliminary Study of AI Agent Model in Machine Translation
Ahrii Kim
Proceedings of the Tenth Conference on Machine Translation

We present IR_Multi-agentMT, our submission to the WMT25 General Shared Task. The system adopts an AI-agent paradigm implemented through a multi-agent workflow, Prompt Chaining, in combination with RUBRIC-MQM, an automatic MQM-based error annotation metric. Our primary configuration follows the Translate–Postedit–Proofread paradigm, where each stage progressively enhances translation quality. We conduct a preliminary study to investigate (i) the impact of initial translation quality and (ii) the effect of enforcing explicit responses from the Postedit Agent. Our findings highlight the importance of both factors in shaping the overall performance of multi-agent translation systems.

2022

Vacillating Human Correlation of SacreBLEU in Unprotected Languages
Ahrii Kim | Jinhyeon Kim
Proceedings of the 2nd Workshop on Human Evaluation of NLP Systems (HumEval)

SacreBLEU, by incorporating a text normalizing step in the pipeline, has become a rising automatic evaluation metric in recent MT studies. With agglutinative languages such as Korean, however, the lexical-level metric cannot provide a conceivable result without a customized pre-tokenization. This paper endeavors to examine the influence of diversified tokenization schemes (word, morpheme, subword, character, and consonants & vowels (CV)) on the metric after its protective layer is peeled off. By performing meta-evaluation with manually-constructed into-Korean resources, our empirical study demonstrates that the human correlation of the surface-based metric and other homogeneous ones (as an extension) vacillates greatly by the token type. Moreover, the human correlation of the metric often deteriorates due to some tokenization, with CV one of its culprits. Guiding through the proper usage of tokenizers for the given metric, we discover i) the feasibility of the character tokens and ii) the deficit of CV in the Korean MT evaluation.

Co-authors

Jinhyeon Kim 1
Nuo Xu 1
