Ukyo Honda

2026

BiCSRouter: Bi-Level Cross-System Routing for Utility-Aware LLM Inference
Mao Keyu | Eiki Murata | Ukyo Honda
Findings of the Association for Computational Linguistics: ACL 2026

Selecting an appropriate LLM configuration for a given query is critical, yet existing routing frameworks operate within a single computational paradigm. To address this gap, we formalize the Cross-System Routing Problem, a hierarchical decision-making task that decomposes routing into intra-regime configuration selection and inter-regime system selection. Building on this, we propose BiCSRouter, a bi-level cross-system routing framework that integrates two orthogonal regimes: intensive reasoning via single-agent systems and extensive collaboration via multi-agent systems. BiCSRouter performs policy learning within each system and employs a lightweight inter-regime router that selects the optimal regime based on predicted performance and cost. Experiments on the MBPP and MATH benchmarks demonstrate that BiCSRouter outperforms 15 representative baselines across three types. On MBPP, compared to the performance ceiling of GPT-5, BiCSRouter achieves a 46% reduction in cost with only a 2% drop in accuracy. Finally, we show that BiCSRouter can extend to additional regimes, highlighting its generality as a cross-system routing framework.

Does Self-Consistency Improve the Recall of Encyclopedic Knowledge?
Sho Hoshino | Ukyo Honda | Peinan Zhang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

While self-consistency is known to improve performance on symbolic reasoning, its effect on the recall of encyclopedic knowledge is unclear due to a lack of targeted evaluation grounds. To address this, we establish such a knowledge recall split for the popular MMLU benchmark by applying a data-driven heuristic from prior work. We validate this split by showing that the performance patterns on the symbolic reasoning and knowledge recall subsets mirror those of GSM8K and MedMCQA, respectively. Using this solid ground, we find that self-consistency consistently improves performance across both symbolic reasoning and knowledge recall, even though its underlying CoT prompting is primarily effective for symbolic reasoning. As a result, we achieve an 89% accuracy on MMLU, the best performance to date with the use of GPT-4o.

2025

Distilling Many-Shot In-Context Learning into a Cheat Sheet
Ukyo Honda | Soichiro Murakami | Peinan Zhang
Findings of the Association for Computational Linguistics: EMNLP 2025

Recent advances in large language models (LLMs) enable effective in-context learning (ICL) with many-shot examples, but at the cost of high computational demand due to longer input tokens. To address this, we propose cheat-sheet ICL, which distills the information from many-shot ICL into a concise textual summary (cheat sheet) used as the context at inference time. Experiments on challenging reasoning tasks show that cheat-sheet ICL achieves comparable or better performance than many-shot ICL with far fewer tokens, and matches retrieval-based ICL without requiring test-time retrieval. These findings demonstrate that cheat-sheet ICL is a practical alternative for leveraging LLMs in downstream tasks.

Annotation-Efficient Language Model Alignment via Diverse and Representative Response Texts
Yuu Jinnai | Ukyo Honda
Findings of the Association for Computational Linguistics: EMNLP 2025

Preference optimization is a standard approach to fine-tuning large language models to align with human preferences. The quantity, diversity, and representativeness of the preference dataset are critical to the effectiveness of preference optimization. However, obtaining a large amount of preference annotations is difficult in many applications. This raises the question of how to use the limited annotation budget to create an effective preference dataset. To this end, we propose Annotation-Efficient Preference Optimization (AEPO). Instead of exhaustively annotating preference over all available response texts, AEPO selects a subset of responses that maximizes diversity and representativeness from the available responses and then annotates preference over the selected ones. In this way, AEPO focuses the annotation budget on labeling preferences over a smaller but informative subset of responses. We evaluate the performance of preference learning using AEPO on three datasets and show that it outperforms the baselines with the same annotation budget.

Exploring Explanations Improves the Robustness of In-Context Learning
Ukyo Honda | Tatsushi Oka
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In-context learning (ICL) has emerged as a successful paradigm for leveraging large language models (LLMs). However, it often struggles to generalize beyond the distribution of the provided demonstrations. A recent advancement in enhancing robustness is ICL with explanations (X-ICL), which improves prediction reliability by guiding LLMs to understand and articulate the reasoning behind correct labels. Building on this approach, we introduce an advanced framework that extends X-ICL by systematically exploring explanations for all possible labels (X²-ICL), thereby enabling more comprehensive and robust decision-making. Experimental results on multiple natural language understanding datasets validate the effectiveness of X²-ICL, demonstrating significantly improved robustness to out-of-distribution data compared to the existing ICL approaches.

2024

Not Eliminate but Aggregate: Post-Hoc Control over Mixture-of-Experts to Address Shortcut Shifts in Natural Language Understanding
Ukyo Honda | Tatsushi Oka | Peinan Zhang | Masato Mita
Transactions of the Association for Computational Linguistics, Volume 12

Recent models for natural language understanding are inclined to exploit simple patterns in datasets, commonly known as shortcuts. These shortcuts hinge on spurious correlations between labels and latent features existing in the training data. At inference time, shortcut-dependent models are likely to generate erroneous predictions under distribution shifts, particularly when some latent features are no longer correlated with the labels. To avoid this, previous studies have trained models to eliminate the reliance on shortcuts. In this study, we explore a different direction: pessimistically aggregating the predictions of a mixture-of-experts, assuming each expert captures relatively different latent features. The experimental results demonstrate that our post-hoc control over the experts significantly enhances the model’s robustness to the distribution shift in shortcuts. Additionally, we show that our approach has some practical advantages. We also analyze our model and provide results to support the assumption.

Reinforcement Learning for Edit-Based Non-Autoregressive Neural Machine Translation
Hao Wang | Tetsuro Morimura | Ukyo Honda | Daisuke Kawahara
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)

Non-autoregressive (NAR) language models are known for their low latency in neural machine translation (NMT). However, a performance gap exists between NAR and autoregressive models due to the large decoding space and difficulty in capturing dependency between target words accurately. Compounding this, preparing appropriate training data for NAR models is a non-trivial task, often exacerbating exposure bias. To address these challenges, we apply reinforcement learning (RL) to Levenshtein Transformer, a representative edit-based NAR model, demonstrating that RL with self-generated data can enhance the performance of edit-based NAR models. We explore two RL approaches: stepwise reward maximization and episodic reward maximization. We discuss the respective pros and cons of these two approaches and empirically verify them. Moreover, we experimentally investigate the impact of temperature setting on performance, confirming the importance of proper temperature setting for NAR models’ training.

On the True Distribution Approximation of Minimum Bayes-Risk Decoding
Atsumoto Ohashi | Ukyo Honda | Tetsuro Morimura | Yuu Jinnai
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)

Minimum Bayes-risk (MBR) decoding has recently gained renewed attention in text generation. MBR decoding considers texts sampled from a model as pseudo-references and selects the text with the highest similarity to the others. Therefore, sampling is one of the key elements of MBR decoding, and previous studies reported that the performance varies by sampling methods. From a theoretical standpoint, this performance variation is likely tied to how closely the samples approximate the true distribution of references. However, this approximation has not been the subject of in-depth study. In this study, we propose using anomaly detection to measure the degree of approximation. We first closely examine the performance variation and then show that previous hypotheses about samples do not correlate well with the variation, but our introduced anomaly scores do. The results are the first to empirically support the link between the performance and the core assumption of MBR decoding.

CAMERA³: An Evaluation Dataset for Controllable Ad Text Generation in Japanese
Go Inoue | Akihiko Kato | Masato Mita | Ukyo Honda | Peinan Zhang
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Ad text generation is the task of creating compelling text from an advertising asset that describes products or services, such as a landing page. In advertising, diversity plays an important role in enhancing the effectiveness of an ad text, mitigating a phenomenon called “ad fatigue,” where users become disengaged due to repetitive exposure to the same advertisement. Despite numerous efforts in ad text generation, the aspect of diversifying ad texts has received limited attention, particularly in non-English languages like Japanese. To address this, we present CAMERA³, an evaluation dataset for controllable text generation in the advertising domain in Japanese. Our dataset includes 3,980 ad texts written by expert annotators, taking into account various aspects of ad appeals. We make CAMERA³ publicly available, allowing researchers to examine the capabilities of recent NLG models in controllable text generation in a real-world scenario.

A Single Linear Layer Yields Task-Adapted Low-Rank Matrices
Hwichan Kim | Shota Sasaki | Sho Hoshino | Ukyo Honda
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Low-Rank Adaptation (LoRA) is a widely used Parameter-Efficient Fine-Tuning (PEFT) method that updates an initial weight matrix W₀ with a delta matrix ΔW consisting of two low-rank matrices A and B. A previous study suggested that there is a correlation between W₀ and ΔW. In this study, we aim to delve deeper into the relationships between W₀ and the low-rank matrices A and B to further comprehend the behavior of LoRA. In particular, we analyze a conversion matrix that transforms W₀ into low-rank matrices and thus encapsulates information about these relationships. Our analysis reveals that the conversion matrices are similar across layers. Inspired by these findings, we hypothesize that a single linear layer, which takes each layer’s W₀ as input, can yield task-adapted low-rank matrices. To confirm this hypothesis, we devise a method named Conditionally Parameterized LoRA (CondLoRA) that updates initial weight matrices with low-rank matrices derived from a single linear layer. Our empirical results show that CondLoRA maintains performance on par with LoRA, despite the fact that the trainable parameters of CondLoRA are fewer than those of LoRA. Therefore, we conclude that “a single linear layer yields task-adapted low-rank matrices.” The code used in our experiments is available at https://github.com/CyberAgentAILab/CondLoRA.

Generating Diverse and High-Quality Texts by Minimum Bayes Risk Decoding
Yuu Jinnai | Ukyo Honda | Tetsuro Morimura | Peinan Zhang
Findings of the Association for Computational Linguistics: ACL 2024

One of the most important challenges in text generation systems is to produce outputs that are not only correct but also diverse. Recently, Minimum Bayes-Risk (MBR) decoding has gained prominence for generating sentences of the highest quality among the decoding algorithms. However, existing algorithms proposed to generate diverse outputs are predominantly based on beam search or random sampling, thus their output quality is capped by these underlying decoding algorithms. In this paper, we investigate an alternative approach: we develop diversity-promoting decoding algorithms by imposing diversity objectives on MBR decoding. We propose two variants of MBR: (i) Diverse MBR (DMBR), which adds a diversity penalty to the decoding objective, and (ii) k-medoids MBR (KMBR), which reformulates the decoding task as a clustering problem. We evaluate DMBR and KMBR on a variety of directed text generation tasks using encoder-decoder models and a language model with prompting. The experimental results show that the proposed method achieves a better trade-off than the diverse beam search and sampling algorithms overall.

2022

Law Retrieval with Supervised Contrastive Learning Using the Hierarchical Structure of Law
Jungmin Choi | Ukyo Honda | Taro Watanabe | Hiroki Ouchi | Kentaro Inui
Proceedings of the 36th Pacific Asia Conference on Language, Information and Computation

2021

Removing Word-Level Spurious Alignment between Images and Pseudo-Captions in Unsupervised Image Captioning
Ukyo Honda | Yoshitaka Ushiku | Atsushi Hashimoto | Taro Watanabe | Yuji Matsumoto
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Unsupervised image captioning is a challenging task that aims at generating captions without the supervision of image-sentence pairs, but only with images and sentences drawn from different sources and object labels detected from the images. In previous work, pseudo-captions, i.e., sentences that contain the detected object labels, were assigned to a given image. The focus of the previous work was on the alignment of input images and pseudo-captions at the sentence level. However, pseudo-captions contain many words that are irrelevant to a given image. In this work, we investigate the effect of removing mismatched words from image-sentence alignment to determine how they make this task difficult. We propose a simple gating mechanism that is trained to align image features with only the most reliable words in pseudo-captions: the detected object labels. The experimental results show that our proposed method outperforms the previous methods without introducing complex sentence-level learning objectives. Combined with the sentence-level alignment method of previous work, our method further improves its performance. These results confirm the importance of careful alignment in word-level details.

2018

Pruning Basic Elements for Better Automatic Evaluation of Summaries
Ukyo Honda | Tsutomu Hirao | Masaaki Nagata
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

We propose a simple but highly effective automatic evaluation measure of summarization, pruned Basic Elements (pBE). Although the BE concept is widely used for the automated evaluation of summaries, its weakness is that it redundantly matches basic elements. To avoid this redundancy, pBE prunes basic elements by (1) disregarding the frequency count of basic elements and (2) reducing semantically overlapping basic elements based on word similarity. Even though it is simple, pBE outperforms ROUGE on the DUC datasets in most cases and achieves the highest rank correlation coefficient in the TAC 2011 AESOP task.
