Yuu Jinnai

2025

Document-Level Text Generation with Minimum Bayes Risk Decoding using Optimal Transport
Yuu Jinnai
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Document-level text generation tasks are known to be more difficult than sentence-level text generation tasks as they require an understanding of longer context to generate high-quality texts. In this paper, we investigate the adaptation of Minimum Bayes Risk (MBR) decoding for document-level text generation tasks. MBR decoding makes use of a utility function to estimate the output with the highest expected utility from a set of candidate outputs. Although MBR decoding is shown to be effective in a wide range of sentence-level text generation tasks, its performance on document-level text generation tasks is limited, as many of the utility functions are designed for evaluating the utility of sentences. To this end, we propose MBR-OT, a variant of MBR decoding using Wasserstein distance to compute the utility of a document using a sentence-level utility function. The experimental result shows that the performance of MBR-OT outperforms that of the standard MBR in document-level machine translation, text simplification, and dense image captioning tasks.

pdf bib abs

Theoretical Guarantees for Minimum Bayes Risk Decoding
Yuki Ichihara | Yuu Jinnai | Kaito Ariu | Tetsuro Morimura | Eiji Uchibe
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Minimum Bayes Risk (MBR) decoding optimizes output selection by maximizing the expected utility value of an underlying human distribution. While prior work has shown the effectiveness of MBR decoding through empirical evaluation, few studies have analytically investigated why the method is effective. As a result of our analysis, we show that, given the size n of the reference hypothesis set used in computation, MBR decoding approaches the optimal solution with high probability at a rate of 𝒪(n^-¹⁄₂), under certain assumptions, even though the language space 𝒴 is significantly larger |𝒴| ≫ n.This result helps to theoretically explain the strong performance observed in several prior empirical studies on MBR decoding. In addition, we provide the performance gap for maximum-a-posteriori (MAP) decoding and compare it to MBR decoding. The result of this paper indicates that MBR decoding tends to converge to the optimal solution faster than MAP decoding in several cases.

pdf bib abs

Do Large Language Models Know Folktales? A Case Study of Yokai in Japanese Folktales
Ayuto Tsutsumi | Yuu Jinnai
Findings of the Association for Computational Linguistics: ACL 2025

Although Large Language Models (LLMs) have demonstrated strong language understanding and generation abilities across various languages, their cultural knowledge is often limited to English-speaking communities, which can marginalize the cultures of non-English communities. To address the problem, the evaluation of the cultural awareness of the LLMs and the methods to develop culturally aware LLMs have been investigated. In this study, we focus on evaluating knowledge of folktales, a key medium for conveying and circulating culture. In particular, we focus on Japanese folktales, specifically on knowledge of Yokai. Yokai are supernatural creatures originating from Japanese folktales that continue to be popular motifs in art and entertainment today. Yokai have long served as a medium for cultural expression, making them an ideal subject for assessing the cultural awareness of LLMs. We introduce YokaiEval, a benchmark dataset consisting of 809 multiple-choice questions (each with four options) designed to probe knowledge about yokai. We evaluate the performance of 31 Japanese and multilingual LLMs on this dataset. The results show that models trained with Japanese language resources achieve higher accuracy than English-centric models, with those that underwent continued pretraining in Japanese, particularly those based on Llama-3, performing especially well.

pdf bib abs

Annotation-Efficient Language Model Alignment via Diverse and Representative Response Texts
Yuu Jinnai | Ukyo Honda
Findings of the Association for Computational Linguistics: EMNLP 2025

Preference optimization is a standard approach to fine-tuning large language models to align with human preferences. The quantity, diversity, and representativeness of the preference dataset are critical to the effectiveness of preference optimization. However, obtaining a large amount of preference annotations is difficult in many applications. This raises the question of how to use the limited annotation budget to create an effective preference dataset. To this end, we propose Annotation-Efficient Preference Optimization (AEPO). Instead of exhaustively annotating preference over all available response texts, AEPO selects a subset of responses that maximizes diversity and representativeness from the available responses and then annotates preference over the selected ones. In this way, AEPO focuses the annotation budget on labeling preferences over a smaller but informative subset of responses. We evaluate the performance of preference learning using AEPO on three datasets and show that it outperforms the baselines with the same annotation budget.

pdf bib abs

Regularized Best-of-N Sampling with Minimum Bayes Risk Objective for Language Model Alignment
Yuu Jinnai | Tetsuro Morimura | Kaito Ariu | Kenshi Abe
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Best-of-N (BoN) sampling with a reward model has been shown to be an effective strategy for aligning Large Language Models (LLMs) to human preferences at the time of decoding. BoN sampling is susceptible to a problem known as reward hacking when the accuracy of the reward model is not high enough. Because the reward model is an imperfect proxy for the true objective, over-optimizing its value can compromise its performance on the true objective. A common solution to prevent reward hacking in preference learning techniques is to optimize a reward using proximity regularization, which ensures that the language model remains close to the reference model. In this research, we propose MBR-BoN, a variant of BoN that aims to mitigate reward hacking at inference time by incorporating the Minimum Bayes Risk (MBR) objective as a proximity regularization term. We show empirically and analytically that the MBR objective quantifies the proximity of the response to the reference policy, serving as a proximity regularizer for BoN sampling. We evaluate MBR-BoN on the AlpacaFarm and Anthropic’s hh-rlhf datasets and show that it outperforms both BoN sampling and MBR decoding. As an application of MBR-BoN, we use it to generate a pairwise preference learning dataset. Experimental results show that DPO models trained on a dataset generated with MBR-BoN outperform a DPO model generated with vanilla BoN.

pdf bib abs

Auto-Weighted Group Relative Preference Optimization for Multi-Objective Text Generation Tasks
Yuki Ichihara | Yuu Jinnai
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track

Group Relative Policy Optimization (GRPO) is a promising approach to complex, real-world tasks, such as those involving multiple rewards or strict constraints. However, when training GRPO with multiple rewards, the weights of each reward must be decided in advance. Failing to balance the objectives adequately can lead to overfitting or insufficient learning of each reward function. To address this problem, we propose Auto-Weighted Group Relative Policy Optimization (AW-GRPO), which adjusts reward weights during training according to the progress of the learning of each objective so far.We evaluate AW-GRPO on advertising text generation, a real-world problem where the generated text must satisfy multiple objectives, such as quality and diversity, while adhering to the constraints of the media (e.g., maximum number of characters).Our results show that AW-GRPO successfully balances multiple objectives, improving the overall scores while reducing the constraint violation rate.We additionally evaluate AW-GRPO using publicly available benchmark problems for reproducibility, in which we observe the same qualitative result that the proposed method outperforms GRPO.

2024

pdf bib abs

Does Cross-Cultural Alignment Change the Commonsense Morality of Language Models?
Yuu Jinnai
Proceedings of the 2nd Workshop on Cross-Cultural Considerations in NLP

Alignment of the language model with human preferences is a common approach to making a language model useful to end users.However, most alignment work is done in English, and human preference datasets are dominated by English, reflecting only the preferences of English-speaking annotators.Nevertheless, it is common practice to use the English preference data, either directly or by translating it into the target language, when aligning a multilingual language model.The question is whether such an alignment strategy marginalizes the preference of non-English speaking users.To this end, we investigate the effect of aligning Japanese language models with (mostly) English resources.In particular, we focus on evaluating whether the commonsense morality of the resulting fine-tuned models is aligned with Japanese culture using the JCommonsenseMorality (JCM) and ETHICS datasets.The experimental results show that the fine-tuned model outperforms the SFT model. However, it does not demonstrate the same level of improvement as a model fine-tuned using the JCM, suggesting that while some aspects of commonsense morality are transferable, others may not be.

pdf bib abs

Filtered Direct Preference Optimization
Tetsuro Morimura | Mitsuki Sakamoto | Yuu Jinnai | Kenshi Abe | Kaito Ariu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Reinforcement learning from human feedback (RLHF) plays a crucial role in aligning language models with human preferences. While the significance of dataset quality is generally recognized, explicit investigations into its impact within the RLHF framework, to our knowledge, have been limited. This paper addresses the issue of text quality within the preference dataset by focusing on direct preference optimization (DPO), an increasingly adopted reward-model-free RLHF method. We confirm that text quality significantly influences the performance of models optimized with DPO more than those optimized with reward-model-based RLHF. Building on this new insight, we propose an extension of DPO, termed filtered direct preference optimization (fDPO). fDPO uses a trained reward model to monitor the quality of texts within the preference dataset during DPO training. Samples of lower quality are discarded based on comparisons with texts generated by the model being optimized, resulting in a more accurate dataset. Experimental results demonstrate that fDPO enhances the final model performance. Our code is available at https://github.com/CyberAgentAILab/filtered-dpo.

pdf bib abs

Hyperparameter-Free Approach for Faster Minimum Bayes Risk Decoding
Yuu Jinnai | Kaito Ariu
Findings of the Association for Computational Linguistics: ACL 2024

Minimum Bayes-Risk (MBR) decoding is shown to be a powerful alternative to beam search decoding for a wide range of text generation tasks. However, MBR requires a huge amount of time for inference to compute the MBR objective, which makes the method infeasible in many situations where response time is critical. Confidence-based pruning (CBP) (Cheng and Vlachos, 2023) has recently been proposed to reduce the inference time in machine translation tasks. Although it is shown to significantly reduce the amount of computation, it requires hyperparameter tuning using a development set to be effective. To this end, we propose Adaptive Minimum Bayes-Risk (AMBR) decoding, a hyperparameter-free method to run MBR decoding efficiently. AMBR is derived from the observation that the problem of computing the sample-based MBR objective is the medoid identification problem. AMBR uses the Correlated Sequential Halving (CSH) algorithm (Baharav and Tse, 2019), the algorithm with the best performance guarantee to date for the medoid identification problem, to compute the sample-based MBR objective. We evaluate AMBR on machine translation, text summarization, and image captioning tasks. The results show that AMBR achieves on par with CBP, with CBP selecting hyperparameters through an Oracle for each given computation budget.

pdf bib abs

On the True Distribution Approximation of Minimum Bayes-Risk Decoding
Atsumoto Ohashi | Ukyo Honda | Tetsuro Morimura | Yuu Jinnai
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)

Minimum Bayes-risk (MBR) decoding has recently gained renewed attention in text generation.MBR decoding considers texts sampled from a model as pseudo-references and selects the text with the highest similarity to the others.Therefore, sampling is one of the key elements of MBR decoding, and previous studies reported that the performance varies by sampling methods.From a theoretical standpoint, this performance variation is likely tied to how closely the samples approximate the true distribution of references.However, this approximation has not been the subject of in-depth study.In this study, we propose using anomaly detection to measure the degree of approximation.We first closely examine the performance variation and then show that previous hypotheses about samples do not correlate well with the variation, but our introduced anomaly scores do.The results are the first to empirically support the link between the performance and the core assumption of MBR decoding.

pdf bib abs

Generating Diverse and High-Quality Texts by Minimum Bayes Risk Decoding
Yuu Jinnai | Ukyo Honda | Tetsuro Morimura | Peinan Zhang
Findings of the Association for Computational Linguistics: ACL 2024

One of the most important challenges in text generation systems is to produce outputs that are not only correct but also diverse.Recently, Minimum Bayes-Risk (MBR) decoding has gained prominence for generating sentences of the highest quality among the decoding algorithms. However, existing algorithms proposed to generate diverse outputs are predominantly based on beam search or random sampling, thus their output quality is capped by these underlying decoding algorithms. In this paper, we investigate an alternative approach – we develop diversity-promoting decoding algorithms by enforcing diversity objectives to MBR decoding.We propose two variants of MBR; (i) Diverse MBR (DMBR) that adds a diversity penalty to the decoding objective and (ii) k-medoids MBR (KMBR) that reformulates the decoding task as a clustering problem.We evaluate DMBR and KMBR on a variety of directed text generation tasks using encoder-decoder models and a language model with prompting. The experimental results show that the proposed method achieves a better trade-off than the diverse beam search and sampling algorithms overall.

Co-authors

Venues

WS1

Fix author