Yoichi Ishibashi

2025

Can Large Language Models Invent Algorithms to Improve Themselves?: Algorithm Discovery for Recursive Self-Improvement through Reinforcement Learning
Yoichi Ishibashi | Taro Yano | Masafumi Oyamada
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Large Language Models (LLMs) have shown remarkable performance improvements and are rapidly gaining adoption in industry. However, the methods for improving LLMs are still designed by humans, which restricts the invention of new model-improving algorithms to human expertise and imagination. To address this, we propose the Self-Developing framework, which enables LLMs to autonomously generate and learn model-improvement algorithms. In this framework, the seed model generates, applies, and learns model-improving algorithms, continuously improving both the seed model and the algorithms themselves. Among model-improving strategies, we focus on model merging algorithms. In mathematical reasoning tasks, Self-Developing discovers novel merging strategies and outperforms human-designed methods. On GSM8k, the discovered algorithms improve the seed model by 6% and surpass human-designed methods by 4.3%. Moreover, they exhibit strong transferability, achieving a 7.4% performance gain on out-of-domain models. These results suggest that LLMs can autonomously develop effective model-improvement techniques beyond human intuition.

pdf bib abs

LaMDAgent: An Autonomous Framework for Post-Training Pipeline Optimization via LLM Agents
Taro Yano | Yoichi Ishibashi | Masafumi Oyamada
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Large Language Models (LLMs) excel across diverse tasks, with post-training methods like Supervised Fine-Tuning (SFT), Preference Learning, and Model Merging enabling effective domain and task adaptation. While outcomes can vary with data orderings or component combinations, yet manual pipeline optimization is costly and labor-intensive. Existing approaches typically rely on manual design or focus narrowly on optimizing individual components, such as data ordering or merging parameters. We propose LaMDAgent, an LLM Agent-driven framework that autonomously constructs and optimizes end-to-end post-training pipelines by exploring various model improving methods, objects, and their applied orderings based on task-based feedback. LaMDAgent achieves a 9.0-point gain in tool-use accuracy without degrading instruction-following, and identifies high-performing strategies overlooked by manual design.We further analyze the impact of data and model scaling to reduce computational costs on the exploration, finding that model size scalings introduces new challenges, whereas scaling data size enables cost-effective pipeline discovery.

pdf bib abs

DeLTa: A Decoding Strategy based on Logit Trajectory Prediction Improves Factuality and Reasoning Ability
Yunzhen He | Yusuke Takase | Yoichi Ishibashi | Hidetoshi Shimodaira
Proceedings of the 2nd Workshop on Uncertainty-Aware NLP (UncertaiNLP 2025)

Large Language Models (LLMs) are increasingly being used in real-world applications. However, concerns about the reliability of the content they generate persist, as it frequently deviates from factual correctness or exhibits deficiencies in logical reasoning. This paper proposes a novel decoding strategy aimed at enhancing both factual accuracy and inferential reasoning without requiring any modifications to the architecture or pre-trained parameters of LLMs. Our approach adjusts next-token probabilities by analyzing the trajectory of logits from lower to higher layers in Transformers and applying linear regression. We find that this Decoding by Logit Trajectory-based approach (DeLTa) effectively reinforces factuality and reasoning while mitigating incorrect generation. Experiments on TruthfulQA demonstrate that DeLTa attains up to a 4.9% improvement over the baseline. Furthermore, it enhances performance by up to 8.1% on StrategyQA and 7.3% on GSM8K, both of which demand strong reasoning capabilities.

2024

pdf bib abs

Subspace Representations for Soft Set Operations and Sentence Similarities
Yoichi Ishibashi | Sho Yokoi | Katsuhito Sudoh | Satoshi Nakamura
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

In the field of natural language processing (NLP), continuous vector representations are crucial for capturing the semantic meanings of individual words. Yet, when it comes to the representations of sets of words, the conventional vector-based approaches often struggle with expressiveness and lack the essential set operations such as union, intersection, and complement. Inspired by quantum logic, we realize the representation of word sets and corresponding set operations within pre-trained word embedding spaces. By grounding our approach in the linear subspaces, we enable efficient computation of various set operations and facilitate the soft computation of membership functions within continuous spaces. Moreover, we allow for the computation of the F-score directly within word vectors, thereby establishing a direct link to the assessment of sentence similarity. In experiments with widely-used pre-trained embeddings and benchmarks, we show that our subspace-based set operations consistently outperform vector-based ones in both sentence similarity and set retrieval tasks.

2023

pdf bib abs

Evaluating the Robustness of Discrete Prompts
Yoichi Ishibashi | Danushka Bollegala | Katsuhito Sudoh | Satoshi Nakamura
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Discrete prompts have been used for fine-tuning Pre-trained Language Models for diverse NLP tasks. In particular, automatic methods that generate discrete prompts from a small set of training instances have reported superior performance. However, a closer look at the learnt prompts reveals that they contain noisy and counter-intuitive lexical constructs that would not be encountered in manually-written prompts. This raises an important yet understudied question regarding the robustness of automatically learnt discrete prompts when used in downstream tasks. To address this question, we conduct a systematic study of the robustness of discrete prompts by applying carefully designed perturbations into an application using AutoPrompt and then measure their performance in two Natural Language Inference (NLI) datasets. Our experimental results show that although the discrete prompt-based method remains relatively robust against perturbations to NLI inputs, they are highly sensitive to other types of perturbations such as shuffling and deletion of prompt tokens. Moreover, they generalize poorly across different NLI datasets. We hope our findings will inspire future work on robust discrete prompt learning.

2021

pdf bib abs

Multilingual Machine Translation Evaluation Metrics Fine-tuned on Pseudo-Negative Examples for WMT 2021 Metrics Task
Kosuke Takahashi | Yoichi Ishibashi | Katsuhito Sudoh | Satoshi Nakamura
Proceedings of the Sixth Conference on Machine Translation

This paper describes our submission to the WMT2021 shared metrics task. Our metric is operative to segment-level and system-level translations. Our belief toward a better metric is to detect a significant error that cannot be missed in the real practice cases of evaluation. For that reason, we used pseudo-negative examples in which attributes of some words are transferred to the reversed attribute words, and we build evaluation models to handle such serious mistakes of translations. We fine-tune a multilingual largely pre-trained model on the provided corpus of past years’ metric task and fine-tune again further on the synthetic negative examples that are derived from the same fine-tune corpus. From the evaluation results of the WMT21’s development corpus, fine-tuning on the pseudo-negatives using WMT15-17 and WMT18-20 metric corpus achieved a better Pearson’s correlation score than the one fine-tuned without negative examples. Our submitted models,hyp+src_hyp+ref and hyp+src_hyp+ref.negative, are the plain model using WMT18-20 and the one additionally fine-tuned on negative samples, respectively.

2020

pdf bib abs

Reflection-based Word Attribute Transfer
Yoichi Ishibashi | Katsuhito Sudoh | Koichiro Yoshino | Satoshi Nakamura
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

Word embeddings, which often represent such analogic relations as king - man + woman queen, can be used to change a word’s attribute, including its gender. For transferring king into queen in this analogy-based manner, we subtract a difference vector man - woman based on the knowledge that king is male. However, developing such knowledge is very costly for words and attributes. In this work, we propose a novel method for word attribute transfer based on reflection mappings without such an analogy operation. Experimental results show that our proposed method can transfer the word attributes of the given words without changing the words that do not have the target attributes.

Co-authors

Yunzhen He 1

Hidetoshi Shimodaira 1

Venues

WMT1

WS1

Fix author