Yusuke Yamauchi
2026
From Semantics to Style: A Cross-Dataset Comparative Framework for Sentence Similarity Predictions
Yusuke Yamauchi | Akiko Aizawa
Findings of the Association for Computational Linguistics: EACL 2026
While the Semantic Textual Similarity (STS) task serves as a cornerstone embedding task in natural language processing, the definition of similarity is inherently ambiguous and dataset-specific. Comprehensive cross-dataset analysis remains scarce, leaving it uncertain whether language models perceive diverse semantic and stylistic nuances as humans do. To address this, we propose a comparative framework utilizing lightweight poolers on a frozen encoder to conduct a unified analysis across STS, Paraphrase Identification (PI), and Triplet datasets. Experimental results on 21 datasets indicate a high correlation of semantic concepts between STS and PI settings, while highlighting style as a distinct dimension necessitating explicit separation from semantics. Moreover, Procrustes, layer-wise, and hierarchical clustering analyses elucidate the varying properties of these concepts and the structural organization of the embedding space. These insights imply that treating semantics and style as separate components in embedding models is crucial for enhancing both interpretability and practical utility.
2025
Massive Supervised Fine-tuning Experiments Reveal How Data, Layer, and Training Factors Shape LLM Alignment Quality
Yuto Harada | Yusuke Yamauchi | Yusuke Oda | Yohei Oseki | Yusuke Miyao | Yu Takagi
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Supervised fine-tuning (SFT) is a critical step in aligning large language models (LLMs) with human instructions and values, yet many aspects of SFT remain poorly understood. We trained a wide range of base models on a variety of datasets including code generation, mathematical reasoning, and general-domain tasks, resulting in 1,000+ SFT models under controlled conditions. We then identified the dataset properties that matter most and examined the layer-wise modifications introduced by SFT. Our findings reveal that some training–task synergies persist across all models while others vary substantially, emphasizing the importance of model-specific strategies. Moreover, we demonstrate that perplexity consistently predicts SFT effectiveness, often surpassing superficial similarity between the training data and the benchmark, and that mid-layer weight changes correlate most strongly with performance gains. We release these 1,000+ SFT models and benchmark results to accelerate further research. All resources are available at https://github.com/llm-jp/massive-sft.