Efficiently Selecting Response Generation Strategies for Synthetic Data Construction by Self-Aligned Perplexity

Xuan Ren; Qi Chen; Lingqiao Liu

Abstract

Fine-tuning large language models (LLMs) typically relies on producing large sets of input-output pairs. Yet for a given question, there can be many valid outputs. In practice, these outputs are often derived by distilling knowledge from teacher models, and they can vary depending on the specific teacher model or prompting strategy employed. Recent findings show that how these training outputs are generated can significantly affect the performance of the fine-tuned model, raising an important question: how do we pick the best data generation method from among numerous possibilities? Rather than exhaustively training and evaluating on each candidate, this paper proposes a scalable approximate method that assesses a small subset of generated data to estimate its suitability for a specific target LLM. Our central idea is that effective outputs should be familiar to the target LLM. While previous work measures familiarity with perplexity, our empirical analyses and practical observations indicate that perplexity can be suboptimal in characterizing “familiarity”. To address this, we introduce self-aligned perplexity, a novel metric capturing how closely candidate outputs adhere to the target LLM’s own style and reasoning patterns. In this way, we can identify the most effective generation strategy on a small sample, then apply it to produce the complete training set. We demonstrate that training on data generated by the chosen method yields significant improvements across diverse reasoning-focused benchmarks, particularly in cases where different candidate methods lead to highly divergent training outcomes.
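
As a rough illustration of the selection step described in the abstract, the sketch below scores a small sample of responses from each candidate generation strategy and picks the strategy the target LLM finds most familiar. It uses plain perplexity (the baseline familiarity measure the abstract attributes to previous work), not self-aligned perplexity, whose definition is not given here. The function names, strategy names, and log-probability values are all hypothetical; in practice the per-token log-probabilities would come from the target LLM.

```python
import math
from statistics import mean


def perplexity(token_logprobs):
    """Perplexity of one response: exp of the negative mean natural-log token probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def pick_strategy(samples_by_strategy):
    """Score each strategy by mean perplexity over its sample; lower means more familiar.

    samples_by_strategy maps a strategy name to a list of responses, where each
    response is a list of per-token log-probabilities under the target LLM.
    """
    scores = {
        name: mean(perplexity(lp) for lp in responses)
        for name, responses in samples_by_strategy.items()
    }
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical log-probabilities for a tiny sample from two candidate strategies.
samples = {
    "teacher_a_direct": [[-0.2, -0.4, -0.3], [-0.5, -0.1]],
    "teacher_b_step_by_step": [[-1.2, -0.9, -1.5], [-0.8, -1.1]],
}
best, scores = pick_strategy(samples)
print(best, scores)
```

Only the scoring function would need to change to plug in a different familiarity metric; the selected strategy would then be used to generate the complete training set.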

Anthology ID: 2025.findings-emnlp.621
Volume: Findings of the Association for Computational Linguistics: EMNLP 2025
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 11584–11605
URL: https://aclanthology.org/2025.findings-emnlp.621/
Cite (ACL): Xuan Ren, Qi Chen, and Lingqiao Liu. 2025. Efficiently Selecting Response Generation Strategies for Synthetic Data Construction by Self-Aligned Perplexity. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 11584–11605, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Efficiently Selecting Response Generation Strategies for Synthetic Data Construction by Self-Aligned Perplexity (Ren et al., Findings 2025)
PDF: https://aclanthology.org/2025.findings-emnlp.621.pdf
Checklist: 2025.findings-emnlp.621.checklist.pdf
