Changhao Jiang
2026
From Scores to Preferences: Redefining Evaluation Paradigm for Speech Quality Reward Modeling
Yifei Cao | Changhao Jiang | Jiabao Zhuang | Jiajun Sun | Ming Zhang | Zhiheng Xi | Hui Li | Shihan Dou | Yuran Wang | Yunke Zhang | Tao Ji | Tao Gui | Qi Zhang | Xuanjing Huang
Findings of the Association for Computational Linguistics: ACL 2026
Speech quality assessment (SQA) is typically formulated as a score regression task based on subjective ratings, such as the Mean Opinion Score (MOS), which inherently suffer from inconsistent standards and limit cross-dataset training and evaluation. To address these limitations, we reformulate SQA as a preference-based comparison paradigm and construct MOS-Pref, a large-scale MOS-derived preference dataset. Building on MOS-Pref, we systematically implement and evaluate three reward modeling paradigms: scalar, semi-scalar, and generative reward models, alongside existing SQA approaches. Our experiments reveal three key findings: (1) scalar models achieve the strongest overall performance, consistently exceeding 74% accuracy; (2) score regression-based approaches generally underperform preference-based methods in both overall performance and generalization; and (3) all reward models struggle on pairs with very small MOS gaps. Motivated by these observations, we propose a MOS-aware GRM design that incorporates the MOS gap into the reward function during reinforcement learning. Experimental results show that the MOS-aware GRM significantly improves fine-grained speech quality discrimination. We hope this work fosters more rigorous and scalable research in SQA.
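As a rough illustration of the preference-based formulation described in this abstract, the sketch below derives preference pairs from MOS-rated utterances and scores a model by pairwise accuracy. It is not the MOS-Pref construction procedure, which the abstract does not detail. The function names, the `min_gap` threshold, and the toy data are all hypothetical.

```python
from itertools import combinations


def build_preference_pairs(items, min_gap=0.0):
    """Turn (utterance_id, mos) records into (preferred, rejected, gap) triples.

    Pairs whose MOS gap is not larger than `min_gap` are skipped, so ties
    never produce a preference label. `min_gap` is an assumed knob.
    """
    pairs = []
    for (id_a, mos_a), (id_b, mos_b) in combinations(items, 2):
        gap = abs(mos_a - mos_b)
        if gap <= min_gap:
            continue
        if mos_a > mos_b:
            pairs.append((id_a, id_b, gap))
        else:
            pairs.append((id_b, id_a, gap))
    return pairs


def pairwise_accuracy(pairs, predicted):
    """Fraction of pairs where the model scores the preferred item higher."""
    if not pairs:
        return 0.0
    correct = sum(1 for pref, rej, _ in pairs if predicted[pref] > predicted[rej])
    return correct / len(pairs)


# Toy data: utterance ids with made-up MOS values and made-up model scores.
items = [("a", 4.5), ("b", 3.0), ("c", 3.1), ("d", 3.0)]
pairs = build_preference_pairs(items)
predicted = {"a": 0.9, "b": 0.2, "c": 0.1, "d": 0.3}
accuracy = pairwise_accuracy(pairs, predicted)
```

In this toy case the model ranks the large-gap pairs correctly and misses the two pairs separated by only 0.1 MOS, which mirrors the kind of small-gap difficulty the abstract reports.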
Muse: Towards Reproducible Long-Form Song Generation with Fine-Grained Style Control
Changhao Jiang | Jiahao Chen | Zhenghao Xiang | Zhixiong Yang | Hanchen Wang | Jiabao Zhuang | Xinmeng Che | Jiajun Sun | Hui Li | Yifei Cao | Shihan Dou | Ming Zhang | Junjie Ye | Tao Ji | Tao Gui | Qi Zhang | Xuanjing Huang
Findings of the Association for Computational Linguistics: ACL 2026
Recent commercial systems such as Suno demonstrate strong capabilities in long-form song generation, while academic research remains largely non-reproducible due to the lack of publicly available training data, hindering fair comparison and progress. To this end, we release a fully open-source system for long-form song generation with fine-grained style conditioning, including a licensed synthetic dataset, training and evaluation pipelines, and Muse, an easy-to-deploy song generation model. The dataset consists of 116k fully licensed synthetic songs with automatically generated lyrics and style descriptions paired with audio synthesized by SunoV5. We train Muse via single-stage supervised finetuning of a Qwen-based language model extended with discrete audio tokens using MuCodec, without task-specific losses, auxiliary objectives, or additional architectural components. Our evaluations find that although Muse is trained with a modest data scale and model size, it achieves competitive performance on phoneme error rate, text–music style similarity, and audio aesthetic quality, while enabling controllable segment-level generation across different musical structures. All data, model weights, and training and evaluation pipelines will be publicly released, paving the way for continued progress in controllable long-form song generation research.
Feedback-Driven Tool-Use Improvements in Large Language Models via Automated Build Environments
Junjie Ye | Changhao Jiang | Zhengyin Du | Yufei Xu | Xuesong Yao | Zhiheng Xi | Xiaoran Fan | Qi Zhang | Tao Gui | Xuanjing Huang | Jiecao Chen
Findings of the Association for Computational Linguistics: ACL 2026
Effective tool use is essential for large language models (LLMs) to interact with their environment. However, progress is limited by the lack of efficient reinforcement learning (RL) frameworks specifically designed for tool use, due to challenges in constructing stable training environments and designing verifiable reward mechanisms. To address this, we propose an automated environment construction pipeline, incorporating scenario decomposition, document generation, function integration, complexity scaling, and localized deployment. This enables the creation of high-quality training environments that provide detailed and measurable feedback without relying on external tools. Additionally, we introduce a verifiable reward mechanism that evaluates both the precision of tool use and the completeness of task execution. When combined with trajectory data collected from the constructed environments, this mechanism integrates seamlessly with standard RL algorithms to facilitate feedback-driven model training. Experiments on LLMs of varying scales demonstrate that our approach significantly enhances the models’ tool-use performance without degrading their general capabilities. Our analysis suggests that these gains result from improved context understanding and reasoning, driven by updates to the lower-layer MLP parameters in models. Code and data are available at https://github.com/bytedance/FTRL.
LLMEval-Fair: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models
Ming Zhang | Yujiong Shen | Jingyi Deng | Yuhui Wang | Huayu Sha | Kexin Tan | Qiyuan Peng | Yue Zhang | Junzhe Wang | Shichun Liu | Yueyuan Huang | Jingqi Tong | Changhao Jiang | Yilong Wu | Zhihao Zhang | Mingqi Wu | Mingxu Chai | Zhiheng Xi | Shihan Dou | Tao Gui | Qi Zhang | Xuanjing Huang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Existing evaluation of Large Language Models (LLMs) on static benchmarks is vulnerable to data contamination and leaderboard overfitting, critical issues that obscure true model capabilities. To address this, we introduce LLMEval-Fair, a framework for dynamic evaluation of LLMs. LLMEval-Fair is built on a proprietary bank of 220k graduate-level questions, from which it dynamically samples unseen test sets for each evaluation run. Its automated pipeline ensures integrity via contamination-resistant data curation, a novel anti-cheating architecture, and a calibrated LLM-as-a-judge process achieving 90% agreement with human experts, complemented by a relative ranking system for fair comparison. A 30-month longitudinal study of nearly 60 leading models reveals a performance ceiling on knowledge memorization and exposes data contamination vulnerabilities undetectable by static benchmarks. The framework demonstrates exceptional robustness in ranking stability and consistency, providing strong empirical validation for the dynamic evaluation paradigm. LLMEval-Fair offers a robust and credible methodology for assessing the true capabilities of LLMs beyond leaderboard scores, promoting the development of more trustworthy evaluation standards.
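One possible reading of "dynamically samples unseen test sets for each evaluation run" is sampling without replacement across runs, so that no question is reused. The sketch below illustrates only that idea. The `QuestionBank` class, its method name, and the sizes are hypothetical and are not taken from the LLMEval-Fair implementation.

```python
import random


class QuestionBank:
    """Hands out disjoint test sets from a fixed pool of question ids."""

    def __init__(self, question_ids, seed=0):
        # Shuffle once with a private RNG so the draw order is reproducible.
        self._unused = list(question_ids)
        self._rng = random.Random(seed)
        self._rng.shuffle(self._unused)

    def sample_test_set(self, size):
        """Return `size` ids that no earlier call has returned."""
        if size <= 0:
            raise ValueError("size must be positive")
        if size > len(self._unused):
            raise ValueError("not enough unseen questions left")
        test_set = self._unused[-size:]
        del self._unused[-size:]  # consumed ids can never be drawn again
        return test_set

    def remaining(self):
        return len(self._unused)


# Two evaluation runs drawing from a toy bank of 1,000 question ids.
bank = QuestionBank(range(1000), seed=42)
run_1 = bank.sample_test_set(100)
run_2 = bank.sample_test_set(100)
```

Because consumed ids are removed from the pool, any two runs are disjoint by construction, at the cost of the bank eventually running out.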
Beyond Scaling: Measuring and Predicting the Upper Bound of Knowledge Retention in Language Model Pre-Training
Changhao Jiang | Ming Zhang | Yifei Cao | Junjie Ye | Xiaoran Fan | Shihan Dou | Zhiheng Xi | Jiajun Sun | Yi Dong | Yujiong Shen | Jingqi Tong | Baoyu Fan | Tao Gui | Qi Zhang | Xuanjing Huang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The GPT-4 technical report suggests that downstream performance can be predicted from pre-training signals, but offers little methodological detail on how to quantify this. This work addresses this gap by modeling knowledge retention, the capacity of a pre-trained language model to memorize factual information from its corpus, and introduces a principled method to estimate it prior to training. We propose Size-dependent Mutual Information (SMI), an information-theoretic predictor that integrates knowledge frequency, knowledge specificity, and model size to forecast closed-book question answering (QA) accuracy. SMI is validated through large-scale document retrieval over the disclosed pre-training corpora of 21 public and 3 custom models, combined with a robust multi-template QA evaluation. Experiments show that SMI significantly outperforms repetition-based baselines and achieves R² > 0.7 in predicting QA accuracy for models above 1B parameters, without additional training. The analysis further reveals diminishing returns from scaling data and model size and provides evidence for an intrinsic upper bound on knowledge retention achievable by pre-training alone, motivating retrieval and other augmentation strategies.
2025
LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation
Ming Zhang | Yujiong Shen | Zelin Li | Huayu Sha | Binze Hu | Yuhui Wang | Chenhao Huang | Shichun Liu | Jingqi Tong | Changhao Jiang | Mingxu Chai | Zhiheng Xi | Shihan Dou | Tao Gui | Qi Zhang | Xuanjing Huang
Findings of the Association for Computational Linguistics: EMNLP 2025
Evaluating large language models (LLMs) in medicine is crucial because medical applications require high accuracy with little room for error. Current medical benchmarks fall into three main types: medical exam-based, comprehensive medical, and specialized assessments. However, these benchmarks have limitations in question design (mostly multiple-choice), data sources (often not derived from real clinical scenarios), and evaluation methods (poor assessment of complex reasoning). To address these issues, we present LLMEval-Med, a new benchmark covering five core medical areas, including 2,996 questions created from real-world electronic health records and expert-designed clinical scenarios. We also design an automated evaluation pipeline, incorporating expert-developed checklists into our LLM-as-Judge framework. Furthermore, our methodology validates machine scoring through human-machine agreement analysis, dynamically refining checklists and prompts based on expert feedback to ensure reliability. We evaluate 13 LLMs across three categories (specialized medical models, open-source models, and closed-source models) on LLMEval-Med, providing valuable insights for the safe and effective deployment of LLMs in medical domains.
PFDial: A Structured Dialogue Instruction Fine-tuning Method Based on UML Flowcharts
Ming Zhang | Yuhui Wang | Yujiong Shen | Tingyi Yang | Changhao Jiang | Yilong Wu | Shihan Dou | Qinhao Chen | Zhiheng Xi | Zhihao Zhang | Yi Dong | Zhen Wang | Zhihui Fei | Mingyang Wan | Tao Liang | Guojun Ma | Qi Zhang | Tao Gui | Xuanjing Huang
Findings of the Association for Computational Linguistics: ACL 2025
Process-driven dialogue systems, which operate under strict predefined process constraints, are essential in customer service and equipment maintenance scenarios. Although Large Language Models (LLMs) have shown remarkable progress in dialogue and reasoning, they still struggle to solve these strictly constrained dialogue tasks. To address this challenge, we construct the Process Flow Dialogue (PFDial) dataset, which contains 12,705 high-quality Chinese dialogue instructions derived from 440 flowcharts containing 5,055 process nodes. Based on the PlantUML specification, each UML flowchart is converted into atomic dialogue units, i.e., structured five-tuples. Experimental results demonstrate that both a 7B model trained on merely 800 samples and a 0.5B model trained on the full dataset can surpass 90% accuracy. Additionally, the 8B model can surpass GPT-4o by up to 43.88%, with an average of 11.00%. We further evaluate models’ performance on challenging backward transitions in process flows and conduct an in-depth analysis of various dataset formats to reveal their impact on model performance in handling decision and sequential branches. The data is released at https://github.com/KongLongGeFDU/PFDial.
Co-authors
- Tao Gui 7
- Xuan-Jing Huang (黄萱菁) 7
- Shihan Dou 6
- Zhiheng Xi 6
- Ming Zhang 6
- Qi Zhang 5
- Yujiong Shen 4
- Yifei Cao 3
- Jiajun Sun 3
- Jingqi Tong 3
- Junjie Ye (叶俊杰) 3
- Mingxu Chai 2
- Yi Dong 2
- Xiaoran Fan 2
- Tao Ji 2
- Hui Li 2
- Shichun Liu 2
- Huayu Sha 2
- Yuhui Wang 2
- Yilong Wu 2
- Qi Zhang 2
- Jiabao Zhuang 2
- Xinmeng Che 1
- Jiahao Chen 1
- Jiecao Chen 1
- Qinhao Chen 1
- Jingyi Deng 1
- Zhengyin Du 1
- Baoyu Fan 1
- Zhihui Fei 1
- Binze Hu 1
- Chenhao Huang 1
- Yueyuan Huang 1
- Zelin Li 1
- Tao Liang 1
- Guojun Ma 1
- Qiyuan Peng 1
- Kexin Tan 1
- Mingyang Wan 1
- Hanchen Wang 1
- Junzhe Wang 1
- Yuhui Wang 1
- Yuran Wang 1
- Zhen Wang 1
- Mingqi Wu 1
- Zhenghao Xiang 1
- Yufei Xu 1
- Tingyi Yang 1
- Zhixiong Yang 1
- Xuesong Yao 1
- Yue Zhang 1
- Yunke Zhang 1
- Zhihao Zhang 1
- Zhihao Zhang 1