Peng Cheng
2025
Velocitune: A Velocity-based Dynamic Domain Reweighting Method for Continual Pre-training
Zheheng Luo | Xin Zhang | Xiao Liu | Haoling Li | Yeyun Gong | Qi Chen | Peng Cheng
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
It is well known that a diverse corpus, typically constructed as a mixture of various domains, is critical for training large language models. Previous efforts generally either sample training data from different domains with static proportions or dynamically adjust these proportions during training to optimise pretraining performance. However, few methods have addressed the complexities of domain-adaptive continual pre-training. To fill this gap, we propose Velocitune, a novel framework that dynamically assesses learning velocity and adjusts data proportions accordingly, favouring slower-learning domains while de-emphasising faster-learning ones, guided by a scaling law that estimates the desired learning goal for each domain at a lower associated cost. To evaluate the effectiveness of Velocitune, we conduct experiments on a dataset focused on reasoning tasks with CodeLlama, as well as on a corpus of system commands using Llama3 and Mistral. Velocitune achieves performance gains in both math and code reasoning tasks and command-line generation benchmarks. Further analysis reveals that key factors driving Velocitune’s effectiveness include target estimation and data ordering.
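The velocity-based reweighting idea described in the abstract can be illustrated with a minimal sketch. The formulation below is an assumption for illustration only, not the paper's exact method: it treats a domain's "velocity" as the fraction of the loss gap already closed toward a scaling-law-estimated target loss, and assigns larger sampling proportions (via a softmax) to domains with more of their gap remaining, i.e. slower-learning domains.

```python
import numpy as np


def velocity_based_weights(initial_loss, current_loss, target_loss, temperature=1.0):
    """Hypothetical sketch of velocity-based domain reweighting.

    Assumes each domain's learning progress is measured as the fraction of the
    loss gap already closed toward a scaling-law-estimated target, and that
    slower-learning domains (larger remaining gap) receive larger sampling
    weights via a softmax. This is an illustration, not Velocitune's exact
    formulation.
    """
    initial = np.asarray(initial_loss, dtype=float)
    current = np.asarray(current_loss, dtype=float)
    target = np.asarray(target_loss, dtype=float)

    # Fraction of the loss gap still remaining for each domain (1.0 = no progress).
    remaining = (current - target) / np.maximum(initial - target, 1e-8)
    remaining = np.clip(remaining, 0.0, 1.0)

    # Softmax over the remaining gap: slower-learning domains get more weight.
    logits = remaining / temperature
    weights = np.exp(logits - logits.max())
    return weights / weights.sum()


# Example: three domains with different progress toward their estimated targets.
w = velocity_based_weights(
    initial_loss=[3.0, 2.5, 2.8],
    current_loss=[2.6, 1.9, 2.7],
    target_loss=[2.0, 1.7, 2.2],
)
print(w)  # the third domain (least relative progress) gets the largest proportion
```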
2024
NewsBench: A Systematic Evaluation Framework for Assessing Editorial Capabilities of Large Language Models in Chinese Journalism
Miao Li | Ming-Bin Chen | Bo Tang | Shengbin Hou | Pengyu Wang | Haiying Deng | Zhiyu Li | Feiyu Xiong | Keming Mao | Peng Cheng | Yi Luo
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
We present NewsBench, a novel evaluation framework to systematically assess the editorial capabilities of Large Language Models (LLMs) in Chinese journalism. Our constructed benchmark dataset focuses on four facets of writing proficiency and six facets of safety adherence, and it comprises 1,267 manually and carefully designed test samples in the form of multiple-choice and short-answer questions for five editorial tasks across 24 news domains. To measure performance, we propose GPT-4-based automatic evaluation protocols to assess LLM generations for short-answer questions in terms of writing proficiency and safety adherence, both of which are validated by high correlations with human evaluations. Based on this systematic evaluation framework, we conduct a comprehensive analysis of eleven popular LLMs that can handle Chinese. The experimental results highlight GPT-4 and ERNIE Bot as top performers, yet reveal a relative deficiency in journalistic safety adherence in creative writing tasks. Our findings also underscore the need for enhanced ethical guidance in machine-generated journalistic content, marking a step forward in aligning LLMs with journalistic standards and safety considerations. The evaluation framework and experimental results are expected to provide an in-depth understanding of the editorial capabilities of LLMs and to speed up the development of LLMs in journalism.
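The GPT-4-based judging protocol mentioned above can be sketched as an LLM-as-judge scoring loop. The prompt wording, facet name, and 1-5 rating scale below are assumptions for illustration and not NewsBench's released protocol; the code assumes the openai v1 Python SDK and an API key in the environment.

```python
from openai import OpenAI  # assumes the openai v1 Python SDK is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical rubric; NewsBench's released prompts and scales may differ.
JUDGE_PROMPT = (
    "You are evaluating a Chinese journalistic short answer.\n"
    "Question: {question}\n"
    "Model answer: {answer}\n"
    "Rate the answer's {facet} on a 1-5 scale and reply with only the number."
)


def judge_short_answer(question: str, answer: str, facet: str = "writing proficiency") -> int:
    """Score one short-answer generation with a GPT-4 judge (illustrative only)."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer, facet=facet),
        }],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())
```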
2020
Chinese Grammatical Error Correction Based on Hybrid Models with Data Augmentation
Yi Wang | Ruibin Yuan | Yan'gen Luo | Yufang Qin | NianYong Zhu | Peng Cheng | Lihuan Wang
Proceedings of the 6th Workshop on Natural Language Processing Techniques for Educational Applications
A better Chinese Grammatical Error Diagnosis (CGED) system for automatic Grammatical Error Correction (GEC) can benefit foreign learners of Chinese and lower the barriers to learning the language. In this paper, we describe our solution to the CGED2020 Shared Task on Grammatical Error Correction in detail. The task aims to detect and correct grammatical errors in essays written by foreign learners of Chinese. Our solution combines data augmentation, spelling check, and generative grammatical correction methods, and achieved the best recall score in the Top 1 Correction track. Our final result ranked fourth among the participants.