Yizhou Ying


2025

Data efficiency is crucial in domain-specific continual pre-training (CPT) of large language models (LLMs), especially under resource constraints. Aiming for “small data, big impact,” this work addresses the limitations of existing domain-specific data selection strategies, which often rely on scarce labeled data or computationally expensive LLMs. We introduce CDF Sampling with Grammatical Complexity (CDF-GC), an annotation-independent, efficient, and interpretable data selection framework for CPT. Our approach evaluates grammatical complexity comprehensively via lexical diversity and syntactic complexity, and employs a cumulative distribution function (CDF)-based sampling strategy to balance complexity and diversity. To validate the effectiveness of CDF-GC, we conducted experiments on a financial dataset. The results demonstrate that CDF-GC significantly outperforms baselines, achieving a 2.0% improvement in financial QA at the same selection ratio and even surpassing full-data training by 1.7% while using only 20% of the data.
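The abstract does not spell out the CDF-based sampling step, but one plausible reading is quantile-stratified selection over the empirical CDF of per-document complexity scores, so that picks cover the full complexity range rather than clustering at one extreme. The sketch below is a hypothetical illustration of that idea; the function name `cdf_sample` and the equal-mass binning are assumptions, not the paper's exact procedure.

```python
import random


def cdf_sample(scores, k, seed=0):
    """Select k document indices spread evenly across the empirical
    CDF of complexity scores (hypothetical sketch, not CDF-GC itself).

    scores: per-document grammatical-complexity scores
    k:      number of documents to select
    """
    rng = random.Random(seed)
    # Sorting by score makes rank/len(order) the empirical CDF value.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n = len(order)
    picks = []
    # Split the sorted list into k equal-mass quantile bins and draw
    # one document per bin: complexity coverage plus within-bin diversity.
    for b in range(k):
        lo = b * n // k
        hi = max(lo + 1, (b + 1) * n // k)
        picks.append(order[rng.randrange(lo, hi)])
    return picks
```

Because each draw comes from a disjoint quantile bin, the selected subset always spans low-, mid-, and high-complexity documents, which is one way to realize the complexity/diversity balance the abstract describes.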
Despite the rapid development of large language models (LLMs), existing benchmark datasets often focus on low-level cognitive tasks, such as factual recall and basic comprehension, while providing limited coverage of higher-level reasoning skills, including analysis, evaluation, and creation. In this work, we systematically assess the cognitive depth of popular LLM benchmarks using Bloom’s Taxonomy, evaluating both the cognitive and knowledge dimensions. Our analysis reveals a pronounced imbalance: most datasets concentrate on “Remembering” and “Understanding”, while metacognitive and creative reasoning are largely underrepresented. We also find that incorporating higher-level cognitive instructions into the current instruction fine-tuning process improves model performance. These findings highlight the need for future benchmarks to incorporate metacognitive evaluations in order to more accurately assess and enhance model capabilities.
Using human joke texts as the basis, this work comparatively evaluates the ability of four large language models to generate humorous punchlines. Overall, DeepSeek-R1's Chinese humor generation is currently stronger than that of GPT-4o, Qwen2.5-7B, and Qwen3, but it still falls clearly short of human humor ability. When generating punchlines from fixed set-up texts, every model exhibits some degree of “fixed-pattern thinking.” We measured nine linguistic features of human and LLM humor texts. DeepSeek produces the most punchlines similar to human ones and achieves the highest BLEU-4 match. Compared with humans, AI-generated punchlines favor high-frequency common words, contain lower proportions of out-of-vocabulary words and internet neologisms, and are generally longer. Using semantic representations from the Sentence-BERT model, the LLMs' punchlines show consistently shorter semantic association distances than human punchlines. Strengthening rhetorical devices such as homophonic puns and semantic puns is an important route for LLMs to improve humorous text generation. Finally, we discuss the strengths and weaknesses of our evaluation approach and outline three strategies for enhancing LLM humor: optimizing prompt engineering, building humor-oriented multimodal LLMs, and enhancing the interpretability of humorous text during inference.
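The BLEU-4 matching mentioned above can be sketched as sentence-level BLEU-4 between a model punchline and the human punchline. The implementation below is a minimal, self-contained version with add-one smoothing; the exact tokenization and smoothing used in the study are not stated in the abstract, so treat the character-level tokenization shown in the usage note as an assumption (character tokens are a common choice for Chinese text).

```python
import math
from collections import Counter


def bleu4(candidate, reference):
    """Sentence-level BLEU-4 with add-one smoothing (minimal sketch).

    candidate, reference: lists of tokens (e.g. Chinese characters).
    Returns a score in (0, 1]; 1.0 means an exact n-gram match.
    """
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))

    log_prec = 0.0
    for n in range(1, 5):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum((cand & ref).values())  # clipped n-gram matches
        total = max(sum(cand.values()), 1)
        # Add-one smoothing keeps short or non-overlapping sentences
        # from collapsing the geometric mean to zero.
        log_prec += math.log((overlap + 1) / (total + 1)) / 4
    # Brevity penalty discourages trivially short candidates.
    bp = min(1.0, math.exp(1 - len(reference) / max(len(candidate), 1)))
    return bp * math.exp(log_prec)
```

A punchline identical to the human reference scores 1.0, while one sharing no n-grams scores much lower, which is the sense in which a higher BLEU-4 indicates closer surface-form agreement with human punchlines.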