Junhui Zhu

Also published as: 君辉


2025

"This study addresses the pedagogical suitability of example sentences generated by large language models (LLMs). Grounded in cognitive theories of second language acquisition, it constructs a multidimensional framework for evaluating example-sentence quality, covering five core dimensions: well-formedness, context independence, typicality, lexical appropriateness, and syntactic complexity. Using high-quality example sentences collected from Chinese dictionaries and textbooks as a benchmark corpus, and combining them with feature engineering, we build a machine learning model (accuracy 98.6%) that validates the effectiveness of the evaluation framework. On this basis, the study applies the framework to a systematic comparison of LLM-generated example sentences with those in traditional hand-compiled dictionaries. The results show that LLM-generated sentences match dictionary examples in grammatical typicality, lexical difficulty, and character stroke count, but still fall short in context independence, semantic typicality, and lexical frequency. Further experiments find that different prompting strategies affect generation quality, with prompts that incorporate linguistic feature constraints performing best. This study is the first to quantitatively evaluate the educational suitability of LLM-generated example sentences, offering an evaluation paradigm of both theoretical and practical value for the development of intelligent language-teaching systems."

2023

“The chatbot ChatGPT, built on natural language generation technology, can produce answers quickly, but how the language of machine-generated answers differs from authentic human language has not yet been sufficiently studied. This study extracts and computes the distributions of 159 linguistic features in human and ChatGPT answers to open-domain Chinese questions, trains AI detectors with three machine learning algorithms (random forest, logistic regression, and support vector machine (SVM)), and evaluates model performance. Experimental results show that both random forest and SVM achieve high classification accuracy. Through comparative analysis, the study reveals the strengths and weaknesses of the two text types across five dimensions: descriptive features, character and word frequency, character and word diversity, syntactic complexity, and discourse cohesion. The results show that the differences between the two text types are concentrated mainly in three dimensions: descriptive features, character and word frequency, and character and word diversity.”
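The detector pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's code: the 159-dimensional feature vectors are replaced with synthetic data (random vectors with shifted means for the two classes), and all names and sizes are assumptions.

```python
# Sketch of training feature-based human-vs-ChatGPT detectors, assuming
# each text is already represented as a vector of linguistic features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for 159-dimensional feature vectors:
# human texts (label 0) and ChatGPT texts (label 1) with shifted means.
n, d = 400, 159
X = np.vstack([rng.normal(0.0, 1.0, (n, d)),
               rng.normal(0.5, 1.0, (n, d))])
y = np.array([0] * n + [1] * n)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

results = {}
for name, clf in [("random forest", RandomForestClassifier(random_state=0)),
                  ("logistic regression", LogisticRegression(max_iter=1000)),
                  ("SVM", SVC())]:
    results[name] = clf.fit(X_tr, y_tr).score(X_te, y_te)

for name, acc in results.items():
    print(f"{name}: accuracy = {acc:.2f}")
```

On real data, the interesting step is of course the feature extraction itself; here the classifiers simply demonstrate the comparison of the three algorithms on a shared feature matrix.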
“Language teachers spend a lot of time developing good examples for language learners. For this reason, we define a new task for language learning, lexical complexity controlled sentence generation, which requires precise control over the lexical complexity in the keywords to examples generation and better fluency and semantic consistency. The challenge of this task is to generate fluent sentences only using words of given complexity levels. We propose a simple but effective approach for this task based on complexity embedding while controlling sentence length and syntactic complexity at the decoding stage. Compared with potential solutions, our approach fuses the representations of the word complexity levels into the model to get better control of lexical complexity. And we demonstrate the feasibility of the approach for both training models from scratch and fine-tuning the pre-trained models. To facilitate the research, we develop two datasets in English and Chinese respectively, on which extensive experiments are conducted. Experimental results show that our approach provides more precise control over lexical complexity, as well as better fluency and diversity.”
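The core idea of complexity embedding can be sketched in a few lines. This is a hedged illustration, not the paper's implementation: a token's input representation is fused (here, by summation) with an embedding of its assigned complexity level before being fed to a generation model. The vocabulary, level assignments, and dimensions are all invented for the example.

```python
# Sketch of fusing word-complexity-level embeddings into token
# representations, as an input to a sequence generation model.
import numpy as np

rng = np.random.default_rng(0)
vocab = {"language": 0, "teacher": 1, "example": 2, "pedagogy": 3}
level = {"language": 1, "teacher": 1, "example": 2, "pedagogy": 4}  # 1 = easy .. 4 = hard

d_model = 8
tok_emb = rng.normal(size=(len(vocab), d_model))  # token embedding table
lvl_emb = rng.normal(size=(5, d_model))           # complexity-level embedding table

def embed(word: str) -> np.ndarray:
    """Token embedding fused (summed) with its complexity-level embedding."""
    return tok_emb[vocab[word]] + lvl_emb[level[word]]

x = np.stack([embed(w) for w in ["language", "example", "pedagogy"]])
print(x.shape)
```

In a trained model both tables would be learned jointly with the generator; summation is one simple fusion choice, analogous to how positional embeddings are added to token embeddings.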

2022

The construct of linguistic complexity has been widely used in language learning research, and several text analysis tools have been created to analyze it automatically. However, the indexes supported by existing Chinese text analysis tools are limited and differ across tools because of their different research purposes. CTAP is an open-source linguistic complexity extraction tool that supports a wide range of research purposes. Although it was originally developed for English, the Unstructured Information Management Architecture (UIMA) framework it is built on allows the integration of other languages. In this study, we integrated a Chinese component into CTAP, describing the index set it incorporates and comparing it with three existing linguistic complexity tools for Chinese. The index set includes 196 linguistic complexity indexes at four levels: character, word, sentence, and discourse. So far, CTAP supports the automatic computation of complexity features for four languages, aiming to help linguists without an NLP background study linguistic complexity.
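To make the notion of character- and sentence-level indexes concrete, here is a minimal sketch, not CTAP itself, of the kind of surface indexes such a tool computes for a Chinese text: character count, character type-token ratio, and mean sentence length in characters. The function name and index choices are illustrative.

```python
# Sketch of a few character- and sentence-level complexity indexes
# for Chinese text (CTAP computes 196 such indexes at four levels).
import re

def complexity_indexes(text: str) -> dict:
    # Split on common Chinese sentence-final punctuation.
    sentences = [s for s in re.split(r"[。!?]", text) if s]
    # Keep only CJK characters for character-level counts.
    chars = [c for c in text if "\u4e00" <= c <= "\u9fff"]
    return {
        "num_characters": len(chars),
        "char_ttr": len(set(chars)) / len(chars) if chars else 0.0,
        "num_sentences": len(sentences),
        "mean_sentence_length": len(chars) / len(sentences) if sentences else 0.0,
    }

idx = complexity_indexes("我爱语言学。语言学很有趣!")
print(idx)
```

Real tools add much deeper measures (stroke counts, word frequency bands, parse-based syntactic indexes, discourse cohesion), but each one reduces to a function of this shape: text in, numeric index out.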