Proceedings of the 24th China National Conference on Computational Linguistics (CCL 2025)

Maosong Sun, Peiyong Duan, Zhiyuan Liu, Ruifeng Xu, Weiwei Sun (Editors)


Anthology ID: 2025.ccl-1
Month: August
Year: 2025
Address: Jinan, China
Venue: CCL
Publisher: Chinese Information Processing Society of China
URL: https://aclanthology.org/2025.ccl-1/

"For the task of automatically generating multiple-choice questions on Chinese word sense discrimination, this paper proposes an intelligent question-generation framework based on retrieval-augmented generation (RAG). The framework builds a multi-dimensional difficulty assessment model combining vocabulary level, word frequency, and sentence length to enable personalized control of exercise difficulty. By integrating a linguistic-element knowledge base with the BCC corpus, it improves the naturalness of contexts and the quality of distractors, and introduces a multi-dimensional validation mechanism covering format checking, logical verification, and answer-uniqueness detection to ensure that generated questions meet pedagogical standards. Experimental results show that the method significantly outperforms traditional fine-tuned models on key metrics such as generation success rate, answer correctness, and content diversity, demonstrating strong pedagogical adaptability and application potential and offering a new technical path for intelligent Chinese language teaching."
"Although neural machine translation for high-resource languages has made remarkable progress, low-resource languages face a far more severe shortage of parallel data. To address this, we propose DiRec, a diversity-oriented data restructuring augmentation method for Tibetan-Chinese neural machine translation. Leveraging the bidirectional language abilities of large language models, DiRec applies three kinds of restructuring to existing Tibetan-Chinese parallel data: constituent restructuring, sentence-pattern restructuring, and style restructuring, followed by two rounds of automatic quality filtering to obtain diversity-enhanced data. In Tibetan-Chinese machine translation experiments, the DiRec-based model improves the generalization metric by 4.83 percentage points over the baseline, with BLEU up 0.55 and chrF++ up 0.20. We conclude by analyzing how different restructuring strategies affect translation performance."
"Using human joke texts as a basis, we evaluate the ability of four large language models to generate humorous punchlines. Overall, DeepSeek-R1 currently generates Chinese humor better than GPT-4o, Qwen2.5-7B, and Qwen3, but all models remain clearly below human humor ability. When generating punchlines from fixed expressions, every model exhibits some degree of 'mental set' rigidity. We examine nine linguistic features of human and LLM humor texts. DeepSeek shares the most punchlines with humans and achieves the highest BLEU-4 match. Compared with humans, AI-generated punchlines favor high-frequency common words, contain fewer out-of-vocabulary words and internet neologisms, and are generally longer. Using Sentence-BERT semantic representations, LLM punchlines show shorter semantic association distances than human ones. Strengthening rhetorical devices such as homophonic and semantic puns is an important route for LLMs to improve humor generation. Finally, we discuss the strengths and weaknesses of our evaluation approach and outline three strategies for enhancing LLM humor: optimizing prompt engineering, building humor-oriented multimodal LLMs, and improving the interpretability of humorous text during reasoning."
"Legal event detection aims to identify and classify events in legal texts. However, the complexity of legal cases makes collecting high-quality annotated data very challenging. Domain annotation currently relies mainly on manual labor, which is costly and time-consuming. Traditional active learning can reduce part of the annotation burden but still depends on human intervention. Large language models open the possibility of automated annotation, yet ensuring annotation reliability remains an open problem. To this end, this paper proposes a novel collaborative training paradigm: active learning iteratively selects training data, a large language model generates high-quality annotations, and an evaluation-and-filtering mechanism retains only reliable labels, greatly reducing manual annotation effort. Experiments on two event detection benchmarks show that the method substantially lowers human annotation requirements in low-resource settings and, in some cases, approaches the performance of supervised learning."
"Open-domain question answering typically retrieves multiple relevant documents from large-scale data and uses a large language model to understand them and generate an answer. However, for low-resource languages such as Burmese and Lao, retrieved data may contain noisy documents irrelevant to the question, and LLMs understand these languages poorly, yielding high answer error rates. We therefore propose a low-resource open-domain QA method based on multi-dimensional answer filtering, converting the usual single-step LLM answer generation into a multi-stage process of candidate answer generation and selection. In the generation stage, diverse candidate answers are extracted from the documents; in the filtering stage, a multi-dimensional strategy selects the best answer via global document-level answer verification, local evidence-based answer verification, and relevance ranking among answers. Experiments on open-domain QA datasets for four low-resource Southeast Asian languages show that, on LLM backbones such as GPT-4o-mini and DeepSeek-V3, the proposed method outperforms strong baselines including chain-of-thought and summary verification, confirming the effectiveness of the multi-stage generate-and-filter process for low-resource open-domain QA."
"Joint entity and relation extraction in the judicial domain is important for many downstream tasks such as sentencing prediction and knowledge base construction. However, data resources in this vertical domain are scarce, and judicial texts contain complex long sentences and overlapping relations, making information extraction challenging. To address this, we first annotate a judicial-domain dataset covering multiple charges, then propose a joint-extraction table-filling method based on triple region vertices. We label triple boundaries via multi-label classification to extract triples, fully exploiting entity boundary information. To incorporate the distance between entity pairs, we introduce distance embeddings and use dilated convolutions to capture multi-scale contextual information. Evaluated on the judicial dataset, our model achieves state-of-the-art performance."
"Two-stage event coreference resolution methods suffer from a trigger-lemma heuristic that lacks synonym clustering ability and from small models' limited capacity to understand the events that triggers refer to. We propose an efficient two-stage event coreference resolution method enhanced by large language models: the first stage uses an LLM for synonym clustering, and the second stage has the LLM provide trigger explanation texts to strengthen the small model. We also design a loss function that guides the small model to focus on trigger feature vectors. While maintaining near-linear time complexity, our method improves CoNLL F1 by 2.9 and 8.0 points on the ECB+ and GVC datasets, respectively."
"Multi-modal knowledge graph completion (MMKGC) discovers unobserved latent facts in a given multi-modal knowledge graph (MMKG) by fusing structured semantic information between entities with multi-modal features. However, existing methods generally ignore the interaction between modalities during entity representation and pay little attention to the complementarity between modalities during completion. To address these shortcomings, we propose MIDF (Modality Interaction and Decision Fusion), a new model that handles multi-modal interaction and complementarity. It first introduces an entity multi-modal interaction fusion module, which lets an entity's image and text features interact before fusing them with structural features, fully learning entity embeddings. To further exploit inter-modal complementarity during completion, we design a relation-guided decision fusion module that fuses the predictions of different modalities using relation-guided weights. Extensive experiments on DB15K and MKG-W demonstrate that MIDF outperforms existing state-of-the-art models, confirming the effectiveness of our approach."
"Accurately recognizing medical named entities in Chinese text is key to structuring Chinese medical information. Traditional machine learning methods struggle with blurry entity boundaries and complex nested structures in Chinese medical text. This paper proposes an LLM-based Chinese medical named entity recognition method: the task is first reformulated as text generation, with a tagging strategy that uniformly handles flat and nested entities; an entity filter then removes erroneous candidates; finally, LLM-based decision making resolves conflicts and ensembles multiple models to improve overall robustness. Experiments on the CMeEE-V2 and CCKS2019 datasets show that the method reaches state-of-the-art accuracy and robustness, with F1 scores of 0.7785 and 0.8821, respectively."
"Targeting the pedagogical suitability of example sentences generated by large language models (LLMs), this study builds a multi-dimensional sentence quality assessment framework grounded in cognitive theories of second language acquisition, covering five core dimensions: normativity, context independence, typicality, lexical appropriateness, and syntactic complexity. Using high-quality example sentences from Chinese dictionaries and textbooks as benchmark corpora, we combine feature engineering to build a machine learning model (98.6% accuracy), validating the framework's effectiveness. On this basis, we systematically compare LLM-generated sentences with those from traditionally compiled dictionaries. Results show that LLMs match dictionary sentences in grammatical typicality, lexical difficulty, and character stroke counts, but still fall short in context independence, semantic typicality, and lexical commonness. Further analysis finds that prompting strategies affect generation quality, with prompts incorporating linguistic feature constraints performing best. This is the first quantitative assessment of the educational suitability of LLM-generated example sentences, offering an evaluation paradigm of both theoretical and practical value for developing intelligent language-teaching systems."
"Compared with phonographic writing systems such as English, Chinese is semantically rich: a single character carries abundant semantic features such as pronunciation, glyph structure, and radicals, which are uniquely valuable in natural language processing applications and can serve as extra features to improve performance on specific tasks. In recent years, large language models have developed rapidly, showing vast knowledge and strong reasoning ability; their grasp of the rich semantics of Chinese characters can be viewed as the foundation of their Chinese capability. However, little work has evaluated this ability, and probing its boundaries helps us understand the gap between LLMs' Chinese and English abilities and anticipate their performance on glyph- and pronunciation-related downstream tasks. We therefore comprehensively evaluate LLMs along six dimensions of Chinese characters: structure, radicals, pronunciation, strokes, polyphones, and components. Using the GB2312 character set and the Contemporary Chinese Dictionary as references, we construct a series of question-answer pairs along these six dimensions and define principled scoring criteria, then evaluate more than ten mainstream LLMs in depth. To probe the gap between Chinese and English abilities, we translate the tasks into English and compare three representative models. We further design reasoning tasks from three key angles, structure inference, radical inference, and pronunciation inference, to assess LLMs' reasoning over these character features. Our results offer a valuable reference for researchers in optimizing Chinese downstream tasks and selecting base models."
"Retrieval-Augmented Generation (RAG) is an effective way to improve large language model performance on process-specification question answering. However, naive RAG based on fixed-length text chunking performs poorly on this task: process specifications are complex technical documents, and fixed-length chunking loses the hierarchical structure between paragraphs and the implicit knowledge associations among them, degrading output quality. This paper therefore proposes a RAG construction method that exploits the implicit tree structure among the paragraphs of a process specification, effectively solving the loss of inter-paragraph knowledge associations caused by fixed-length chunking. Experiments show that tree-structured RAG outperforms naive RAG, improving average ACC by 3.81%, ROUGE-L by 3.28%, and BLEU-4 by 2.97%, validating its effectiveness."
"Commonsense reasoning requires a model to infer implicit information from everyday experiential knowledge in order to understand and predict plausible real-world situations. A current research trend introduces external knowledge bases for additional background knowledge. However, existing commonsense reasoning models suffer from imprecise external information and insufficient fusion, hurting practical performance. We propose a two-stage commonsense reasoning method based on retrieval-augmented generation. It builds a knowledge base of 6.28M Wikipedia articles and supplies the model with semantically relevant context as supplementary information to aid reasoning. To save time and resources, we further propose a two-stage inference strategy: simple questions are handled by a small model, complex ones by a large model. Experiments on OpenBookQA and other datasets show superior performance, and the method adapts to different backbones and LLMs in a plug-and-play fashion."
"In recent years, large language models such as ChatGPT have markedly improved machine understanding of natural language. Question-answering reasoning is central to advancing language understanding and intelligent human-computer interaction, but many challenges remain. Addressing the high resource cost of large models, the weak reasoning of small models, and the limited reasoning ability in low-resource languages, this paper proposes a method combining chain-of-thought prompting with fine-tuning: a Human-Thinking prompting strategy improves large-model reasoning; instruction fine-tuning with large-model outputs boosts small-model reasoning; and a multi-role collaboration mechanism further refines the quality of reasoning steps. We also explore cross-lingual chain-of-thought prompting, using high-resource-language knowledge to compensate for low-resource languages, and integrate reasoning knowledge across languages through a dual-channel mechanism and a voting-scoring mechanism. Experimental results show that the method effectively improves small models' multilingual QA reasoning ability."
"Chinese frame semantic parsing, grounded in frame semantics, identifies the semantic frames evoked by words in a sentence and analyzes the semantic roles of its constituents, revealing the deep semantic structure behind language and supporting better extraction of event relations and contextual information. Since large language models emerged, their strong general text understanding and generation abilities have been applied to many NLP tasks. However, current LLMs show simplistic reasoning paths and low accuracy on Chinese frame semantic parsing, particularly lacking logical coherence in chains of thought and deep use of retrieval-augmented generation. This paper proposes a thought-prompting method for Chinese frame semantic parsing that combines retrieval-augmented generation (RAG) with chain-of-thought (CoT) techniques to guide LLMs through the task. Experiments on the CFN2.1 dataset show that, compared with the best prior method, our approach improves frame identification accuracy by 13.52%, argument identification F1 by 2.24%, and role identification F1 by 5.09%."
"Sarcasm and metaphor are common rhetorical devices in literature and everyday language. Prior work has mostly focused on classification tasks and on English data. With the rapid emergence of large and multimodal large models, performance on many NLP and multimodal tasks has improved significantly. This paper uses GPT-4o for automatic data synthesis to train multimodal large models for comprehensive image-text sarcasm and metaphor understanding. We train comparatively small multimodal models that can understand sarcastic or metaphorical content in images or image-text pairs and produce detailed explanations or captions, while maintaining robustness and general capability. We carefully design the data construction pipeline, including data source selection and the synthesis of instructions and responses, to obtain high-quality multimodal sarcasm-metaphor instruction-tuning data. We adopt a strong current multimodal model as the backbone and train it on the synthetic data combined with public multimodal image-text datasets. For evaluation, we assess both sarcasm-metaphor understanding and general ability, verifying the model's usability. Our data and model weights will be released at https://github.com/652897698/Multimodal-LLMs-for-Sarcasm-and-Metaphor-Undrerstanding"
"To address the lack of dedicated research on evaluative nouns and of a deep sentiment knowledge ontology in Chinese opinion target extraction, we propose a constructional-knowledge-ontology-driven method for extracting evaluative noun-opinion target pairs. Based on the syntactic role the opinion target plays in a sentiment construction, we generalize nine sentiment constructions, including subject-type, attributive-type, and appositive-type; we distill the meaning pattern of each construction and precisely specify the formal features by which a machine can identify the opinion target in each pattern; we define formal symbols and logical operation rules to convert the nine patterns and their features into a machine-readable formal language; and we build a semantic lexicon and a sentiment-construction rule base, implemented as the intelligent extraction system CUCNsas. On a test corpus of 10,000 sentences from People's Daily and CCTV's Xinwen Lianbo, CUCNsas achieves 88.3% precision, 82.1% recall, and 85.1% F1. For Chinese evaluative noun-opinion target extraction, construction grammar, which focuses on the form-meaning pairing of the sentence as a whole, proves superior to semantic-feature methods, phrase-structure grammar, and dependency grammar."
"How can machine translation quality be assessed automatically without human reference translations? One existing quality estimation approach uses heterogeneous translation systems to translate the source sentence directly, treats the outputs as pseudo-references, and compares the machine translation against them. To make the generated pseudo-references better expose the errors in the translation under evaluation, this paper proposes a reflection-based pseudo-reference generation method and applies it to machine translation quality estimation. The heterogeneous system that generates pseudo-references is a reflective agent that treats the translation under evaluation as a key element of the generation process; its reasoning steps include back-translating the machine translation, reflecting on the source sentence and the back-translation, producing revision suggestions for the machine translation based on that reflection, and generating candidate pseudo-references. Experiments on the WMT'23 sentence-level quality estimation benchmark show that the proposed method significantly improves quality estimation performance."
"Information-theoretic studies of language have revealed pervasive cognitive constraints of efficiency and learnability in linguistic systems. This study examines these constraints in the homophone family system of Modern Chinese and finds that (1) within the system, family efficiency correlates positively with learnability; (2) compared with computationally simulated systems and a pinyin-ized system, the homophone family system is less learnable but more efficient; and (3) these findings hold regardless of whether tones, phonetic components, and rare characters are considered. The results show that the Chinese homophone family system trades off efficiency against learnability, revealing the cognitive mechanism behind the formation of its vast scale."
"As large language models show strong generalization in multi-task learning, their value in low-resource classical Chinese settings demands exploration. Building on LLaMA3-Chinese-8B, we perform incremental pre-training with 21GB of high-quality classical Chinese corpora, then fine-tune on ten tasks (sentence segmentation, part-of-speech tagging, named entity recognition (NER), event recognition, translation, word explanation, reverse dictionary, historical-figure knowledge, poetry appreciation, and poetry generation), using both single-task and paired-task fine-tuning strategies. Across 55 experiments we quantify the positive and negative gains between tasks, systematically revealing for the first time the gain relationships in classical Chinese multi-task learning. Results show both synergy and interference effects between tasks, with clear asymmetry. Basic classical Chinese tasks exhibit stronger synergy with each other, while translation and generation tasks show weaker synergy. Task stability also varies markedly under the paired-task setting."
"Pre-trained language models achieve excellent multi-task performance through large-scale unsupervised learning, but research concentrates on high-resource languages such as Chinese and English. For low-resource languages like Tibetan, data scarcity and morphological complexity (agglutinative properties, diverse syllable structures) cause mainstream subword tokenization to suffer from semantic fragmentation and morphological mismatch, limiting training efficiency and representation quality. We propose TibLex (Tibetan Latinization-based Extended Tokenizer), a Latinization-based extended tokenization strategy for Tibetan. It transliterates the input text, converting each Tibetan syllable into a short sequence based on its glyph or pronunciation, then builds a vocabulary via subword tokenization over the encoded text. Experiments show TibLex has two advantages over mainstream tokenizers: (1) Latinization reduces irregular vocabulary combinations by 15% and shortens input sequences by 36.10% on average, markedly improving computational efficiency; (2) the transliteration tokenizer maps homophonous variant spellings to the same transliterated sequence and outputs consistent tokenizations, providing robustness to homophone typos. Meanwhile, pre-trained models trained with TibLex remain competitive on downstream tasks, validating the method in low-resource settings. This work offers a new paradigm for tokenizing morphologically complex languages; its encoding framework can extend to scripts such as Mongolian and Sanskrit, supporting cross-lingual NLP research."
"Diffusion models, as a new generation of generative models, show excellent performance in text-guided image generation. However, the training objectives of existing pre-trained diffusion models usually do not align directly with user preferences or downstream task needs, so their outputs struggle to balance image-text semantic consistency with subjective aesthetic quality. Recent work therefore introduces reinforcement learning into diffusion fine-tuning so that models optimize their generation policy under reward signals, with representative methods such as policy-optimization diffusion models and denoising diffusion policy optimization achieving notable results. However, the reward functions these methods rely on are mostly black-box scorers that cannot capture the structural semantic relations between generated images and input text and lack explicit modeling of cross-modal alignment structure. To address this, we propose GARD (Geometry-Aligned Reinforced Diffusion), a text-guided diffusion fine-tuning method that combines reinforcement learning with a structural alignment regularizer. Within an RL fine-tuning framework, GARD introduces an alignment regularizer based on the geometry of the embedding space: it measures semantic alignment by computing the volume of the parallelotope spanned by the image and text embedding vectors, and combines this with the reward signal and a divergence regularizer into a unified optimization objective, improving generation quality while strengthening multi-modal semantic consistency. Experiments show that GARD significantly outperforms existing methods on several public datasets in semantic consistency, aesthetic score, and training stability, validating its effectiveness and generality in fusing multi-modal structural alignment modeling with RL fine-tuning."
"Traditional stance detection usually assumes a known target and outputs only a stance label (favor, against, neutral), making it ill-suited to cases where the target is uncertain and stance judgments require concrete grounds. We propose a new task, target-adaptive explainable stance detection, defining the model output as target, opinion, and stance label. Specifically, we build the first high-quality Chinese dataset for this task, design multi-dimensional evaluation criteria, and benchmark multiple large language models. Experiments show that DeepSeek-V3 performs best at target identification and stance classification, while GPT-4o leads in opinion generation; LLMs adapt well when targets are explicit, but performance drops on sarcastic inputs. The dataset and results are released at https://github.com/Cassieyy1102/TAISD."
"The performance of large language models on knowledge-intensive tasks depends heavily on the coverage and mastery of their internalized knowledge. However, there is no systematic, fine-grained evaluation method for characterizing model mastery of different knowledge categories. This paper proposes a prompt-probing method to systematically assess LLM mastery of commonsense, factual, and professional domain knowledge. We first build a high-quality knowledge-probing evaluation dataset, KPE-Pro (Knowledge Probing Evaluation for Proficiency), then design prompt templates to systematically evaluate several mainstream LLMs. Results show that LLMs perform well on commonsense knowledge, with ERNIE X1 achieving the best overall score; on factual knowledge they are weaker, and lightweight models show clearly insufficient knowledge mastery. The data is released at https://github.com/cyuu313/KPE-Pro."
"Knowledge graph reasoning (KGR) mines and applies the logical rules implicit in a knowledge graph to infer and discover new facts, with wide applications in question answering, semantic search, and recommender systems. Because embedding-based KGR algorithms lack interpretability, researchers have recently turned to rule-based reasoning. However, existing rule-based methods struggle to capture implicit associations between relations when interpreting relation semantics and easily fall into local optima. We propose ReSA, a rule-mining model enhanced with relation-structure awareness. It builds a relation graph to explicitly model the hierarchical structure among relations, improving rule-mining efficiency, and uses a global rule fusion module and a relative relation encoder to combine global semantic modeling with local structural modeling, strengthening the model's perception of the overall logic of rule bodies. Experiments show that ReSA delivers significant gains on datasets such as WN18RR, improving MRR by 4 percentage points over the latest rule-mining methods."
"This paper proposes a multi-agent collaborative framework for generating distraction data, aiming to evaluate the robustness of large language models under complex distraction. Starting from the mathematics domain and gradually extending to medicine, law, science, and general scenarios, the framework builds AntIF, a cross-domain dataset of nearly 5,000 items covering four distraction types: spelling, numerical, type, and rumor distractions. On this basis, we systematically evaluate the anti-distraction ability of mainstream open-source language models and, combining different prompt engineering strategies and fine-tuning methods, analyze in depth AntIF's practical effect on improving model robustness."
"Addressing the two major challenges facing Chinese parallelism-sentence research, the scarcity of high-quality corpora and the absence of fine-grained annotation, this paper builds a Chinese parallelism corpus with multi-dimensional annotations of topic, emotional tone, parallelism markers, and keywords. On this basis, we propose K-CoT, a keyword-guided chain-of-thought framework for parallelism generation that mimics the cognitive process of human rhetorical composition, decomposing generation into a progressive reasoning pipeline of topic deconstruction, feature mapping, keyword generation, and sentence-pattern synthesis. Experiments on mainstream models such as ChatGLM and Llama show that K-CoT achieves significant gains on parallelism generation. The work contributes a novel dataset for parallelism research and an interpretable technical path for optimizing the rhetorical ability of generative models; its staged reasoning mechanism has general value for improving the semantic controllability of language models."
"As demand grows for large-scale digitization of sign language, its phonological annotation and standardization become increasingly urgent. However, sign language is a visual-spatial language unlike spoken languages; the complexity of its multi-channel information (handshape, location, palm orientation, movement, plus non-manual features such as facial expression and torso movement) and the lack of unified annotation standards have long constrained corpus construction and automatic analysis. Guided by sign language phonology, this study proposes a systematic specification for phonological annotation of Chinese Sign Language. It comprises principles and detailed rules: the principles define annotation granularity and the delimitation and layering of annotation units; the rules provide concrete annotation examples and operating guidelines for multi-channel features. The specification provides foundational support for systematically annotating the multi-channel features of Chinese Sign Language and will promote the development of sign language recognition, translation, generation, and teaching platforms, accelerating the standardization of Chinese Sign Language information processing."
"Nested named entity recognition (NER) is a fundamental NLP task that identifies and extracts nested entities and their semantic types. The mainstream approach is span-based: entity recognition is cast as span classification, which handles nested entities effectively. However, span-based methods cannot accurately distinguish subtle semantic differences between similar entities, and exhaustive span enumeration produces many noisy spans that hurt model performance. We propose a method that quantifies the model's predictive uncertainty and uses it to aid inference, reducing the impact of noisy spans, while a local semantic discrimination module distinguishes semantic differences between entities. Specifically, to counter noisy spans, we design an uncertainty-guided KNN-assisted decision mechanism that corrects predictions when uncertainty is high. To address weak recognition of fuzzy entity boundaries and semantic overlap, the local semantic discrimination module models representation differences between a span and its neighboring spans, guiding the model to attend to fine-grained semantic differences and improving nested-entity recognition. The method achieves F1 scores of 81.27% on the English GENIA dataset and 82.26% on a self-built Chinese nested dataset, improvements of 0.52% and 1.48% over the baselines, validating its effectiveness for nested NER."
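The "noisy span" problem this abstract mentions comes directly from how span-based NER enumerates candidates; a minimal sketch of that enumeration step (generic, not the paper's implementation) makes the issue concrete, since most enumerated spans are not entities:

```python
from typing import List, Tuple

def enumerate_spans(tokens: List[str], max_len: int = 4) -> List[Tuple[int, int]]:
    """Enumerate all candidate spans up to max_len tokens (end index inclusive).

    Span-based NER classifies every such (start, end) pair; nested entities
    are naturally covered, but most spans are noise the classifier must reject.
    """
    spans = []
    for start in range(len(tokens)):
        for end in range(start, min(start + max_len, len(tokens))):
            spans.append((start, end))
    return spans

tokens = ["IL-2", "gene", "expression"]
spans = enumerate_spans(tokens, max_len=2)
print(spans)  # [(0, 0), (0, 1), (1, 1), (1, 2), (2, 2)]
```

With `max_len` unbounded the candidate count grows quadratically in sentence length, which is why the uncertainty-guided correction described above targets exactly these low-confidence span predictions.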
As internet use spreads among children, "toxic residue" and missing value orientation in news content have become pressing safety challenges. This paper proposes CRV-LLM, a multi-model collaborative framework for rewriting children's news, which performs deep risk identification and precise rewriting of original news text along four dimensions: vocabulary, events, headlines, and values. CRV-LLM integrates four lightweight risk-detection models with an R1-Distill-Qwen-32B rewriting model; through inter-model collaboration and feedback, it removes potentially harmful information and embeds positive value guidance while preserving readability for children. Experiments show that CRV-LLM outperforms mainstream models on core metrics such as safety and educational value while improving inference efficiency by 62%, offering an efficient and scalable solution for managing children's internet content safety.
"Query expansion improves retrieval by enriching queries. In LLM-based query expansion with pseudo-relevance feedback, noise and incoherent information in pseudo-relevant documents severely degrade expansion quality. We propose LKQE, a query expansion method in which a large language model and a knowledge graph cooperate. LKQE first retrieves relevant documents and extracts key sentences, then uses the LLM to extract knowledge triples and complete entity relations to build a knowledge graph, and finally generates high-quality expansion text under the graph's guidance. Experiments show that LKQE significantly outperforms baseline models on the DL19 and DL20 datasets."
"In recent years, non-autoregressive image captioning has attracted wide attention for its bidirectional propagation and parallel word generation. Meanwhile, discrete-diffusion-based methods have made notable progress. However, during discrete noising and denoising, existing methods still face key problems: weak image-text association, omission of target objects, insufficient description accuracy, and word repetition. We propose a semantics-aware discrete diffusion model that builds a semantics-aware module via a learnable query mechanism to capture latent associations with object-level semantic features of the image, producing better captions. On top of this base model, we further introduce a self-prompting optimization framework that uses a large language model to generate richer descriptions more faithful to image details. Comprehensive experiments on the COCO dataset show that the method improves image captioning and outperforms related existing methods."
"Corporate news event extraction is a key technology supporting enterprise dynamics analysis and industrial decision-making. Corporate news features long texts and diverse content, posing core challenges such as multi-event extraction and scattered arguments. Although large language models (LLMs) have strong long-range dependency modeling and semantic association abilities, general-purpose LLMs struggle to meet enterprise requirements for domain expertise and resource efficiency. This paper proposes MoE-ML-CNEE (MoE-Enhanced Multi-Task Learning for Corporate News Event Extraction). By building a unified fine-tuning dataset and a multi-task joint training paradigm, it casts event detection and argument extraction as structured language templates, strengthening global modeling. A MoELoRA module uses dynamic routing to achieve knowledge sharing and feature decoupling among multiple expert networks in low-rank space, further improving extraction performance. Experiments show that MoE-ML-CNEE outperforms existing baselines in event detection and argument extraction on the ChiFinAnn and DuEE-fin public datasets and a self-built corporate news dataset."
"Medical named entity recognition is vital for medical information extraction and knowledge graph construction, but the specialization and complexity of the medical domain bring challenges of data scarcity, weak features, and underused context. We propose LLM-MedNER, which exploits the pre-trained knowledge of large language models (LLMs): prompt engineering generates semantically equivalent but diversely phrased augmented text and extracts multi-dimensional features, including keyword sets, semantic descriptions, part-of-speech information, and medical-entity association features, markedly enriching feature representation. A dual-channel MacBERT-BiGRU encoder learns original-text and LLM-augmented features in parallel, fusing the different semantic features via cross-attention. An adaptive multi-granularity dilated convolution layer then captures multi-scale local context with 1-D convolutions at different dilation rates, further enriching word representations, and a Biaffine module in the output layer precisely identifies entity boundaries and types. Comparative experiments show that LLM-MedNER outperforms existing baselines on multiple medical NER datasets; ablation studies further confirm the effectiveness of each module."
"Excellent traditional Chinese culture is an important source of cultural soft power in the new era, and combining traditional values with idioms helps inherit and promote this heritage. This paper proposes a research framework for the contemporary contextual behavior of traditional-values idioms. Based on the BCC corpus, it quantitatively studies the distribution of such idioms and their traditional-value preferences, their sentiment orientation and high-frequency word distributions in contemporary contexts, and their social topics and moral features, and proposes indices of contemporary social-topic and moral adaptability for these idioms, enabling systematic study of their contemporary contextual behavior. The work offers a new perspective for quantitative study of traditional culture and a reference for related digital humanities research, aiming to strengthen the influence of excellent traditional Chinese culture in the new era and contribute to the inheritance and innovation of Chinese civilization."
"Against the backdrop of the Belt and Road Initiative, exchanges between China and Central Asian countries are deepening, creating urgent demand for high-quality cross-lingual information processing. However, parallel corpora between Chinese and Central Asian languages are extremely scarce and of uneven quality, severely constraining downstream tasks such as machine translation, cross-lingual information retrieval, and sentiment analysis. For these low-resource languages, this paper proposes a parallel corpus construction framework combining neural machine translation (NMT) with cross-lingual semantic matching. The method crawls monolingual news from official Central Asian channels, uses the multilingual translation ability of the DeepSeek model to generate pseudo-parallel sentence pairs, obtains cross-lingual sentence embeddings with LaBSE, and filters noise using dynamic cosine-similarity thresholds and margins. Experiments show a 0.65 BLEU improvement over traditional back-translation. The final corpus contains 80,000 sentence pairs across core domains including politics, economics, and culture, laying a solid foundation for improving machine translation, cross-lingual retrieval, and text classification for Central Asian low-resource languages."
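The abstract above does not spell out its exact "dynamic threshold and margin" scheme; the following is a generic margin-based filtering sketch in the same spirit (cosine similarity normalized by each sentence's neighborhood, then thresholded), with small toy vectors standing in for LaBSE embeddings:

```python
import numpy as np

def margin_score(src: np.ndarray, tgt: np.ndarray, k: int = 2) -> np.ndarray:
    """Margin score for candidate pairs (i <-> i): cosine similarity divided by
    the mean similarity of each side's k nearest neighbors, penalizing 'hub'
    sentences that are similar to everything."""
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    sim = src @ tgt.T                                    # pairwise cosines
    nn_src = np.sort(sim, axis=1)[:, -k:].mean(axis=1)   # source-side neighborhood
    nn_tgt = np.sort(sim, axis=0)[-k:, :].mean(axis=0)   # target-side neighborhood
    return np.diag(sim) / ((nn_src + nn_tgt) / 2)

def filter_pairs(scores: np.ndarray, threshold: float = 1.0) -> list:
    """Keep indices of candidate pairs whose margin score clears the threshold."""
    return [i for i, s in enumerate(scores) if s >= threshold]

# Toy 2-D "embeddings": pair 0 and pair 1 are well aligned
src = np.array([[1.0, 0.0], [0.0, 1.0]])
tgt = np.array([[1.0, 0.2], [0.2, 1.0]])
print(filter_pairs(margin_score(src, tgt)))  # [0, 1]
```

Swapping the target rows (misaligning both pairs) drops the scores below 1, so both candidates would be filtered out.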
"Legal judgment prediction is an important task in legal AI. This paper proposes an interpretable dual-system reasoning framework based on external knowledge to address the low accuracy and weak interpretability of existing methods on prison-term prediction. Drawing on dual-system theory from cognitive science, the framework uses the text understanding and generation abilities of large language models to simulate a human judge's decision process, producing prison-term predictions with clear reasoning paths. In addition, a high-quality thinking-augmented dataset and an external statute knowledge base improve the model's explanatory ability and effectively suppress statute hallucination in the statute-judgment model. Experiments show the framework significantly improves accuracy and interpretability on the prison-term prediction subtask of the CAIL-small and CAIL-big datasets."
"Large language models show strong semantic understanding across many NLP tasks. Existing studies usually evaluate them on various semantic parsing datasets, but these datasets poorly cover the colloquial expressions and construction-specific semantics common in dialogue corpora, and thus cannot effectively assess fine-grained semantic understanding in dialogue settings. This paper builds a Chinese construction dataset from dialogue corpora containing 2,146 utterances and 1,748 constructions, achieving fine-grained semantic representation while filling the gap left by existing semantic parsing benchmarks. Based on this dataset, we select representative constructions and, drawing on frame semantics, propose two evaluation tasks, construction identification and construction semantic understanding, to systematically assess LLMs' ability to identify constructions and understand deep semantics in dialogue. Experiments show that current LLMs are still clearly deficient in construction identification and, without chain-of-thought guidance, struggle to understand the deep semantics that constructions carry."
"Intent detection and slot filling are two subtasks of spoken language understanding; modeling them jointly exploits shared features to improve synergy between the tasks. However, existing methods generally lack explicit modeling of sentence-level topic semantics and fail to capture sufficient global semantic information, with performance degrading sharply in multi-intent scenarios. To alleviate this, we propose a topic-aware joint model for intent detection and slot filling: a topic extraction module learns sentence topic distributions, and a topic-guided intent and slot representation enhancement network injects topic information so that the model explicitly uses it while detecting intents and filling slots. Experiments show overall accuracies of 50.9% and 84.8% on the multi-intent MixATIS and MixSNIPS benchmarks, outperforming multiple baselines."
"Analyzing eye-movement data from the Beijing Sentence Corpus with mixed-effects models and Bayesian analysis, this study systematically examines how information density behaves in Chinese reading and interacts with visual complexity. Results show that information density significantly and positively predicts fixation duration: the higher a word's information density, the longer readers fixate on it, consistent with predictive coding's assumption that larger 'prediction errors' increase processing load. Information density also negatively predicts skipping, indicating that denser words are less likely to be skipped and supporting the 'modulation hypothesis' that readers dynamically allocate attention according to information distribution. The study also finds language-specific patterns in Chinese reading: first, the word-length effect differs from alphabetic scripts, with longer Chinese words more likely to be skipped; second, visual complexity interacts non-linearly with linguistic predictability, supporting the 'language specificity hypothesis'. Based on these findings, we propose a 'dual-channel processing model' of Chinese reading in which linguistic prediction (information density) and visual encoding (stroke count, word length) jointly modulate the dynamic allocation of cognitive resources; this framework explains Chinese-specific mechanisms and offers a new perspective for cross-linguistic research on cognitive processing."
"To address the low quality of Tibetan summaries generated from a single text feature, we propose Ti-MISO, a TiLamb-based multimodal abstractive summarization model. It uses a ViT (Vision Transformer) to extract visual features from images and a fine-tuned TiLamb (Tibetan Large Language Model Base) to extract Tibetan text features, fuses the two deeply via cross-modal cross-attention, feeds the fused features into the model, and balances generation quality with beam search. To validate the method, we compare against four other models trained on the same corpus. Ti-MISO achieves the best scores on all four metrics, ROUGE-1, ROUGE-2, ROUGE-L, and BLEU, showing clear advantages in fusing visual and linguistic information and improving summary quality. A series of ablation experiments further confirms the importance of ViT-based image feature extraction and the cross-attention fusion strategy: after adding image information and fusing features with cross-attention, the fused representation retains more key information, helping the model capture the main points more precisely, so the generated summaries are clearly better in both coverage and readability."
"Whisper is a powerful multilingual speech recognition model that excels in high-resource languages such as English, but its performance on low-resource languages like Burmese is still limited by insufficient pre-training data. We propose a self-supervised representation distillation method to optimize Whisper for low-resource speech recognition: a cross-model representation distillation mechanism transfers knowledge from self-supervised model representations to Whisper's encoder, improving its representation modeling for languages such as Burmese. Experiments show the method effectively reduces character error rates on Burmese, Khmer, Uzbek, and Punjabi ASR tasks, validating its effectiveness."
"Garden-path sentences, which contain local or temporary syntactic or semantic ambiguity, are common in both Chinese and English and are valuable for studying language processing and cognitive mechanisms. This paper focuses on large language models' ability to understand and analyze garden-path sentences. We first construct an English-Chinese bilingual garden-path dataset with typical structures, then run cross-lingual, cross-model comparative experiments on syntactic analysis and semantic understanding, examining several LLMs' disambiguation and comprehension of garden-path sentences across languages and comparing them with the traditional Stanford Parser. Results show that LLMs exhibit garden-path effects similar to human cognition and can use noun plausibility and verb bias as cues to resolve sentence ambiguity, with disambiguation on English sentences clearly better than on Chinese. Syntactic and semantic analysis accuracies differ substantially. This empirical study reveals performance differences in how LLMs process ambiguous sentences under different conditions, providing new computational evidence for research on language processing and cognitive mechanisms."
"Dependency distance, under the dependency grammar framework, is an important measure of syntactic processing difficulty. Based on the UD-Vietnamese dependency treebank, this paper analyzes the distribution of dependency distance in Vietnamese and the factors affecting its mean. We find that the distribution fits a mixture of power-law and exponential distributions, and that sentence length, long-distance dependencies, and dependency direction all strongly influence mean dependency distance. These results help reveal Vietnamese syntactic patterns from a dependency-grammar perspective and provide linguistic support for designing more principled dependency parsing algorithms."
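Dependency distance as studied above is a simple quantity to compute from a treebank: for each token, the absolute difference between its position and its head's position. A minimal sketch over CoNLL-U-style 1-based head indices (the exact treebank-reading code is not from the paper):

```python
def dependency_distances(heads: list) -> list:
    """Dependency distances for one sentence given 1-based head indices,
    where 0 marks the root (which contributes no distance)."""
    return [abs(head - (i + 1)) for i, head in enumerate(heads) if head != 0]

# Toy tree: token 2 is the root; tokens 1 and 3 depend on it, token 4 on token 3
heads = [2, 0, 2, 3]
dists = dependency_distances(heads)
print(dists, sum(dists) / len(dists))  # [1, 1, 1] 1.0
```

Averaging these per-sentence values across a treebank, and binning by sentence length or dependency direction, reproduces the kind of mean-dependency-distance analysis the abstract describes.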
"While large language models generate text efficiently, they also enable text misuse; effectively distinguishing texts generated by different LLMs has become a key challenge. To address this, we first build LGT-AA, a dataset for multi-class LLM-generated text detection containing 94k texts across 7 domains written by humans and 10 common LLMs. We then propose a scheme for extracting global discriminative features of texts generated by different LLMs and fuse them with distributional features to build a text detector, improving detection of generated text. Experiments show that our method performs best across different model combinations and generator classes."
"In an era of information explosion, large models must process vast amounts of knowledge and data daily. Lacking large-scale industrial training facilities, small-parameter models become a necessary choice, yet their information-processing demands far exceed their natural storage capacity, raising a core question: what should a small-parameter model remember, and what should it forget? Traditional learn-everything approaches are no longer efficient given limited parameter capacity; trying to remember everything is not only inefficient but may impose excessive cognitive load and degrade reasoning quality. This paper aims to redefine memory strategies for large language models under limited memory resources. We first divide model memory into internal and external memory and systematically examine which knowledge should be prioritized for internalization. On this basis, we propose a personalized memory strategy that builds alignment mechanisms for different types of internal knowledge, making model memory better match human preferences and reasoning needs. This strategy not only markedly strengthens the understanding and deep reasoning abilities of small-parameter models but also fundamentally challenges the traditional assumption that 'remembering more is better', demonstrating the great potential of strategic memory selection for learning efficiency. We also build training and evaluation datasets for internal memory and run systematic experiments on a model of only 3B parameters. Results show our method achieves the best performance on the benchmark, even surpassing closed-source models and 70B-scale models on several metrics. To advance the field, we have open-sourced the full training strategy, model weights, and the corresponding evaluation datasets and protocols."
"Dialogue emotion generation aims to produce the emotion category of the response to be generated. Existing emotion generation models ignore the regulating and guiding role of user-model value consistency, causing a gap between generated emotions and user-expected emotions and weakening emotional resonance between dialogue systems and users. We propose HVDEGM, a human-machine-values-driven dialogue emotion generation model that dynamically injects user value features through a multi-stage gating mechanism to guide emotion generation. Based on the principle of value consistency, the model comprises three units: a context-revision attention unit enhances emotional and semantic feature information through two rounds of attention; a value fusion unit dynamically balances user value features against the dialogue system's historical value features via multi-stage fusion gates; and a response regulation unit strengthens the complementary associations among emotion, semantics, and values through bidirectional and cross attention. On the newly built value-annotated dialogue dataset ValueCon, HVDEGM improves Precision, Recall, F1, and emotional resonance by 2.9%, 2.5%, 0.9%, and 4.1% over baselines such as DialogueRNN and DialogueGCN, demonstrating the effectiveness of the proposed method."
"Classical Chinese gloss selection poses a strong challenge to language models' semantic understanding and context matching. This paper proposes a reinforcement learning training framework that uses outcome-oriented reward design to guide large language models toward better gloss-selection strategies. Experiments show that reinforcement learning outperforms supervised fine-tuning (SFT) in accuracy. Further analysis finds that RL training on gloss selection alone not only improves the model's classical Chinese translation ability but also transfers better across tasks on the Ancient Chinese Language Understanding Evaluation benchmark (ACLUE), whereas the SFT-trained model degrades clearly on translation and other classical Chinese tasks. This work provides a new training paradigm for classical Chinese processing and validates the effectiveness and generalization potential of reinforcement learning on non-reasoning language tasks."
"Automatic word segmentation of classical Chinese is key to digitizing and intelligently processing ancient texts, but classical Chinese exhibits marked diachronic variation over millennia of evolution, posing a severe challenge for building a general segmentation model. To address this, we construct a large-scale segmentation-annotated corpus of classical Chinese spanning three main historical periods, Old, Middle, and Early Modern Chinese, and on this basis propose a period-embedding-based diachronic segmentation model, `RoBERTa-PeriodEmb-Fusion-CRF`. Built on the pre-trained `roberta-classical-chinese-large-char` backbone, it introduces learnable period vectors to perceive a text's era, designs a non-linear fusion layer to integrate period information with contextual semantic representations, and decodes sequences with a conditional random field (CRF). Extensive experiments on the diachronic corpus show that, compared with strong baselines lacking period information, our model significantly improves both overall segmentation performance (F1 of 0.9505) and adaptability to cross-period texts. This work validates the importance of explicitly modeling period information for classical Chinese segmentation and provides useful ideas and data support for building high-performance, general-purpose classical Chinese processing tools."
"Text readability assessment measures how difficult a text is for particular readers and divides into document-level and sentence-level work. Sentence length dominates sentence-level difficulty classification, yet existing sentence-level studies generally fail to control this variable, masking the role of deeper linguistic factors in sentence difficulty. We therefore propose building a length-controlled sentence difficulty corpus. Traditional manual annotation, however, is inefficient and hard to quality-assure for this purpose. To solve this, we propose an LLM-driven controlled rewriting method: generative AI automatically screens content from open corpora to generate candidate sentences, experts then review them for quality, and the result is a length-controlled sentence difficulty corpus with binary and ternary classifications. Experiments on this dataset show that the accuracy of traditional feature-based classifiers drops significantly once length is controlled, revealing their limitations, while LLMs remain highly accurate, indicating they can identify length-independent semantic difficulty."
"Speech-driven gesture generation automatically produces rich virtual-character motions from input speech, with broad applications in digital animation, virtual reality, and human-computer interaction. Although existing methods have made some progress in temporal coherence, the lack of explicit modeling of local inter-joint interactions makes generated body movements mechanical and unnatural. We propose a diffusion model with fine-grained spatio-temporal attention that models dynamic dependencies among skeletal joints at a fine-grained level. Specifically, we design a spatio-temporal Transformer in which the spatial attention layer explicitly models spatial structural relations among joints while the temporal attention layer captures the dynamics of gesture motion. Speaker identity control is introduced via adaptive instance normalization (AdaIN) to enable personalized gesture generation. The model's effectiveness is validated on the BEAT, BEAT2, and SHOW datasets."
"As a universal spatial concept, 'left-right' has continually extended its semantics into political, cultural, and other domains, yet systematic cross-linguistic comparison of this extension remains lacking. Within the framework of lexical typology, this study selects ten languages including Chinese, English, and Norwegian and quantitatively analyzes the semantic extension paths and correspondences of 'left-right' orientation terms. Building on authoritative dictionary senses, we use a large language model (LLM) to generate supplementary corpora, reviewed and corrected by native speakers, and construct a cross-linguistic semantic network for the 'left-right' pair. Results show that 'left-right' universally extends along a three-stage path, space to politics to culture, with highly consistent correspondences across languages. The finding offers new empirical support for the cross-linguistic universality of binary oppositions and enriches typological evidence on the semantic evolution of orientation terms. The proposed hybrid pipeline of agent design, in-context learning, multilingual alignment control, and native-speaker verification provides a replicable scheme for corpus expansion and semantic research on low-resource languages, and the results can serve cross-linguistic semantic exploration and language-teaching design built on oppositional concepts."
"In recent years, large language models (LLMs) have greatly improved translation quality on general text, but quality drops markedly on multi-domain text. Using limited in-domain bilingual parallel data to strengthen domain translation knowledge has become a main research goal; most existing methods learn semantic representations from manually set domain labels, limiting the disambiguation knowledge they can acquire, so constructing effective disambiguation knowledge remains a challenge. We propose a 'topic steering wheel' method for semantic disambiguation in multi-domain translation, designed to strengthen LLMs' disambiguation ability across domains. It comprises: (1) a topic-model-based semantic representation mechanism: we first use ETM automatic clustering to obtain fine-grained topic semantic representations for later construction of disambiguation knowledge; topic representations are closer to semantics and better suited as semantic units, and we design a TopicModel function to convert the LLM's representations into topic representations; (2) a topic-steering-wheel mechanism for acquiring domain disambiguation knowledge: we design a learnable transformation matrix that models the projection directions of topic distributions under different domains to acquire multi-domain disambiguation knowledge; after topic representations are re-transformed by domain-direction projection, effective disambiguation features are reinforced, improving the LLM's disambiguation across domains. Using Qwen-2.5-1.5B as the base model, we validate the method on English-Chinese and German-Chinese multi-domain translation. Results show it surpasses the baselines on average BLEU and COMET; we further analyze the relation between quality gains and disambiguation and illustrate it with translation examples."
"Document-level machine translation automatically translates a source-language document into a target-language document with the same meaning and is a frontier research topic in machine translation. Compared with traditional sentence-level translation, using the document as the translation unit lets models exploit context more effectively, improving translation consistency and coherence, with broad application prospects and research value. Compared with resource-rich languages (such as Chinese, English, and French), Tibetan machine translation resources are scarce, publicly available datasets are few, and no published work has explored document-level translation. We therefore first build a Tibetan-Chinese translation dataset annotated with sentence-, paragraph-, and document-level boundaries, providing high-quality multi-granularity data for Tibetan-Chinese document translation. We then study Tibetan-Chinese document-level machine translation on this dataset and compare translation quality at the sentence, paragraph, and document levels. We open-source the corpus in hopes of advancing related research: https://github.com/liyc7711/tb-zh-mt."
"In recent years, large language models have shown excellent ability to store and retrieve knowledge from training corpora, but their reliability is correspondingly vulnerable to erroneous information in those corpora, producing outdated information and wrong responses. Neuron-identification-based knowledge editing methods precisely modify a model's internal knowledge by identifying and fine-tuning the knowledge neurons related to a target fact. However, this paper finds that the surface form of a piece of knowledge significantly affects neuron identification: for the same fact phrased differently, existing identification methods yield neuron sets with an average overlap of only 21.86%. Editing a single phrasing therefore cannot cover all neurons related to the fact, so existing knowledge editing methods are often not robust. To identify all neurons related to a fact comprehensively and accurately, we design a Lightweight Associated Neuron Detector (LAND), which learns the differences among neuron sets identified from different phrasings of the same knowledge and, during identification, automatically completes the neurons missed due to phrasing differences. Experiments show LAND raises the average overlap of knowledge neurons identified from different phrasings above 96% and improves knowledge-editing success rates across sentence forms by up to 10.83 percentage points over baselines."
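The overlap figures quoted above are averages over neuron sets identified from different phrasings; the abstract does not specify the exact metric, but a plausible sketch is mean pairwise intersection-over-union between the sets (the neuron coordinates below are made up for illustration):

```python
from itertools import combinations

def pairwise_overlap(neuron_sets: list) -> float:
    """Average pairwise overlap (intersection over union) between the neuron
    sets identified for different phrasings of the same fact."""
    pairs = list(combinations(neuron_sets, 2))
    if not pairs:
        return 1.0  # a single set trivially overlaps with itself
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)

# Toy (layer, neuron-index) sets for three phrasings of one fact
sets = [{(3, 10), (3, 11), (5, 7)},
        {(3, 10), (5, 7), (6, 2)},
        {(3, 10), (3, 11), (5, 7), (6, 2)}]
print(round(pairwise_overlap(sets), 3))  # 0.667
```

A detector like LAND would aim to push this number toward 1.0 by completing each phrasing's set with the neurons found only under other phrasings.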
"The diversity and sheer volume of China's current effective laws make comprehensive manual readability assessment infeasible. This study uses large language models to assess the readability of legislative texts, freeing readability research from its path dependence on traditional linguistic feature engineering through an end-to-end deep learning approach. Results show that LLM readability scores for legislative texts correlate significantly with human scores. Analyzing along dimensions such as branch of law, the study systematically reveals significant characteristic differences among laws and characterizes the overall readability landscape of current effective Chinese legislation; through LLM text generation and manual verification, it explores, from the perspective of legal application, possible paths for improving legislative readability, offering a reference for optimizing legislative language."
"The implicit biases of large language models covertly influence their decision-making, making fairness hard to guarantee in applications. This paper first builds a decision-based prompt dataset for implicit bias evaluation; experiments show that stronger LLMs may exhibit more severe implicit bias. To mitigate this, we then explore two approaches: self-reflection and model editing. We find that self-reflection helps identify implicit bias but cannot debias the responses. In model-editing experiments, using a constructed debiasing dataset, fine-tuning the model's last four layers yields the best debiasing effect, showing the potential of limited parameter tuning for mitigating implicit bias."
"The interaction between phonological and semantic processing is a core question in understanding language cognition. Prior work has focused on linear processing paths at the lexical level, paying little attention to the role of sub-syllabic phonological segments in semantic processing. To explore how phonological information modulates lexical-semantic processing and to ground psychological processing models empirically, this study uses event-related potentials (ERP) with an auditory priming paradigm to examine how the rime similarity of the final syllable of Chinese disyllabic words affects semantic processing. We manipulated rime similarity (same/different) and semantic relatedness (related/unrelated) in word pairs, measuring behavioral responses and EEG indices in a semantic judgment task. Results show: (1) rime-similar word pairs elicit a larger negative amplitude in the late N400 window, suggesting rime information significantly modulates semantic processing; (2) semantic priming is significant when rimes differ but disappears when rimes are identical, showing phonological information can affect the time course and strength of semantic processing. The findings indicate that in auditory word processing, structural features of phonological segments such as rime similarity are not only fully perceived but also participate in semantic construction by modulating semantic pre-activation and integration. These results support an interactive phonology-semantics model, reveal the dynamic influence of low-level phonological input on higher-level semantic processing, and provide important evidence for building cognitive models of auditory word recognition."
"With the spread of social media, memes have become important vehicles for information dissemination and opinion shaping, and the hateful content they carry threatens the online ecosystem and public safety. Implicitly hateful memes, expressed through visual implication, cultural metaphor, or social symbols, are especially covert and misleading, posing significant challenges to hateful meme detection. We propose HMUM (Hateful Meme Understanding Model), which applies LoRA fine-tuning to Qwen2.5-VL-72B-Instruct and designs a multimodal, multi-stage prompt-learning framework. The framework guides the model through text recognition, emotion modeling, and hatefulness reasoning in stages, progressively strengthening its understanding of meme semantics and emotion and improving detection of semantically oblique, emotionally complex hateful memes in Chinese contexts. Experiments on the public ToxiCN MM dataset show that HMUM (Qwen) achieves significant overall gains and stronger advantages over baselines on the implicit-hateful-meme subset. To assess detection in broader implicit scenarios, we build ITTD-220, a dataset dominated by implicitly hateful memes; HMUM (Qwen) also outperforms existing models on it, validating its strong generalization ability."
"Unsupervised bilingual lexicon induction (BLI) aligns the monolingual word-embedding spaces of two languages by learning a mapping function and derives word translations from the alignment, with notable success for similar language pairs. However, traditional methods rely on a single linear mapping and underperform on distant or low-resource pairs. We propose DM-BLI, an unsupervised BLI algorithm and application framework based on dynamic multi-subspace alignment. DM-BLI improves alignment precision through multi-subspace mapping: it restructures the source embedding space, identifies subspaces via unsupervised clustering, locates corresponding target subspaces using coarse global alignment, and refines the mapping matrices with intra- and inter-cluster contrastive learning, yielding significant gains in supervised and unsupervised experiments over 5 high-resource and 5 low-resource language pairs. In addition, using the induced lexicons, DM-BLI evaluates the cross-lingual ability of large language models (LLMs) with the logit lens technique, computing cosine similarities on translation and repetition tasks and validating the semantic plausibility of model translations against the semantic features of the embedding space. Whereas traditional LLM cross-lingual evaluation treats static BLI translation pairs as the sole standard, DM-BLI recognizes semantically plausible translations not covered by the lexicon, markedly improving robustness and semantic generalization and measuring LLMs' cross-lingual semantic mapping ability more accurately and comprehensively. Our code is released at https://github.com/huling-2/DM-BLI.git."
"Self-supervised learning (SSL) speech models have achieved remarkable performance across various tasks, with the learned representations often exhibiting a high degree of generality and applicability to multiple downstream tasks. However, these representations contain both speech content and some paralinguistic information, which may be redundant for content-focused tasks. Decoupling this redundant information is challenging. To address this issue, we propose a Self-Supervised Contrastive Representation Learning method (SSCRL), which effectively disentangles paralinguistic information from speech content by aligning similar-content speech representations in the feature space using self-supervised contrastive learning with pitch perturbation and speaker perturbation features. Experimental results demonstrate that the proposed method, when fine-tuned on the LibriSpeech 100-hour dataset, achieves superior performance across all content-related tasks in the SUPERB Benchmark, generally outperforming prior approaches."
"Large Language Models (LLMs) have made significant advancements in sentiment analysis, yet their quality and reliability vary widely. Existing LLM evaluation studies are limited in scope, lack a comprehensive framework for integrating diverse capabilities, and fail to quantify the impact of prompt design on performance. To address these gaps, this paper introduces a set of LLM evaluation criteria with detailed explanations and mathematical formulations, aiding users in understanding LLM limitations and selecting the most suitable model for sentiment analysis. Using these criteria, we apply the Technique for Order Preference by Similarity to an Ideal Solution (TOPSIS), a classic decision-making method, to rank the performance of LLMs in sentiment analysis. We evaluate six popular LLMs on three Twitter datasets covering different topics and analyze the impact of prompt design by assessing model-prompt combinations. Additionally, a validation experiment on a publicly available annotated dataset further confirms our ranking results. Finally, our findings offer valuable insights into the evaluation and selection of LLMs for sentiment analysis."
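TOPSIS, the ranking method named in the abstract above, is compact enough to sketch directly. The scores and weights below are hypothetical (not the paper's data), and all criteria are treated as benefits for simplicity:

```python
import numpy as np

def topsis(matrix, weights):
    """Rank alternatives with TOPSIS (all criteria treated as benefit criteria).

    matrix: alternatives x criteria scores; weights: criterion weights.
    Returns closeness coefficients in [0, 1]; higher means closer to the ideal.
    """
    m = np.asarray(matrix, dtype=float)
    # Vector-normalize each criterion column, then apply the weights
    v = (m / np.linalg.norm(m, axis=0)) * np.asarray(weights, dtype=float)
    ideal, anti = v.max(axis=0), v.min(axis=0)   # ideal / anti-ideal solutions
    d_pos = np.linalg.norm(v - ideal, axis=1)    # distance to ideal
    d_neg = np.linalg.norm(v - anti, axis=1)     # distance to anti-ideal
    return d_neg / (d_pos + d_neg)

# Toy example: three hypothetical models scored on accuracy, F1, robustness
scores = [[0.90, 0.88, 0.70],
          [0.85, 0.90, 0.80],
          [0.70, 0.65, 0.60]]
cc = topsis(scores, [0.4, 0.4, 0.2])
print(cc.argsort()[::-1])  # indices of the models, best to worst
```

The closeness coefficient rewards alternatives that are simultaneously near the per-criterion best and far from the per-criterion worst, which is why the second model (strong F1 and robustness) ranks first here despite not leading on accuracy.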
"Large Language Models (LLMs) have demonstrated remarkable capabilities in semantic understanding and text generation. However, when applied to downstream tasks such as Chinese Grammatical Error Correction (CGEC), they often suffer from over-correction issues, where grammatically correct parts are mistakenly altered. Moreover, although some existing methods aim to address over-correction in Sequence-to-Sequence (Seq2Seq) models, they are difficult to adapt to decoder-only LLMs. To address these challenges, we propose a Chunk-based Chain-of-Thought (CoT) Prompting Method. Our study is structured into three key components. Initially, we identify specific types of grammatical errors in the input sentences. Following this, sentences are segmented into smaller chunks, and each chunk is analyzed to match the detected error types. Ultimately, the aggregated information guides LLMs in performing localized correction within the input sentences. The experimental results prove the effectiveness of our method in mitigating over-correction, achieving a higher F0.5 score while maintaining robust grammatical error correction performance. This method provides innovative perspectives on employing LLMs to enhance the precision and granularity of the CGEC task."
"LLM-enhanced social robots (LLM-Bots) generate responses similar to human interactions and pose risks to social media platforms. Distinguishing AI-generated texts (AIGTs) from human-written content is important for mitigating these threats. However, current AIGT detection technologies face limitations in social media contexts, including inadequate performance on short texts, poor interpretability, and a reliance on synthetic datasets. To address these challenges, this study first constructs a social media dataset composed of 463,382 Weibo comments to capture real-world interactions between LLM-Bots and human users. Second, a stylometric feature set tailored to Chinese social media is developed. We conduct a comparative analysis of these features to reveal linguistic differences between human-written and AI-generated comments. Third, we propose a lightweight stylometric feature-based self-attention classifier (SFSC). This model achieves a strong F1-score of 91.8% for detecting AI-generated short comments in Chinese while maintaining low computational overhead. Additionally, we provide interpretable criteria for the SFSC in AIGT detection through feature importance analysis. This study advances detection for AI-generated short texts in Chinese social media."
"Linguistic acceptability judgments are essential for evaluating how language models internalize human-like grammatical knowledge. Though some studies have evaluated large language models (LLMs) in this context, existing research lacks systematic exploration of diverse learning paradigms in a multilingual setting. In this paper, we present the first multilingual evaluation of LLMs across four languages (English, Chinese, Japanese, and Russian) in the field of linguistic acceptability. Our evaluation spans both general-purpose (i.e., GPT-4o, GPT-4o mini, DeepSeek-V3, GLM-4-32B, and the Qwen series) and reasoning-oriented (QwQ-32B-Preview and DeepSeek-R1-32B) models under zero-shot and monolingual, cross-lingual, and multilingual fine-tuning settings, with comparisons to pre-trained language model (PLM) baselines. Our analysis highlights the strong generalizability of large-scale LLMs through zero-shot prompting, the challenges of fine-tuning small-sized LLMs with skewed training data, the effectiveness of multilingual fine-tuning for low-resource languages, the scaling law exhibited on the task, and the limitation of reasoning-oriented models on the task, even when “aha moments” occur during the reasoning process."
"Large language models (LLMs) have become integral components of various AI solutions, with the reinforcement learning from human feedback (RLHF) stage playing a critical role in aligning model outputs with human preferences. However, generating the human preference data required for RLHF is often costly and time-consuming due to its reliance on human evaluation. This study addresses this challenge within the dialogue scenarios of the fintech industry. We leverage rich, non-confidential, multi-turn dialogue data, such as call center dialogue records, which include associated business metrics (e.g., problem-solving rates, turnover ratios), to construct preference-aligned data. We introduce Self-Preference, an automated method for creating preference-aligned data guided by these objective business metrics. The approach involves clustering dialogue histories based on their semantic representations and calculating a well-designed conditional probability ratio that correlates sequences with business metrics to generate preference data. In contrast to traditional preference alignment data generation methods that depend on subjective human evaluations, Self-Preference significantly reduces labeling costs and mitigates model-induced biases. Experimental results indicate that models trained with Self-Preference-generated data demonstrate a strong positive correlation with target business metrics, highlighting the method's effectiveness in facilitating efficient, goal-oriented alignment of LLMs."
"Large Language Models (LLMs) have demonstrated significant potential in interpretable translation quality estimation by providing both holistic ratings and fine-grained feedback. However, state-of-the-art methods, such as GEMBA-MQM, still suffer from an excessive number of false positives in error prediction, leading to misalignment with human annotations and reducing interpretability. To address this issue, we propose MQM-MSC, a novel training-free framework that employs a mask-driven self-correction (MSC) mechanism. The core of MSC is to use masks to highlight error spans in the initial prediction, enabling the model to re-evaluate these masked portions and verify their correctness. This approach mirrors human cognitive processes: when individuals express inconsistent judgments about the same issue at different times, it often indicates that their initial assessment was flawed. Similarly, MSC exploits contradictions between two evaluations to identify and filter false positives, thereby improving the accuracy and reliability of error annotations. Experimental results show that MQM-MSC effectively reduces false positives across four LLMs and three language pairs, consistently improving the reliability and quality of error annotations in the GEMBA-MQM approach."
"Recent advances in large-scale pre-training have substantially enhanced the robustness and generalization capabilities of foundation models (e.g., Qwen3 and Llama-4). However, when fine-tuned on downstream tasks, these models often latch onto dataset-specific biases, learning spurious correlations tied to easy-to-learn but non-robust features. This undermines their performance under distribution shifts, despite strong in-distribution (ID) accuracy. Existing fine-tuning methods, including full-parameter and parameter-efficient techniques, primarily optimize for ID performance and largely overlook out-of-distribution (OOD) robustness. Meanwhile, debiasing has been explored in full fine-tuning, while debiasing strategies for Parameter-Efficient Fine-Tuning (PEFT) remain underexplored. To this end, we propose Enhanced Debiased Gradient Extraction (EDGE), a lightweight gradient projection-based method that explicitly suppresses bias-amplifying updates during the fine-tuning process. EDGE is a model-agnostic, plug-and-play debiasing method that operates without relying on predefined bias types or labels. It seamlessly integrates with both full and parameter-efficient fine-tuning and generalizes across NLP and vision tasks. Experiments on synthetic and real-world benchmarks demonstrate that EDGE effectively reduces bias and consistently improves OOD generalization, offering a unified and practical framework for robust adaptation under dataset bias."
"Abstract reasoning is a challenging task that involves identifying patterns from limited input-output grids and applying them to new grids. With the development of large language models(LLMs), recent studies attempt to transfer the problems to textual format and tackle abstract reasoning tasks using models such as GPT-4. However, the overall accuracy is still low, which also results in the poor quality of abstract reasoning data directly synthesized by GPT-4, making it unsuitable as effective fine-tuning data. In this paper, we propose mixture program-based data synthesis strategies, including low-level code-based synthesis, high-level DSL-based synthesis,and shuffle-based synthesis. Through these strategies, we construct diverse and valid abstract reasoning instruction data to help improving the general abstract reasoning ability of LLMs for multiple datasets. Experimental results show that, by supervised fine-tuning Qwen-2.5-7B on our synthesized instruction data, the resulting model shows improved abstract reasoning ability and outperforms various strong baseline LLMs, including closed-source model GPT-4 and open-source models such as LLaMA-3 and Qwen-2.5. We release the logs by GPT and our model at https://github.com/szu-tera/ARC."
"It is widely known that the first language (L1) of the English learners will influence their language study, causing them make to biased errors. However, it is relatively limited for the research of using the L1 information to improve Grammatical Error Correction (GEC) models. Among the limited research, a common method is to train a set of GEC models, and each model is trained bya corpus from one (and only one) specific L1 background. This method has been proven efficient,while the waste of the training / fine-tuning data makes it suffer from the data limitation issue.This paper introduces a novel method to address this issue by exploiting the linguistic similarities between a language family and its member languages. We expand the fine-tuning data from one specific L1 background to its language family one, making the quantity increase exponentially. We use the Italic language family corpus as our language family corpus and experiment with two approaches facing two situations, mainly differing in development data. The results show that,for the approach that uses the Italic language family corpus to be the fine-tuning data and uses the development data where the L1 background is the same as the one of the test data, the GEC models improve clearly; however, the way that influences the models is not uniform, and varies by error types."
"Rumor detection on social media has recently attracted significant attention. Due to the complex user group and lack of regulation, rumor-spreaders intentionally disseminate rumors to sway pub-lic opinion, severely harming the general interests. Existing approaches generally perform rumor detection by analyzing both image and text modalities, and pay less attention to the interaction behaviors in social media, which can assist in distinguishing rumors from normal information.Furthermore, the images associated with rumors are often inconsistent or manipulated, how to distinguish these different features and utilize them effectively has become crucial in prevent-ing the widespread dissemination of rumors. To address the aforementioned issues, we proposeCross-modal Ambiguity Learning with Heterogeneous Interaction Analysis (CAHIA) for rumor detection. Specially, we design a novel heterogeneous graph feature extractor to fully utilize the different types of behavioral patterns in social interaction networks, we design an frequency inception net to extract manipulated visual features and adopt different fusing strategies to detect various types of rumors according to the ambiguity between text and image. Finally, a hierarchical cross-modal fusing mechanism is used to simulate the process users view and determine the authenticity of posts. Extensive experiments results demonstrate that CAHIA outperforms state-of-the-art models on four large-scale datasets for rumor detection in social media."
"Evidence-based fact-checking aims to verify or debunk claims using evidence and has greatly benefited from advancements in Large Language Models (LLMs). This task relies on clarify-ing and discriminating relations between entities. However, autoregressive LLMs struggle with understanding relations presented in different orders or narratives, as their unidirectional na-ture hampers effective performance. To address this challenge, we propose a novel method that leverages bidirectional attention as an external adapter to facilitate two-way information aggregation. Additionally, we employ hierarchical sparse graphs to merge local and global information and introduce an efficient feature-compression technique to minimize the number of adapter parameters. Experimental results on both English and Chinese datasets demonstrate the significant improvements achieved by our approach, showcasing state-of-the-art performance in the evidence-based fact-checking task."
"Large Language Models (LLMs) inevitably suffer from hallucinations, as relying solely on their parametric knowledge cannot guarantee the accuracy of generated content. To enhance text generation, retrieval-augmented generation (RAG) is proposed to incorporate external knowledge to achieve this. However, its effectiveness heavily depends on the relevance of retrieved documents, which poses a critical challenge: how to ensure the accuracy and reliability of model responses when retrieval results are inaccurate. Tackling this challenge, we propose RetrievalJudgment Augmented Generation (RJAG), a method that can enhance RAG through LLM-driven fine-grained relevance judgment mechanism and a task-adaptive knowledge combination strategy. RJAG judges and dynamically combines retrieved documents for both open-ended generation and closed-ended selection tasks. Additionally, large-scale web search is also included to expand the knowledge beyond static corpora. Experimental results on multiple bench-marks show that RJAG outperforms existing RAG methods, which will significantly enhance the accuracy and reliability while maintaining the system’s simplicity. Code is available at https://github.com/wangkz2023/RJAG."
"Addressing the limitations of the Skip-gram with Negative Sampling (SGNS) model related to negative sampling, subsampling, and its fixed context window mechanism, this paper first presents an in-depth statistical analysis of the optimal solution for SGNS matrix factorization,deriving the theoretically optimal distribution for negative sampling. Building upon this analysis, we propose the concept of Global Semantic Weight (GSW), derived from Pointwise Mutual Information (PMI). We integrate GSW with word frequency information to improve the effectiveness of both negative sampling and subsampling. Furthermore, we design dynamic adjustment mechanisms for the context window size and the number of negative samples based on GSW, enabling the model to adaptively capture contextual information commensurate with the semantic importance of the center word. Notably, our optimized model maintains the sametime complexity as the original SGNS implementation. Experimental results demonstrate that our proposed model achieves competitive performance aganist state-of-the-art word embedding models including SGNS, CBOW, and GloVe, across multiple benchmark tasks.Compared with the current mainstream dynamic word vector models, this work emphasizes achieving a balance between efficiency and performance within a static embedding framework, and provides potential supplementation and support for complex models such as LLMs."
"In real world, large language models (LLMs) can serve as the assistant to help users accomplish their jobs, and also support the development of advanced applications. For the wide application ofLLMs, the inference efficiency is an essential concern, which has been widely studied in existing work, and numerous optimization algorithms and code libraries have been proposed to improve it.Nonetheless, users still find it challenging to compare the effectiveness of all the above method sand understand the underlying mechanisms. In this work, we propose a coarse-to-fine method that encompasses both experimental and analytical components. This method can be applied across various models and inference libraries. Specifically, we examine four usage scenarios within two practical applications. We further provide both theoretical and empirical fine-grained analyses of each module in the Transformer architecture. Our methods can be a general and invaluable method for researchers to evaluate various code libraries and improve inference strategies across different LLMs. We open-source the supporting dataset, code, and evaluation scripts at the link:https://github.com/RUCAIBox/Inference-Efficiency-Evaluation."
"This study aims to test how large language models (LLMs) understand gradable adjectives and whether their understanding compares with humans, under the framework of formal semantics.We introduce a diagnostic dataset, referred to as the Modifier-Adjective Scale Probe (MASP),to evaluate how well LLMs understand a gradable adjective (e.g., long) when the adjective is combined with one modifier (e.g., very long or slightly long, a condition referred to as degree modification) or is further negated (e.g., very not long and not very long, a condition referred to as compositional negation). The dataset consists of over 80,000 natural language inference questions in both Chinese and English. We apply the MASP dataset to test both humans and11 popular LLMs, including GPT-4o and Gemini-2.0-Flash. The results show that most LLMscan correctly understand whether a modifier boosts (e.g., very) an adjective. However, they fail to understand the modifiers that weaken the degree and the negation forms of modifiers.Furthermore, we parameterize the human and LLM behavior, and find that the judgment patterns of LLMs differ from humans especially in the Chinese tests. These findings suggest that LLM sare still not well aligned with humans in terms of the interpretation of simple adjective phrases,and MASP provides a new approach to quantify the interpretation of adjective phrases in LLMs."
"The goal of this work is zero-shot visual voice cloning (ZS-V2C), which aims to generate speech samples with unseen speaker identity and prosody derived from a video clip and an acoustic reference. ZS-V2C presents greater challenges as: 1) unseen speaker modeling; and 2) unseen prosody modeling. Unlike previous works, we propose a novel ZS-V2C framework that incorporates a hierarchical face-styled diffusion model (HFSD-V2C). Specifically, first, we leverage cross-modal biometrics to predict unseen speaker embeddings based on facial features. Then, we jointly model the unseen prosodic features at the text, speech and video levels. Finally, a diffusion model is constructed based on the embeddings of the unseen speaker and prosodic features,enabling the generation of expressive and diverse speech. Extensive experiments on the LRS2and GRID benchmark dataset demonstrate the superior performance of our proposed method."
"The end-to-end speech translation task involves directly transforming speech into the text of another language, bypassing the generation of an intermediate transcription. However, existing methods may lose key information during cross-modal length alignment and fail to effectively integrate different representations, resulting in low quality of the fused representation. To address these issues, we propose an efficient method named CRAF for effective cross-modal alignment and fusion for speech translation, which reduces information loss and enhances the integration of cross-modal representations. First, CRAF minimizes information loss by improving the cross-modal length alignment, ensuring the alignment process retains more critical information from the speech modality. Second, CRAF strengthens the integration of cross-modal representations by allowing the model to combine complementary features from diverse modalities, enhancing its capacity to concentrate on the most pertinent and critical information. Finally, we evaluateCRAF by conducting extensive experiments on eight language pairs from the MuST-C dataset.Experiments show that the average BLEU score of CRAF achieves 29.0, outperforming other comparison methods. Our code is available at https://github.com/wu-wen-zhou/first/tree/master."
"Although Large Language Models (LLMs) have demonstrated strong instruction-following abil-ity, they are further supposed to be controlled and guided by inferential rules in real-world scenarios to be safe, accurate, and intelligent. This demands the possession of inferential rule-following capability of LLMs. However, no prior work has made a clear evaluation of the inferential rule-following capability of LLMs. Previous studies that try to evaluate the inferential rule-following capability of LLMs fail to distinguish the inferential rule-following scenarios from the instruction-following scenarios. Therefore, this paper first clarifies the concept of inferential rule-following and proposes a comprehensive benchmark, RuleBench, to evaluate a diversified range of inferential rule-following abilities. Our experimental results on a variety of LLMs show that they are still limited in following rules. Our analysis based on the evaluation results provides insights into the improvements for LLMs toward a better inferential rule-following intelligent agent. We further propose Inferential Rule-Following Tuning (IRFT). The experimental results show that through IRFT, LLMs can learn abstract inferential rule-following abilities from purely synthetic data and then generalize to RuleBench. The data and code can be found at:https://gitee.com/forangel2014/llm-rule-following-code"
"This paper addresses the challenges of data scarcity and limited speaker resources in Lao-English code-switched speech synthesis. We propose a neural encoder-decoder-based method for mixed-lingual speech synthesis. The method first extracts phoneme-level speech representations and employs a dot-product attention mechanism to map Lao and English phonemes into a shared la-tent space, thereby enhancing the model’s capability to represent cross-lingual phonetic information. In addition, language ID embedding module is extended to explicitly indicate the language of each input token, helping the model distinguish and adapt to language-specific pronunciation characteristics. Experiments are conducted on the open-source English dataset LibriTTS anda proprietary Lao speech corpus. Both subjective evaluations (MOS, AB preference tests) and objective metrics (RMSE) demonstrate that the proposed approach significantly outperforms the baseline VALL-E X model in terms of naturalness and language-switching fluency. Furthermore, ablation studies confirm that both the shared phoneme latent space and the language ID mod-ule play critical roles in improving synthesis quality. This approach offers a novel solution for integrating low-resource languages into mixed-lingual speech synthesis."
"Textual data often contain biases that compromise fairness in AI systems, particularly in sensitive areas such as gender, race, and politics. While large language models (LLMs) have shown success across various tasks, they still face limitations due to inherent biases within the model sand restrictive safety policies that hinder direct bias mitigation. To overcome these challenges,we propose UMAD (Unsupervised Multi-Agent Debate), a novel framework that leverages aMulti-Agent Debate mechanism alongside Best-Worst Scaling (BWS) to foster more effective discussions among LLMs, facilitating the identification of biases. By combining this with gradient-based interpretation techniques, UMAD extracts token-level bias insights, which are then integrated into models using in-context learning. This enhances the debiasing performance, as shown by our experiments across three bias categories—gender, religion, and politics—using five different LLMs. Our approach demonstrates significant improvements in metrics, with large models matching or even surpassing GPT-4 in Style Accuracy (STA). We release our code at:https://github.com/Couen/UMAD.git."
This paper investigates domain adaptation in Chinese Spelling Correction (CSC) based on the instruction-following ability of large language models (LLMs). In the instructions, we include a variety of domain-specific requirements for spelling correction, such as the domain's formality or writing tone, which go beyond the considerations of previous CSC research. To evaluate the LLMs' performance on instruction-following, we propose IDSpell, a semi-supervised construction pipeline for a CSC dataset containing a wide range of domain-specific sentences along with specific instructions. We construct a dataset with IDSpell and evaluate it on Qwen2.5 and GPT-4o, where we find that instructions exert a meaningful influence on correction, increasing the average F1 score by 10.4% compared to when the instructions are not provided. To further enhance the result, we propose Contrastive Prompting, a method incorporating contrastive false examples into the prompt to better guide the model to understand the instruction. Experiments demonstrate that our method outperforms baseline prompting with an average improvement of 5.4%. Our dataset and code are publicly available for further research.
"In recent years, dialogue summarization has emerged as a rapidly growing area of research in natural language processing. Dialogue summarization is challenging due to dispersed key information, redundant expressions, ambiguous topic identification, and difficult content selection.To address these challenges, we propose an innovative approach to dialogue summarization that integrates topic segmentation and graph-structured modeling. Specifically, we first per-form topic segmentation of the dialogue through clustering and quantify the key information in each utterance, thereby capturing the dialogue topics more effectively. Then, a redundancy graph and a keyword graph are constructed to suppress redundant information and extract key content, thereby enhancing the conciseness and coherence of the summary. Evaluations were conducted on the DialogSum, SAMSum, CSDS, and NaturalConv datasets. The experimental results demonstrate that the proposed method significantly outperforms existing benchmark mod-els in terms of summary accuracy and information coverage. The Rouge-1 scores achieved were 48.03%, 53.75%, 60.78%, and 81.48%, respectively, validating its effectiveness in the dialogue summarization task. Our code is available at https://anonymous.4open.science/r/TAG-E64A."
"This paper introduces DualReward, a novel reinforcement learning framework for automatic dis-tractor generation in cloze tests. Unlike conventional approaches that rely primarily on super-vised learning or static generative models, our method employs a dual reward structure with adaptive scaling that differentiates between human-created gold standard distractors and model-generated candidates. The framework dynamically adjusts reward signal intensity based on model performance and confidence. We evaluate our approach on both passage-level (CLOTH-F) and sentence-level (MCQ) cloze test datasets, demonstrating consistent improvements overstate-of-the-art baselines. Experimental results show that our adaptive reward scaling mechanism provides modest but consistent benefits on homogeneous datasets (CLOTH-F) and more substantial improvements (3.48-3.86% in P@1) on diverse, cross-domain data (MCQ), suggest-ing its particular effectiveness for handling varied question types and domains. Our work offers a flexible framework that effectively balances learning from reliable human examples while exploring novel, high-quality distractors for automated test generation."
"Large language models (LLMs) have demonstrated the ability to improve human efficiency through conversational interactions. Conventional LLM-powered dialogue systems, operating ona turn-based paradigm, preclude real-time interaction during response generation. To address this limitation, researchers have proposed duplex models. These models can dynamically adapt to user input, facilitating real-time interactive feedback. However, these methods typically require substantial computational resources to acquire the duplex capability. To reduce overhead, this paper presents a new duplex decoding approach that enhances LLMs with duplex ability, requiring minimal additional training. Specifically, our method employs parallel decoding of input and responses in conversations, effectively implementing a channel-division-multiplexing decoding strategy. Experimental results indicate that our proposed method significantly enhances the naturalness and human-likeness of user-AI interactions with minimal training costs."
"Recent advancements in Large Language Models (LLMs) have markedly improved SQL generation. Nevertheless, existing approaches typically rely on single-model designs, limiting their capacity to effectively handle complex user queries. In addition, current methods often face difficulties in selecting the optimal SQL from multiple candidates. To mitigate these limitations,this study presents DSMR-SQL, a two-stage framework consisting of: (1) Dual-Strategy SQLGeneration: DSMR-SQL aims to produce a broader spectrum of SQL queries by using multiple models with two strategies: Supervised Fine-Tuning and In-Context Learning; (2) Multi-RoleSQL Selection: DSMR-SQL seeks to identify the SQL most aligning with user intent by introducing a collaborative framework involving three roles (i.e., Proposer, Critic, Summarizer).Extensive experiments on various datasets substantiate the efficacy of DSMR-SQL in enhancing SQL generation."
"The performance of neural machine translation relies on a large amount of data, but crawled sentence pairs are of different quality. The low-quality sentence pairs may provide helpful translation knowledge but also teach the model to generate low-quality translations. Making the model aware of the quality of training instances may help the model distinguish between good and bad translations while leveraging the translation knowledge. In this paper, we evaluate the quality of training instances with the average per-token loss (negative log-likelihood) from translation mod-els, convert the quality scores into embeddings through vector interpolation and feed the quality embedding into the translation model during its training. We ask the model to decode with the best quality score to generate good translations during inference. Experiments on the IWSLT 14 German to English, WMT 14 English to German and WMT 22 English to Japanese translation tasks show that our method can effectively lead to consistent and significant improvements across multiple metrics."