Proceedings of the 22nd Chinese National Conference on Computational Linguistics

Maosong Sun, Bing Qin, Xipeng Qiu, Jing Jiang, Xianpei Han (Editors)


Anthology ID:
2023.ccl-1
Month:
August
Year:
2023
Address:
Harbin, China
Venue:
CCL
SIG:
Publisher:
Chinese Information Processing Society of China
URL:
https://aclanthology.org/2023.ccl-1
DOI:
PDF:
https://aclanthology.org/2023.ccl-1.pdf

pdf bib
基于推理链的多跳问答对抗攻击和对抗增强训练方法(Reasoning Chain Based Adversarial Attack and Adversarial Augmentation Training for Multi-hop Question Answering)
Jiayu Ding (丁佳玙) | Siyuan Wang (王思远) | Zhongyu Wei (魏忠钰) | Qin Chen (陈琴) | Xuanjing Huang (黄萱菁)

“This paper proposes an adversarial attack method based on multi-hop reasoning chains: adversarial attack text is inserted into the input, and the accuracy of the QA model's answers under this interference is measured, in order to probe whether the model truly performs multi-hop reasoning and to assess its interpretability. The method first extracts the reasoning chain from the question entity to the answer entity out of the input text, classifies multi-hop questions into different reasoning types based on features of the chain, and proposes a model that automates question decomposition and reasoning-type prediction; it then modifies the original question according to its reasoning type to construct adversarial distractor sentences. Adversarial attack experiments on several multi-hop QA models show that the performance of all models drops significantly, validating the effectiveness of the attack and exposing shortcomings of current QA models. After the original training set is augmented with adversarial samples, model performance recovers, demonstrating that the proposed adversarial augmentation training improves model robustness.”

pdf bib
基于不完全标注的自监督多标签文本分类(Self-Training With Incomplete Labeling For Multi-Label Text Classification)
Junfei Ren (任俊飞) | Tong Zhu (朱桐) | Wenliang Chen (陈文亮)

“Multi-Label Text Classification (MLTC) aims to select one or more categories corresponding to a text from a predefined candidate label set, and is a fundamental task in Natural Language Processing (NLP). Most previous work relies on well-curated, comprehensively annotated datasets, which require strict quality control and are generally hard to obtain. In real annotation, some relevant labels are inevitably missed, leading to the incomplete-labeling problem. To address this, we propose a self-training framework for partially annotated data (Partial Self-Training, PST), in which a teacher model automatically assigns pseudo-labels to large-scale unlabeled data and supplements missing labels for incompletely annotated data; these data are then used in turn to update the teacher model. Experiments on synthetic and real datasets show that the proposed PST framework is compatible with existing multi-label text classification models and mitigates the impact of incompletely labeled data.”

pdf bib
融合汉越关联关系的多语言事件观点对象识别方法(A Multilingual Event Opinion Target Recognition Method Incorporating Chinese and Vietnamese Association Relations)
Gege Li (李格格) | Junjun Guo (郭军军) | Zhengtao Yu (余正涛) | Yan Xiang (相艳)

“Opinion target recognition for Vietnamese is an important part of Vietnamese event opinion analysis. Differences in grammatical structure between Chinese and Vietnamese make multilingual event association complex and opinion targets hard to represent. Existing methods can only produce Chinese-Vietnamese bilingual representations and fail to effectively capture and exploit the associations between elements of Chinese and Vietnamese bilingual events. This paper therefore proposes a multilingual event opinion target recognition method that incorporates Chinese-Vietnamese association relations: element co-occurrence and overall semantic association between Chinese and Vietnamese events are used to build a Chinese-Vietnamese multilingual event representation network; a multilingual pre-trained language model produces feature vectors for element nodes; and a graph convolutional network aggregates node information to obtain a common representation of the two languages in the same semantic space, enabling recognition of Chinese and Vietnamese event opinion targets. Experimental results show that the model constructs multilingual association information more effectively, with clear F1 improvements over multiple baselines.”

pdf bib
基于网络词典的现代汉语词义消歧数据集构建(Construction of a Modern Chinese Word Sense Dataset Based on Online Dictionaries)
Fukang Yan (严福康) | Yue Zhang (章岳) | Zhenghua Li (李正华)

“Word sense disambiguation, one of the most classic tasks in natural language processing, aims to identify the correct sense of a polysemous word in a given context. Polysemy is more pervasive in Chinese than in English, yet few publicly released Chinese word sense disambiguation datasets exist. We crawled and merged two public online dictionaries and selected 1,083 words with their senses as annotation targets. We then extracted relevant sentences from web data and specialized corpora. Finally, the data were manually annotated by multiple annotators and reviewed by experts. The dataset contains nearly 20,000 sentences, i.e., about 20 sentences per word on average. We split the dataset into training, validation, and test sets and compare multiple models experimentally.”

pdf bib
基于多意图融合框架的联合意图识别和槽填充(A Multi-Intent Fusion Framework for Joint Intent Detection and Slot Filling)
Shangjian Yin (尹商鉴) | Peijie Huang (黄沛杰) | Dongzhu Liang (梁栋柱) | Zhuoqi He (何卓棋) | Qianer Li (黎倩尔) | Yuhong Xu (徐禹洪)

“In recent years, multi-intent spoken language understanding (SLU) has become a research hotspot in natural language processing. State-of-the-art multi-intent SLU models adopt a graph-interactive framework for joint multiple intent detection and slot filling, effectively capturing fine-grained intent information for token-level slot filling and achieving good performance. However, they ignore the rich information carried by jointly acting intents and do not fully exploit multi-intent information to guide the slot filling task. This paper proposes a Multi-Intent Fusion Framework (MIFF) for joint multiple intent detection and slot filling, which enables the model to accurately identify different intents while using intent information to provide fuller guidance for slot filling. Experiments on two public datasets, MixATIS and MixSNIPS, show that our model surpasses current state-of-the-art methods in both performance and efficiency, and generalizes effectively from single-domain to multi-domain datasets.”

pdf bib
基于词频效应控制的神经机器翻译用词多样性增强方法(Improving Word-level Diversity in Neural Machine Translation by Controlling the Effects of Word Frequency)
Xuewen Shi (史学文) | Ping Jian (鉴萍) | Yikun Tang (唐翼琨) | Heyan Huang (黄河燕)

“Neural machine translation (NMT) optimized by maximum likelihood estimation suffers from problems such as non-maximizable tokens and poor accuracy on low-frequency words, which leaves the generated translations lacking word-level diversity. The unbalanced distribution of word frequency in the training data is one cause of this phenomenon. This paper aims to alleviate these problems by limiting the influence of word frequency on the probabilities estimated during NMT decoding. Specifically, we adopt a half-sibling regression denoising framework grounded in causal inference theory, combined with a proposed adaptive denoising coefficient, to control the effect of word frequency on the model's estimated probabilities, thereby obtaining more accurate estimates and enriching the lexical diversity of NMT output. Experiments are conducted on four translation tasks of different resource scales: Uyghur-Chinese, Chinese-English, English-German, and English-French. The results show that the proposed method improves the word-level diversity of NMT output without harming translation quality. The method is also model-agnostic and highly interpretable.”

pdf bib
基于语音文本跨模态表征对齐的端到端语音翻译(End-to-end Speech Translation Based on Cross-modal Representation Alignment of Speech and Text)
Guojiang Zhou (周国江) | Ling Dong (董凌) | Zhengtao Yu (余正涛) | Shengxiang Gao (高盛祥) | Wenjun Wang (王文君) | Houli Ma (马候丽)

“End-to-end speech translation must map source-language speech to target-language text across both languages and modalities. With limited labeled data, building a unified mapping between speech and text representations and mitigating the cross-modal gap are key to improving speech translation performance. This paper proposes a cross-modal representation alignment method for speech and text: speech and text representations are aligned at multiple granularities and mixed as parallel inputs, and multi-task training is fused under consistency constraints on the multimodal representations. Experiments on the MuST-C dataset show that the proposed method outperforms existing cross-modal representation methods for end-to-end speech translation, effectively improving the model's cross-modal mapping ability and translation performance.”

pdf bib
基于离散化自监督表征增强的老挝语非自回归语音合成方法(A Discretized Self-Supervised Representation Enhancement based Non-Autoregressive Speech Synthesis Method for Lao Language)
Zijian Feng (冯子健) | Linqin Wang (王琳钦) | Shengxiang Gao (高盛祥) | Zhengtao Yu (余正涛) | Ling Dong (董凌)

“Speech synthesis for Lao is of great significance to cooperation and exchange between China and Laos, but Lao pronunciation is complex, with tones, syllables, and phonemes as distinctive features, and existing speech synthesis methods perform unsatisfactorily on Lao. Autoregressive models built on attention mechanisms struggle to fit complex Lao speech; they generalize poorly, are prone to catastrophic errors such as dropped or skipped characters, and produce audio lacking naturalness and fluency. This paper proposes a non-autoregressive speech synthesis method for Lao enhanced by discretized self-supervised representations. Combining the linguistic and phonetic characteristics of Lao, we build a non-autoregressive acoustic model using phoneme-level annotated duration information, and use a self-supervised pre-trained speech model to extract discretized representations of speech content and tone information, which are fused into the acoustic model to strengthen its speech generation ability and improve the fluency and naturalness of the synthesized audio. Experiments show that the synthesized audio reaches a MOS of 4.03, and that the discretized self-supervised representation-enhanced non-autoregressive modeling better captures Lao speech characteristics at fine-grained levels such as tone, phoneme duration, and pitch.”

pdf bib
面向机器翻译的汉英小句复合体转换生成能力调查(Investigation of the Clause Complexes Transfer and Generation Capability from Chinese to English for Machine Translation)
Fukun Xing (邢富坤) | Jianing Xu (徐佳宁)

“Clause complexes are formed by combining clauses, and languages differ in their clause combination patterns; how these differences affect machine translation remains unclear. Taking Chinese-English machine translation as an example, this paper selects multi-register Chinese clause complexes and expert translations, and investigates mainstream machine translation systems as well as ChatGPT from two aspects: topic-sharing relations and sharing types. The results show that, compared with expert translations, machine translation falls considerably short in transferring and generating clause complexes: it is weak in topic completion, conversion, and distillation; its clause combination patterns are monotonous and carry clear traces of the Chinese source; and the idiomaticity of the translations suffers considerably.”

pdf bib
基于端到端预训练模型的藏文生成式文本摘要(Abstractive Summarization of Tibetan Based on end-to-end Pre-trained Model)
Shuo Huang (黄硕) | Xiaodong Yan (闫晓东) | Xinpeng Ouyang (欧阳新鹏) | Jinpeng Yang (杨金鹏)

“In recent years, pre-trained language models have received wide attention and greatly advanced the application of natural language processing to different downstream tasks. Text summarization, an important branch of natural language processing, can effectively reduce redundant information and speed up reading. Tibetan is a low-resource language lacking large-scale training corpora, and research on abstractive Tibetan summarization is still in its infancy. To address this, we are the first to apply the end-to-end pre-trained language model CMPT (Chinese Minority Pre-Trained Language Model) to abstractive Tibetan text summarization. CMPT is pre-trained with denoising and contrastive learning on texts of other low-resource languages; to improve the encoder's comprehension, a single-layer masked language model (MLM) decoder is added on top of the encoder output for joint Seq2Seq pre-training of generation and understanding. Further fine-tuning effectively improves performance on Tibetan text summarization. To verify the model, we experiment on a self-built dataset of 50,000 Tibetan text-summary pairs and on the public dataset Ti-SUM; on both datasets, the proposed method yields significant improvements on the evaluation metrics for abstractive Tibetan summarization. The method is not limited to Tibetan summarization and can be extended to summarization in other languages, giving it good practical value.”

pdf bib
融合多粒度特征的缅甸语文本图像识别方法(Burmese Language Recognition Method Fused with Multi-Granularity Features)
Enyu He (何恩宇) | Rui Chen (陈蕊) | Cunli Mao (毛存礼) | Yuxin Huang (黄于欣) | Shengxiang Gao (高盛祥) | Zhengtao Yu (余正涛)

“Burmese is a low-resource Southeast Asian language, and Burmese text image recognition is important for tasks such as Burmese machine translation. Because Burmese is a typical character-combining language, multiple characters can be nested within one receptive field; existing Burmese recognition methods operate mainly at character granularity, so some characters fail to be recognized correctly at decoding time, producing locally garbled output. Considering Burmese's special character combination rules, this paper proposes a Burmese text image recognition method that fuses multi-granularity features: sequences are modeled at the finer character granularity and the coarser character-cluster granularity, the two feature sequences are fused, and a decoder then decodes the result. Experimental results show that the method effectively mitigates garbled recognition output; on a manually constructed dataset it improves recognition accuracy by 2.4% over a ‘VGG16+BiLSTM+Transformer’ baseline, reaching 97.35%.”

pdf bib
TiKEM:基于知识增强的藏文预训练语言模型(TiKEM: Knowledge Enhanced Tibetan Pre-trained Language Model)
Junjie Deng (邓俊杰) | Long Chen (陈龙) | Yan Zhang (张廷) | Yuan Sun (孙媛) | Xiaobin Zhao (赵小兵)

“Pre-trained language models perform excellently for Chinese and English, but data for low-resource languages are hard to obtain, and research on pre-trained language models for low-resource languages such as Tibetan has only made initial progress. Existing Tibetan pre-trained language models learn self-supervisedly from large unstructured text corpora, lack guidance from external knowledge, and are limited in knowledge memorization and knowledge reasoning. To solve these problems, we build a knowledge-enhanced Tibetan pre-training dataset containing 500,000 triples and jointly train on structured knowledge representations and unstructured text representations to obtain TiKEM, a knowledge-enhanced Tibetan pre-trained language model with improved knowledge memorization and reasoning abilities. Finally, we verify the model's effectiveness on three downstream tasks: text classification, entity relation classification, and machine reading comprehension.”

pdf bib
TiKG-30K:基于表示学习的藏语知识图谱数据集(TiKG-30K: A Tibetan Knowledge Graph Dataset Based on Representation Learning)
Wenhao Zhuang (庄文浩) | Ge Gao (高歌) | Yuan Sun (孙媛)

“Representation learning for knowledge graphs aims to learn the complex semantic associations in knowledge graph data by mapping entities and relations into a low-dimensional vector space, supporting research on information retrieval, question answering, knowledge reasoning, and more. Current research focuses mainly on languages such as English and Chinese, where public high-quality datasets (e.g., FB15k-237, WN18RR) have played a vital role. For low-resource languages such as Tibetan, however, related research is still in its infancy due to the lack of public knowledge graph datasets. This paper therefore presents TiKG-30K, a public Tibetan knowledge graph dataset containing 146,679 triples, 30,986 entities, and 641 relations, applicable to knowledge graph representation learning and downstream tasks. To address the small size and sparsity of existing Tibetan knowledge graphs, we exploit coreference among entities in Tibetan triples, expand the knowledge base with the rich knowledge bases of other languages and non-textual media, and optimize the graph at multiple levels through cross-lingual near-synonym retrieval, merging of synonymous entities and relations, and correction of erroneous triples, finally obtaining TiKG-30K. We evaluate multiple classic representation learning models on TiKG-30K and compare against the English datasets FB15k-237 and WN18RR and the Tibetan dataset TD50K; the results show that TiKG-30K is comparable to FB15k-237 and WN18RR. TiKG-30K is publicly available at http://tikg-30k.cmli-nlp.com.”

pdf bib
噪声鲁棒的蒙古语语音数据增广模型结构(Noise robust Mongolian speech data augmentation model structure)
Zhiqiang Ma (马志强) | Jiaqi Sun (孙佳琦) | Jinyi Li (李晋益) | Jiatai Wang (王嘉泰)

“Mongolian speech corpora lack diversity. Spending manpower and funds to collect data can increase the amount of speech to some extent, but the whole process is very time-consuming. Data augmentation can address this scarcity, but the environmental noise contained in the training data of augmentation models cannot be controlled, leaving background noise in the augmented speech. This paper proposes a speech data augmentation method combining TTS and speech enhancement, which enhances speech in both the frequency and time domains based on spectrograms. Multiple experiments show that the pass rate of the augmented Mongolian speech reaches 70%, CBAK and COVL of the augmented speech decrease by 0.66 and 0.81 respectively, and WER and SER decrease by 2.75% and 2.05%.”

pdf bib
基于数据增强的藏文机器阅读有难度问题的生成(Difficult Question Generation of Tibetan Machine Reading Based on Data Enhancement)
Zhengcuo Dan (旦正错) | Long Chen (陈龙) | Junjie Deng (邓俊杰) | Xian Pang (庞仙) | Yuan Sun (孙媛)

“Question generation is a subtask of machine reading comprehension dataset construction: given a context with (or without) an answer, the computer generates fluent questions. For Chinese and English, end-to-end question generation models are well developed and large numbers of high-quality QA pairs have been constructed. But for low-resource languages such as Tibetan, data-driven tasks such as machine reading comprehension and question answering still commonly suffer from scarce data and overly simple QA pairs. This paper therefore proposes three methods for generating difficult questions for Tibetan machine reading: (1) generating unanswerable questions by masking and replacing keywords with a Tibetan pre-trained language model; (2) generating unanswerable questions by crossing questions from similar passages; (3) generating knowledge-reasoning questions from triples. Experiments on the constructed dataset show that machine reading comprehension datasets containing unanswerable and knowledge-reasoning questions place higher demands on models' comprehension abilities. In addition, the quality of the constructed unanswerable questions is verified along three dimensions: readability, relevance, and answerability.”

pdf bib
融合预训练模型的端到端语音命名实体识别(End-to-End Speech Named Entity Recognition with Pretrained Models)
Tianwei Lan (兰天伟) | Yuhang Guo (郭宇航)

“Speech Named Entity Recognition (SNER) aims to identify the boundaries, types, and content of named entities directly from audio, and is an important task in spoken language understanding. Recognizing entities directly from speech, i.e., the end-to-end approach, is currently the mainstream method for SNER. However, SNER training corpora are scarce, and end-to-end models suffer from the following problems: (1) recognition quality drops sharply in cross-domain settings; (2) homophones and similar phenomena cause named entities to be missed or mislabeled, further hurting recognition accuracy. For problem (1), we propose using a pre-trained entity recognition model to construct training corpora for speech entity recognition. For problem (2), we propose rescoring the N-best list of the SNER system with a pre-trained language model, using the external knowledge in the pre-trained model to help the end-to-end model pick the best result. To verify domain transfer ability, we annotated MAGICDATA-NER, a small spoken-language dataset, on which our method improves F1 by 43.29% over traditional methods.”
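As an illustration of the N-best rescoring idea described above, the sketch below scores each hypothesis with a pretrained causal language model and keeps the one with the lowest average negative log-likelihood; the model name and the toy N-best list are placeholder assumptions, not the paper's setup.

```python
# Hypothetical sketch: rescore an N-best list with a pretrained causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder LM
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def lm_score(text: str) -> float:
    """Average negative log-likelihood of `text` under the LM (lower = better)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean NLL over tokens
    return loss.item()

nbest = ["find flights to new york", "fined flights to new york"]  # toy hypotheses
best = min(nbest, key=lm_score)  # keep the hypothesis the LM finds most plausible
print(best)
```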

pdf bib
基于词向量的自适应领域术语抽取方法(An Adaptive Domain-Specific Terminology Extraction Approach Based on Word Embedding)
Xi Tang (唐溪) | Dongchen Jiang (蒋东辰) | Aoyuan Jiang (蒋翱远)

“Terminology distribution is long-tailed. To effectively extract low-frequency terms, this paper proposes an adaptive terminology extraction method based on word embeddings. The method uses a hypothesis-testing-based statistical procedure to determine the filtering threshold adaptively, obtaining candidate terms by progressively merging strongly associated character strings in the text, thereby avoiding the loss of low-frequency terms caused by a fixed threshold. It then obtains word embeddings for out-of-vocabulary candidate terms with a masked language model, assigns candidates to domain clusters via a density-based clustering algorithm fused with dictionary knowledge, and accepts candidates belonging to the target domain cluster as domain terms. Experimental results show that our method not only outperforms the compared methods in F1 score but is also more stable across texts of different genres. The method extracts low-frequency terms comprehensively and effectively, achieving high-quality domain terminology extraction.”

pdf bib
基于句法特征的事件要素抽取方法(Syntax-aware Event Argument Extraction)
Zijian Yu (余子健) | Tong Zhu (朱桐) | Wenliang Chen (陈文亮)

“Event Argument Extraction (EAE) aims to extract event participants from unstructured text. The encoder-decoder framework is a common strategy for this task, but most previous work feeds only token-level information to the encoder, leaving models weak at generalization and at handling long-range dependencies. This paper proposes an EAE model that incorporates syntactic information. We first obtain constituency parse trees of the text and encode POS tags and the constituent labels of tree nodes to strengthen the model's text representations. We then propose a tree-structured attention mechanism (Tree-Attention) that helps the model perceive structured semantic information and better handle long-distance dependencies. Experimental results show the proposed method improves F1 by 2.02% over the baseline system, demonstrating its effectiveness.”

pdf bib
相似音节增强的越汉跨语言实体消歧方法(Similar syllable enhanced cross-lingual entity disambiguation for Vietnamese-Chinese)
Yujuan Li (李裕娟) | Ran Song (宋燃) | Cunli Mao (毛存礼) | Yuxin Huang (黄于欣) | Shengxiang Gao (高盛祥) | Shan Lu (陆杉)

“Cross-lingual entity disambiguation finds the target-language entity corresponding to a mention in a source-language sentence and underpins cross-lingual natural language processing tasks. Existing methods work well for resource-rich languages but poorly for resource-scarce ones, and Vietnamese-Chinese is a typical low-resource pair. Moreover, Chinese and Vietnamese are non-cognate languages with large differences, making cross-lingual representation difficult, so existing methods are hard to apply to Vietnamese-Chinese entity disambiguation. In fact, Chinese and Vietnamese share similar syllabic characteristics, which can strengthen Vietnamese-Chinese cross-lingual entity representations. To better fuse syllable features, we propose a similar-syllable-enhanced Vietnamese-Chinese cross-lingual entity disambiguation method, mitigating the poor performance caused by data scarcity and language differences. Experiments show the proposed method outperforms existing entity disambiguation methods, improving R@1 by 5.63%.”

pdf bib
英汉动物词的认知属性计量研究(Quantitative studies of cognitive attributes of English and Chinese animal words)
Ling Hua (华玲) | Bin Li (李斌) | Minxuan Feng (冯敏萱) | Haibo Kuang (匡海波)

“Animal words carry a great deal of human socio-cognitive mapping, and different peoples perceive the same word both similarly and differently. Studying the cognitive differences of animal words through metaphor has become popular in recent years, and cognitive attributes, which reflect people's cognitive impressions of words, offer a convenient entry point. This paper selects 54 animals from the Cognitive Attribute Database of Traditional Chinese Cultural Terms and, with the help of Chinese and English cognitive attribute databases, comparatively analyzes differences in cognitive attributes between English and Chinese. We find clear differences between the English and Chinese cognitive attributes of animal words, with the differences concentrated in subjective attributes, and we identify the overall similarities and differences in the cognitive attributes of animal words in the two languages.”

pdf bib
融合词典信息的古籍命名实体识别研究(A Study on the Recognition of Named Entities of Ancient Books Using Lexical Information)
Wenjun Kang (康文军) | Jiali Zuo (左家莉) | Anquan Jie (揭安全) | Wenbin Luo (罗文兵) | Mingwen Wang (王明文)

“Named entity recognition for ancient texts has clear practical significance for building entity knowledge bases and corpora of ancient books. Research on this task is scarce, mainly because sufficient training corpora are lacking. Starting from the Zizhi Tongjian, we manually built a named entity recognition dataset for ancient texts and use it to study the task. Given that ancient Chinese mostly expresses meaning with single characters and contains many elisions, we use pre-trained word embeddings as dictionary information to fully exploit the lexical knowledge they contain. Experiments show that this method effectively handles person-name entity recognition in ancient texts.”

pdf bib
结合全局对应矩阵和相对位置信息的古汉语实体关系联合抽取(Joint Extraction of Ancient Chinese Entity Relations by Combining Global Correspondence Matrix and Relative Position Information)
Yiyu Hu (胡益裕) | Jiali Zuo (左家莉) | Xueqiang Zeng (曾雪强) | Zhongying Wan (万中英) | Mingwen Wang (王明文)

“Entity relation extraction is an important task in information extraction. Current work focuses mainly on English and modern Chinese, while datasets and methods for ancient Chinese have received little study. To address this, after studying the open-source Zizhi Tongjian corpus, we manually built an ancient Chinese entity relation dataset and designed a joint entity-relation extraction method that combines a global correspondence matrix with relative position information. Experiments on our dataset demonstrate the method's effectiveness for ancient Chinese entity relation extraction.”

pdf bib
数字人文视域下的青藏高原文旅知识图谱构建研究——以塔尔寺为例(Research on the Construction of Cultural and Tourism Knowledge Atlas on the Qinghai-Tibet Plateau from the Perspective of Digital Humanity——A case study of Kumbum Monastery)
Xinhao Li (李鑫豪) | Weina Zhao (赵维纳) | Wanyi Zhao (赵婉亦) | Chaoqun Li (李超群)

“The diverse ethnic composition and long history of the Qinghai-Tibet region have nurtured a rich and distinctive culture, making this snowy land a veritable treasure house of plateau culture. Constrained by poor transportation and a lagging economy, however, the protection and promotion of the region's cultural tourism resources have long fallen behind. Guided by digital humanities, this paper extracts entities and relations from text via joint learning under a prompt-learning framework, achieving knowledge extraction under low-resource conditions and forming a construction paradigm for cultural tourism knowledge graphs. Taking Kumbum Monastery, a national key cultural relic protection site, as a representative case, we describe the full pipeline of the Kumbum Monastery knowledge graph, from ontology design, raw data acquisition, and knowledge extraction to visualization. The resulting knowledge graph contains 4,705 nodes and 17,386 relations. This work remedies the shortage of structured data on Qinghai-Tibet culture in the humanities and provides a reference for digital humanities research on Qinghai-Tibet cultural tourism.”

pdf bib
基于互信息最大化和对比损失的多模态对话情绪识别模型(Multimodal Emotion Recognition in Conversation with Mutual Information Maximization and Contrastive Loss)
Qianer Li (黎倩尔) | Peijie Huang (黄沛杰) | Jiawei Chen (陈佳炜) | Jialin Wu (吴嘉林) | Yuhong Xu (徐禹洪) | Peiyuan Lin (林丕源)

“Multimodal emotion recognition in conversation (ERC) is key to building empathetic dialogue systems. In recent years, graph-based fusion methods that dynamically aggregate multimodal contextual features in a conversation have improved multimodal ERC performance. However, these methods do not fully preserve and exploit the valuable information in the input: they neither retain task-relevant information from input through to the fusion result nor use the information carried by the labels themselves. This paper proposes MMIC, a multimodal ERC model based on mutual information maximization and contrastive loss, to solve these problems. The model hierarchically maximizes mutual information between modalities at the input and fusion levels, preserving task-relevant information through fusion and producing richer multimodal representations. We also introduce supervised contrastive learning into the graph-based dynamic fusion network; by fully exploiting the information in labels, different emotions repel each other, strengthening the model's ability to distinguish similar emotions. Extensive experiments on two English and one Chinese public datasets demonstrate the model's effectiveness and superiority. Case studies confirm that the model preserves task-relevant information and better distinguishes similar emotions, and ablation studies and visualizations verify the effectiveness of each module.”

pdf bib
基于语义任务辅助的方面情感分析(Semantic Task-assisted Aspect-based Sentiment Analysis)
Zhaozhen Wu (吴肇真) | Hui Zhao (赵晖) | Tiquan Gu (谷体泉) | Guoyi Cao (曹国义)

“Aspect-Based Sentiment Analysis (ABSA) aims to determine the fine-grained sentiment polarity of different aspects in a sentence. Effectively capturing a sentence's semantics is key to this task. Most existing classification methods introduce external knowledge and design complex modules to understand sentence semantics, ignoring the noise from external parsers and the growing complexity of the model. In this paper, we propose a multi-task learning network based on semantic understanding that captures sentence semantics from the raw corpus through multi-task learning. From a multi-task perspective, on the original dataset with shared parameters, we propose two auxiliary semantic tasks: aspect-context order prediction and aspect-context syntactic dependency prediction. The auxiliary tasks are trained jointly with the original aspect sentiment classification task to obtain an encoder with enhanced semantic understanding, which is then used for aspect sentiment classification. Experimental results show good accuracy and Macro-F1 on three major public datasets: Rest14, Lap14, and Twitter.”

pdf bib
中国社会道德变化模型与发展动因探究——基于70年《人民日报》的计量与分析 (The Model of Moral Change and Motivation in Chinese Society——The Vocabulary Analysis of the 70-year “People’s Daily”)
Hongrui Wang (王弘睿) | Dong Yu (于东) | Pengyuan Liu (刘鹏远) | Liying Zeng (曾立英)

“The study of diachronic change in social morality is of great significance. Observing the diachronic link between language use and moral change helps describe the trends and patterns of social morality, track its dynamics, and advance moral development. Systematic, comprehensive computational studies of moral change over large-scale diachronic corpora from a lexical perspective are currently lacking. This paper therefore proposes a diachronic quantitative model of moral theme words and, using quantitative indicators, computes and analyzes 70 years (1946-2015) of the People's Daily corpus, observing the selection and change of moral theme words over those 70 years. The results reveal an interactive relationship between the diachronic use of moral vocabulary and social morality, reflecting the diachronic transformation and development of Chinese social morality over 70 years.”

pdf bib
动词视角下的汉语性别表征研究——基于多语体语料库与依存分析(Gendered Representation in Chinese via Verbal Analysis —Based on a Multi-register Corpus and Dependency Parsing)
Yingshi Chen (陈颖诗) | Dong Yu (于东) | Pengyuan Liu (刘鹏远)

“Action is an important form of gender socialization; studying the gendered representation of verbs in Chinese reveals the paths, i.e., the means and forms, by which language constructs different gender identities. Using dependency syntax, we extract from corpora of four registers the verbs that form dependency structures with gender words, identify verbs with significant gender differences, and conduct quantitative and qualitative analyses according to the syntactic roles of the gender words, combined with semantics. Overall, most Chinese verb representations are gender-neutral and only a minority of verbs mark gender; as a language bearing Chinese wisdom and a profound cultural heritage, Chinese represents gender neutrally and equally, reflecting China's conception of gender equality. Among gender-marking verbs, two different paths for constructing male and female identities emerge. Verbs that significantly represent women outnumber those representing men in all registers, but the semantic distribution of male-representing verbs is more balanced, reflecting a 'male default, female specialization' pattern. Among judicial verbs, women frequently appear as victims of violence while the male perpetrators are invisible, reflecting 'male dominance, female submission'. Verbs in different registers play different roles in constructing gender: news shapes relatively traditional gender norms, while traditional and online literature break established gender norms in different ways.”

pdf bib
基于多任务多模态交互学习的情感分类方法(Sentiment classification method based on multitasking and multimodal interactive learning)
Peng Xue (薛鹏) | Yang Li (李旸) | Suge Wang (王素格) | Jian Liao (廖健) | Jianxing Zheng (郑建兴) | Yujie Fu (符玉杰) | Deyu Li (李德玉)

“With the rapid development of social media, multimodal data has grown explosively, and mining and understanding sentiment in multimodal data has become a popular research direction. Existing multimodal sentiment analysis methods based on text, video, and audio often fuse the high-level features of one modality with the low-level features of another, ignoring differences between feature levels across modalities. This paper therefore proposes a self-supervised dynamic fusion model with multi-task, multimodal interactive learning, centered on the text modality and supplemented by the audio and video modalities. Through a multi-layer structure, we build hierarchical fusion representations of unimodal features and pairwise modality features, letting the model fuse features at different levels, with a fusion strategy that progresses from low-level to high-level features. To further strengthen multimodal feature fusion, a distribution-similarity loss and a heterogeneity loss are used to learn common and modality-specific representations. On this basis, multi-task learning yields consistency and difference features across modalities. Experiments on the CMU-MOSI and CMU-MOSEI datasets show that the model's sentiment classification performance surpasses the baseline models.”

pdf bib
基于动态常识推理与多维语义特征的幽默识别(Humor Recognition based on Dynamically Commonsense Reasoning and Multi-Dimensional Semantic Features)
Tuerxun Tunike (吐妮可·吐尔逊) | Hongfei Lin (林鸿飞) | Dongyu Zhang (张冬瑜) | Liang Yang (杨亮) | Changrong Min (闵昶荣)

“With the rapid development of social media, humor recognition has attracted wide attention from researchers in recent years. The task is to judge whether a given text expresses humor. Existing methods, supported by theories of humor production, use rules or neural network models to extract various humor-related features, such as incongruity, sentiment, and phonetic features. These methods show, on the one hand, the importance of sentiment information in modeling humorous semantics and, on the other, that humorous semantics is built from features along multiple dimensions. However, they do not fully capture the sentiment features inside the text and ignore implicit sentiment expressions in humorous texts, hurting recognition accuracy. To solve this, we propose CMSOR, a humor recognition method driven by dynamic commonsense and multi-dimensional semantic features. The method first infers the speaker's implicit sentiment from the text dynamically using external commonsense information, then introduces the external lexicon WordNet to compute word-level semantic distance within the text and thereby capture incongruity, and also computes textual ambiguity features. Finally, humorous semantics is constructed along these three feature dimensions for humor recognition. Experiments on three public datasets show that CMSOR clearly improves over current baselines.”

pdf bib
融合Synonyms 词库的专利语义相似度计算研究(Patent Semantic Similarity Calculation by Fusing Synonyms Database)
Xinyu Tong (佟昕瑀) | Jialun Liao (廖佳伦) | Yonghe Lu (路永和)

“Patent similarity computation and comparison have long been carried out manually by patent examiners, who make accurate judgments. However, manually analyzing and judging patents' originality, utility, and potential infringement requires large human and material resources and is inefficient. This paper therefore uses the ALBERT pre-trained model for patent text representation, strengthens the semantic expressiveness of patent text by introducing the Synonyms near-synonym lexicon, and explores a patent text representation model and similarity computation method based on a semantic knowledge base and deep learning. Experimental results show that adding Synonyms-based disambiguation improves the accuracy of patent text similarity measurement.”
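The following minimal sketch shows only the Synonyms side of such a pipeline (the ALBERT representation is omitted); the toy patent titles are invented, and `synonyms.compare`/`synonyms.nearby` are the public APIs of the Synonyms package as we understand them.

```python
# Hedged sketch, not the paper's model: use the Synonyms package to compare
# patent text and to expand key terms with near-synonyms.
import synonyms

title_a = "一种图像识别方法"   # toy patent titles (invented)
title_b = "一种人脸识别装置"

# Sentence-level similarity from Synonyms (0..1, higher = closer)
sim = synonyms.compare(title_a, title_b, seg=True)
print(f"similarity = {sim:.3f}")

# Near-synonym expansion for a key term, used to enrich the semantic
# representation of patent text before similarity computation
words, scores = synonyms.nearby("识别")
print(list(zip(words, scores))[:5])
```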

pdf bib
中医临床切诊信息抽取与词法分析语料构建及联合建模方法(On Corpus Construction and Joint Modeling for Clinical Pulse Feeling and Palpation Information Extraction and Lexical Analysis of Traditional Chinese Medicine)
Yaqiang Wang (王亚强) | Wen Jiang (蒋文) | Yongguang Jiang (蒋永光) | Hongping Shu (舒红平)

“Pulse feeling and palpation is the most distinctively TCM diagnostic method among the four clinical examinations of traditional Chinese medicine and provides important evidence for TCM syndrome differentiation and treatment; research on clinical pulse information extraction and lexical analysis therefore has great clinical application value. This paper presents the first study of corpus construction and joint modeling for TCM clinical pulse information extraction and lexical analysis. Taking more than ten thousand TCM clinical records as the research object, we propose a corpus construction framework, formulate annotation guidelines for pulse information extraction, Chinese word segmentation, and POS tagging, and produce a corpus that supports joint multi-task modeling, with final annotation agreement above 0.94. Based on a same-level multi-task model with shared encoder parameters, we explore joint modeling of clinical pulse information extraction and lexical analysis and verify the method's effectiveness.”

pdf bib
大规模语言模型增强的中文篇章多维度阅读体验量化研究(Quantitative Research on Multi-dimensional Reading Experience of Chinese Texts Enhanced by Large Language Model)
Jiadai Sun (孙嘉黛) | Siyi Tang (汤思怡) | Shike Wang (王诗可) | Dong Yu (于东) | Pengyuan Liu (刘鹏远)

“Existing research on graded reading mostly proceeds from text readability, recommending books to readers in the form of discrete difficulty levels. There is still no framework for studying the multifaceted, deep reading experiences that readers have during reading. We surveyed the different reading experiences readers have with Chinese texts and propose a quantification scheme for multi-dimensional reading experience of Chinese texts. We categorize the continuous reading experiences that arise during reading and, on this basis, build a multi-dimensional reading experience dataset of Chinese texts. We also examine the ability of ChatGPT, built on a large language model, to quantify reading experience, and find that despite its strong information extraction and semantic understanding abilities, it performs poorly at quantifying reading experience. However, the capabilities of large language models can assist the quantification of deep attributes via knowledge distillation; on this basis we implement an LLM-enhanced model for quantifying multi-dimensional reading experience of Chinese texts. The model's average F1 across the reading experience dimensions reaches 0.72, higher than ChatGPT's few-shot result of 0.48.”

pdf bib
融合文本困惑度特征和相似度特征的推特机器人检测方法(Twitter robot detection method based on text perplexity feature and similarity feature)
Zhongjie Wang (王钟杰) | Zhaowen Zhang (张朝文) | Wenqi Ding (丁文琪) | Yumeng Fu (付雨濛) | Lili Shan (单丽莉) | Bingquan Liu (刘秉权)

“The goal of Twitter bot detection is to judge whether a Twitter account belongs to a real person or is an automated bot. As the algorithms that make automated accounts humanlike iterate rapidly, detecting the newest kinds of automated accounts becomes increasingly difficult. Recently, pre-trained language models have performed impressively on natural language generation and other tasks; when they are used to automatically generate tweets, they pose a serious challenge to Twitter bot detection. This paper finds that abnormally low perplexity and abnormally high similarity consistently appear in the historical tweets of automated accounts of different eras, and that this phenomenon is unaffected by pre-trained language models. Based on these findings, we propose a method for extracting perplexity and similarity features from historical tweets and design a feature fusion strategy to better apply these new features to existing models. Our method outperforms existing baselines on the selected datasets and won first place in the social bot detection competition hosted by People's Daily Online and organized by the State Key Laboratory of Communication Content Cognition.”
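A hedged sketch of the two feature families reported above: per-tweet perplexity from a causal language model and mean pairwise similarity over an account's history. The LM name, the toy tweets, and the TF-IDF similarity choice are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of perplexity and similarity account-level features.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tok = AutoTokenizer.from_pretrained("gpt2")            # placeholder LM
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        nll = lm(ids, labels=ids).loss.item()           # mean token NLL
    return math.exp(nll)

history = ["great deal on sunglasses", "great deal on sunglasses today",
           "I finally fixed my bike this weekend"]      # toy tweet history

ppl_feature = sum(perplexity(t) for t in history) / len(history)

tfidf = TfidfVectorizer().fit_transform(history)
sims = cosine_similarity(tfidf)                         # pairwise tweet similarity
n = len(history)
sim_feature = (sims.sum() - n) / (n * (n - 1))          # mean off-diagonal entry

# Low average perplexity plus high average similarity suggests automation.
print(ppl_feature, sim_feature)
```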

pdf bib
差比句结构及其缺省现象的识别补全研究(A Study on Identification and Completion of Comparative Sentence Structures with Ellipsis Phenomenon)
Pengfei Zhou (周鹏飞) | Weiguang Qu (曲维光) | Tingxin Wei (魏庭新) | Junsheng Zhou (周俊生) | Bin Li (李斌) | Yanhui Gu (顾彦慧)

“Comparative sentences express similarities or differences between two or more things; a common pattern is ‘X 比 Y + comparison result’. Comparative sentences have many structural variants and frequent ellipsis, which complicates Chinese grammar research and natural language processing tasks, so recognizing comparative structures and completing their elided parts is very worthwhile. This paper builds a comparative sentence corpus using sequence labeling, proposes a LatticeBERT-BILSTM-CRF model that fuses character and word information to automatically recognize comparative structures, and automatically completes elided units; experimental results verify the method's effectiveness.”

pdf bib
基于框架语义场景图的零形式填充方法(A Null Instantiation Filling Method based Frame Semantic Scenario Graph)
Yuzhi Wang (王俞智) | Ru Li (李茹) | Xuefeng Su (苏雪峰) | Zhichao Yan (闫智超) | Juncai Li (李俊材)

“Null instantiation filling finds, in the discourse context, the content that fills implicit frame semantic roles of a given sentence. Traditional methods use pipeline models, which propagate errors and ignore the importance of explicit semantic roles and their fillers. To address these problems, we propose an end-to-end null instantiation filling method that builds a frame semantic scenario graph with Chinese FrameNet information and models it with a GAT, obtaining candidate filler representations fused with explicit frame element information and strengthening the model's ability to identify implicit semantic components in the sentence. Experiments on a Chinese null instantiation filling dataset show our model improves F1 by 9.16% over a BERT-based baseline, demonstrating the effectiveness of the proposed method.”

pdf bib
基于FLAT的农业病虫害命名实体识别(Named Entity Recognition of Agricultural Pests and Diseases based on FLAT)
Yi Ren (任义) | Jie Shen (沈洁) | Shuai Yuan (袁帅)

“To address the problems that word embeddings in traditional named entity recognition cannot represent polysemy and that character-word fusion models extract features inaccurately, this paper proposes an interactive feature fusion model based on FLAT. The model first obtains character and word vectors via external dictionary matching; after BERT pre-training, a purpose-built interactive feature fusion module fully mines the dependencies between characters and words. In addition, adversarial training is introduced to improve robustness. A special relative position encoding then feeds the data into the self-attention mechanism, and a CRF finally produces the globally optimal sequence. On an agricultural pest and disease dataset, the model achieves precision, recall, and F1 of 93.76%, 92.14%, and 92.94%, respectively.”

pdf bib
基于结构树库的补语位形容词语义分析及搭配库构建(Semantic analysis of complementary adjectives and construction of collocation database based on structural tree library)
Siyu Tian (田思雨) | Tian Shao (邵田) | Endong Xun (荀恩东) | Gaoqi Rao (饶高琦)

“In bound predicate-complement constructions where an adjective serves as the complement, two predicative elements typically appear in succession (‘adjective + adjective’ or ‘verb + adjective’). Because this construction has no formal marker, it is difficult for computers to identify automatically; moreover, serving as a complement is not an adjective's most basic or typical use (attributive or predicative), and the phenomenon has not received sufficient attention in linguistics or computational linguistics. This paper therefore takes complement-position adjectives as its object: it extracts predicate-complement structures in which adjectives directly serve as complements from a large-scale syntactic treebank, denoises the corpus through programming plus manual verification, exhaustively retrieves complement-position adjectives to obtain a word list, further subclassifies their semantics, and builds a corresponding semantic collocation database. This not only improves the accuracy of syntactic segmentation and provides semantic information for deep syntactic-semantic analysis, but also offers a reference for research on linguistic ontology.”

pdf bib
基于BiLSTM聚合模型的汉语框架语义角色识别(Chinese Frame Semantic Role Identification Based on BiLSTM Aggregation Model)
Xuefei Cao (曹学飞) | Hongji Li (李济洪) | Ruibo Wang (王瑞波) | Qian Niu (牛倩)

“The performance of neural network models for Chinese frame semantic role identification remains low. Since neural model performance is affected by hyperparameters, this paper unifies hyperparameter tuning and prediction improvement within a BiLSTM-based aggregation model framework. We run experiments with regularized cross-validation, constraining the distribution difference between training and validation sets to avoid performance fluctuations caused by distribution mismatch. Cross-validation results are combined by mode (majority) voting; the voted results are used to evaluate different hyperparameter configurations, and several configurations without significant differences are selected to form an optimal configuration set. The sub-models corresponding to this set are then aggregated into an aggregation model for Chinese frame semantic role identification. Experimental results show our method significantly improves performance over the baseline model by 9.56%.”
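The sketch below illustrates only the mode-voting idea on a stand-in classifier: one candidate hyperparameter setting is trained on each cross-validation fold, fold predictions on a held-out split are combined by majority vote, and the voted accuracy scores that setting. The paper's BiLSTM model and regularized cross-validation are replaced by simple placeholders.

```python
# Toy sketch: rate one hyperparameter setting by majority-voting CV fold models.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=300, random_state=0)
X_eval, y_eval = X[:50], y[:50]          # held-out evaluation split
X_tr, y_tr = X[50:], y[50:]

preds = []
for tr_idx, _ in KFold(n_splits=5, shuffle=True, random_state=0).split(X_tr):
    clf = LogisticRegression(C=1.0).fit(X_tr[tr_idx], y_tr[tr_idx])  # C = candidate
    preds.append(clf.predict(X_eval))

# Majority vote over the 5 fold-models (binary labels), then score the setting
voted = (np.vstack(preds).mean(axis=0) >= 0.5).astype(int)
print("voted accuracy:", (voted == y_eval).mean())
```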

pdf bib
L2到L1的跨语言激活路径研究——基于词汇识别的ERP数据(Cross-lingual Activation Path from L2 to L1——Based on ERP Data during Word Recognition)
Siqin Yang (杨思琴) | Minghu Jiang (江铭虎)

“Cross-lingual lexical activation models are a hot topic in language cognition and computation. Using event-related potentials (ERPs), this study explores the path by which second language (L2) learners activate native language (L1) lexical representations when recognizing L2 words. We designed a hidden priming paradigm for two experiments, inferring activation from whether participants perceived a hidden condition, repetition of translation equivalents, which can only be detected if L1 lexical representations are activated. The EEG results show that in Experiment 1, participants performing a semantic judgment task exhibited a significant N400 difference between repeated and non-repeated translation equivalents, indicating that they activated L1 lexical representations via conceptual representations and demonstrating the existence of the activation path Path-1 (L2>L1). In Experiment 2, participants performing an orthographic judgment task perceived the hidden translation-equivalent condition even without semantic priming, indicating that L2 lexical representations can directly activate L1 lexical representations and demonstrating the existence of the activation path Path-2 (L2>L1). Overall, the activation path from L2 to L1 lexical representations during word recognition resembles the path during word production described by the Revised Hierarchical Model (RHM). We therefore speculate that, although the brain uses different processing mechanisms for word recognition and word production, they share commonalities in cross-lingual lexical activation.”

pdf bib
汉语语义构词的资源建设与计算评估(Construction of Chinese Semantic Word-Formation and its Computing Applications)
Yue Wang (王悦) | Yang Liu (刘扬) | Qiliang Liang (梁启亮) | Hansi Wang (王涵思)

“Chinese is a paratactic language, and the way morphemes combine into words is an important factor in describing and understanding word meaning. Regarding morpheme combination, linguists hold two views, grammatical word formation and semantic word formation, of which semantic word formation expresses the relations between morphemes more deeply. This paper takes the semantic word-formation route: from a linguistic perspective and considering the characteristics of Chinese word formation, we propose a computation-oriented system of semantic word-formation structures, build a Chinese semantic word-formation knowledge base through random forest auto-annotation combined with manual verification, and computationally evaluate the resource on the task of word meaning generation. The experiments achieve good results: word meaning generation based on the semantic word-formation knowledge base reaches a BLEU of 25.07, 3.17% higher than the earlier grammatical word-formation approach, preliminarily verifying the effectiveness of this knowledge representation. The representation method and resource will provide new ideas and solutions for applications in the humanities and in information processing.”

pdf bib
基于多尺度建模的端到端自动语音识别方法(An End-to-End Automatic Speech Recognition Method Based on Multiscale Modeling)
Hao Chen (陈昊) | Runlai Zhang (张润来) | Yuhao Zhang (张裕浩) | Chenghao Gao (高成浩) | Chen Xu (许晨) | Anxiang Ma (马安香) | Tong Xiao (肖桐) | Jingbo Zhu (朱靖波)

“In recent years, end-to-end automatic speech recognition models based on deep learning, which model speech and text directly, have become mainstream thanks to their simple structure and clear performance advantages. However, the huge differences in length and representation scale between continuous speech signals and discrete text create a modality gap that has long troubled this class of tasks. To solve this problem, we propose a multiscale modeling method for speech recognition: starting from exploiting fine-grained distributional knowledge, it builds textual information at multiple scales and aligns the feature sequence step by step from fine-grained, low-level sequences up to the predicted text sequence. This progressive prediction effectively lowers prediction difficulty and mitigates the modality gap, and by fusing features at different scales it enriches and completes the corpus information, further strengthening the model's inference ability. Experiments on the LibriSpeech small-scale and large-scale sets and on TED-LIUM 2 reduce word error rate by an average of 1.7, 0.45, and 0.76 over the baseline systems, verifying the method's effectiveness.”

pdf bib
基于血缘关系结构的亲属关系推理算法研究(A Study on Kinship Inference Algorithm Based on Blood Relationship Structure)
Dawei Lu (卢达威) | Siqin Yang (杨思琴)

“Previous kinship reasoning systems could not guarantee the correctness of their inferences, erred easily on complex kin relations, and struggled with reasoning problems in which multiple kin relations are given as premises. Building on Lu Dawei et al. (2019), this paper first formalizes the inference rules and the inference procedure as algorithms; it then compares the system with one based on first-order predicate logic, finding that kinship reasoning based on blood-relationship structure has advantages in both knowledge representation and inference rules, chiefly higher execution efficiency and fewer errors when writing and checking rules; finally, it discusses the time complexity of the kinship reasoning algorithm, finding that the system runs in linear time. The algorithm and its effectiveness analysis are supported by experimental results.”

pdf bib
基于深加工语料库的《唐诗三百首》难度分级(The difficulty classification of ‘Three Hundred Tang Poems’ based on the deep processing corpus)
Yuyu Huang (黄宇宇) | Xinyu Chen (陈欣雨) | Minxuan Feng (冯敏萱) | Yunuo Wang (王禹诺) | Beiyuan Wang (王蓓原) | Bin Li (李斌)

“To assist the selection of Tang poems for primary and secondary school textbooks and readers, this paper builds on a deeply processed corpus of the Three Hundred Tang Poems annotated with word segmentation, POS tags, and allusions, and innovatively constructs a difficulty grading standard based on verse readability. The standard has 4 layers with 8 quantifiable indicators in total: the character layer (phonetic loan characters), the word layer (two-character words), the sentence layer (special constructions, title length, verse length), and the artistic layer (allusions, other rhetoric, descriptive techniques). The 313 poems in the corpus are scored on these 8 indicators, a vector space model is built from the quantified features, and the K-means clustering algorithm groups the poems to match Tang poetry study at the primary, junior high, and senior high levels.”
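A toy sketch of the clustering step, assuming each poem has already been scored on the 8 indicators; the feature values below are invented for illustration.

```python
# Sketch: cluster poems by 8 quantified difficulty indicators with K-means.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# rows: poems; columns: loan chars, two-char words, special constructions,
# title length, verse length, allusions, other rhetoric, descriptive devices
poems = np.array([
    [0, 2, 0, 4, 20, 0, 1, 1],
    [1, 5, 1, 6, 28, 2, 2, 2],
    [3, 9, 2, 8, 56, 5, 3, 3],
    [0, 3, 0, 5, 20, 1, 1, 1],
    [2, 8, 2, 7, 56, 4, 2, 3],
])

X = StandardScaler().fit_transform(poems)          # put indicators on one scale
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(labels)  # 3 clusters ~ primary / junior high / senior high
```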

pdf bib
基于RoBERTa的中文仇恨言论侦测方法研究(Chinese Hate Speech detection method Based on RoBERTa-WWM)
Xiaojun Rao (饶晓俊) | Yangsen Zhang (张仰森) | Shuang Peng (彭爽) | Qilong Jia (贾启龙) | Xueyang Liu (刘雪阳)

“With the spread of the Internet, social media provides a platform for exchanging views, but its virtuality and anonymity have also intensified the spread of hate speech, so automatic hate speech detection is vital to the healthy development of social media platforms. To address this, we build CHSD, a Chinese hate speech dataset, and propose RoBERTa-CHHSD, a Chinese hate speech detection model. The model first applies the RoBERTa pre-trained language model to serialize Chinese hate speech and extract textual features; it then feeds them into a TextCNN model and a Bi-GRU model to extract multi-level local semantic features and global inter-sentence dependencies, respectively; the two results are fused to extract deeper hate speech features and classify Chinese hate speech, thereby detecting it. Experimental results show the model reaches an F1 of 89.12% on CHSD, 1.76% higher than the current best mainstream model, RoBERTa-WWM.”

pdf bib
汉语被动结构解析及其在CAMR中的应用研究(Parsing of Passive Structure in Chinese and Its Application in CAMR)
Kang Hu (胡康) | Weiguang Qu (曲维光) | Tingxin Wei (魏庭新) | Junsheng Zhou (周俊生) | Bin Li (李斌) | Yanhui Gu (顾彦慧)

“The Chinese passive sentence is an important linguistic phenomenon. Using BIO tagging combined with indices, this paper annotates the passive structures inside passive sentences at fine granularity and proposes a CRF sequence labeling model based on the BERT-wwm-ext pre-trained model and biaffine attention to automatically parse the internal structure of Chinese passive sentences, reaching an F1 of 97.31%. The proposed model generalizes well: experiments show that post-processing CAMR graphs with the model's passive-structure parses effectively improves the performance of CAMR parsing on passive sentences.”

pdf bib
人工智能生成语言与人类语言对比研究——以ChatGPT为例(A Comparative Study of Language between Artificial Intelligence and Human: A Case Study of ChatGPT)
Junhui Zhu (朱君辉) | Mengyan Wang (王梦焰) | Erhong Yang (杨尔弘) | Jingran Nie (聂锦燃) | Yujie Wang (王誉杰) | Yan Yue (岳岩) | Liner Yang (杨麟儿)

“ChatGPT, a chatbot built on natural language generation technology, can produce answers quickly, but the ways in which machine-generated language differs from genuine human language have not yet been fully studied. This study extracts and computes the distributions of 159 linguistic features in human and ChatGPT answers to open-domain Chinese questions, trains AI detectors with three machine learning algorithms (random forest, logistic regression, and support vector machine (SVM)), and evaluates model performance. The experimental results show that both random forest and SVM achieve high classification accuracy. Through comparative analysis, the study reveals the strengths and weaknesses of the two kinds of text along five dimensions: descriptive features, word commonness, lexical diversity, syntactic complexity, and discourse cohesion. The results show that the differences between the two kinds of text concentrate in three dimensions: descriptive features, word commonness, and lexical diversity.”
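An illustrative sketch of the detector setup, with 3 invented features standing in for the paper's 159 and synthetic data in place of the real answer texts.

```python
# Sketch: train a random forest on per-text linguistic feature vectors.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# toy features: mean sentence length, type-token ratio, function-word rate
human = rng.normal([18, 0.62, 0.45], 0.05, size=(200, 3))
chatgpt = rng.normal([24, 0.55, 0.50], 0.05, size=(200, 3))
X = np.vstack([human, chatgpt])
y = np.array([0] * 200 + [1] * 200)          # 0 = human, 1 = ChatGPT

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```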

pdf bib
古汉语通假字资源库的构建及应用研究(The Construction and Application of an Ancient Chinese Language Resource on Tongjiazi)
Zhaoji Wang (王兆基) | Shirui Zhang (张诗睿) | Xuetao Zhang (张学涛) | Renfen Hu (胡韧奋)

“Tongjiazi (phonetic loan characters) are common in ancient texts; they not only hinder human understanding but also pose an important challenge for ancient Chinese information processing. To support both manual judgment and machine processing of tongjiazi, we build and open-source a multi-dimensional tongjiazi resource comprising three sub-libraries: a corpus, a knowledge base, and an evaluation dataset. The corpus contains more than 11,000 examples with detailed annotation of loan phenomena; the knowledge base takes Chinese characters as nodes and loan and phono-semantic relations as edges, characterizing loan characters and their standard forms from the angles of pronunciation, form, and meaning, with 4,185 character nodes and 8,350 relation pairs; the evaluation dataset targets ancient Chinese information processing, supports the two subtasks of loan character detection and standard character identification, and contains 19,678 evaluation items. On this basis, we build a series of baseline models for automatic tongjiazi identification and, drawing on the experimental results, analyze the factors affecting automatic identification and possible improvements. We further discuss applications of the resource in ancient text collation, humanities research, and classical Chinese teaching.”

pdf bib
SpaCE2022中文空间语义理解评测任务数据集分析报告(A Quality Assessment Report of the Chinese Spatial Cognition Evaluation Benchmark)
Liming Xiao (肖力铭) | Chunhui Sun (孙春晖) | Weidong Zhan (詹卫东) | Dan Xing (邢丹) | Nan Li (李楠) | Chengwen Wang (王诚文) | Fangwei Zhu (祝方韦)

“The second Chinese Spatial Cognition Evaluation (SpaCE2022) tests machines' spatial semantic understanding through three subtasks: (1) judging the correctness of Chinese spatial semantics; (2) attributing spatial semantic anomalies and identifying anomalous text; (3) recognizing Chinese spatial entities and annotating spatial orientation relations. This paper introduces the annotation guidelines and construction process of the SpaCE2022 dataset and summarizes the methods used to improve its quality: building the STEP annotation scheme to describe spatial semantic information in a standardized way; generating spatially anomalous sentences from linguistic knowledge to increase data diversity; strengthening quality control through double annotation, rule-based real-time checks, and manual sample audits; and managing annotated data in tiers so that only high-quality data enters the dataset. Examining the dataset's distribution together with machine and human performance, we find that the label distribution of SpaCE2022 is clearly skewed, and that the correctness judgment and anomaly attribution tasks are highly subjective with low agreement; these issues await further optimization in future SpaCE task design.”

pdf bib
基于预训练语言模型的端到端概念体系构建方法(End to End Taxonomy Construction Method with Pretrained Language Model)
Siyi Wang (王思懿) | Shizhu He (何世柱) | Kang Liu (刘康) | Jun Zhao (赵军)

“A taxonomy describes the hypernym-hyponym relations between concepts and organizes them into a hierarchical structure; it is an important kind of knowledge resource. This paper studies automatic taxonomy construction: organizing a given set of concepts (words) into a tree-structured taxonomy (concept tree) according to hypernymy. Traditional approaches decompose the task into two independent subtasks: judging hypernymy between concepts and building the hierarchical structure. The two subtasks lack information feedback, which easily causes error accumulation. In recent years, more and more work uses pre-trained language models to obtain lexical semantic features and judge semantic relations between words; although this has achieved some success in taxonomy construction, it can only model the first subtask, so error accumulation remains. To solve the error accumulation of step-by-step methods and effectively capture the semantics of words and their relations, we propose an end-to-end taxonomy construction method based on pre-trained language models: on one hand, the pre-trained language model captures the semantics of concepts and their hypernym relations together with the structural information of partial taxonomies; on the other, reinforcement learning models relation judgment and full-structure generation end to end. Experiments on WordNet datasets show the method works well; under identical conditions, our F1 is a 7.3% relative improvement over the best model.”

pdf bib
Ask to Understand: Question Generation for Multi-hop Question Answering
Li Jiawei | Ren Mucheng | Gao Yang | Yang Yizhe

“Multi-hop Question Answering (QA) requires the machine to answer complex questions by finding scattered clues and reasoning from multiple documents. Graph Network (GN) and Question Decomposition (QD) are two common approaches at present. The former uses a “black-box” reasoning process to capture the potential relationship between entities and sentences, thus achieving good performance. At the same time, the latter provides a clear reasoning logical route by decomposing multi-hop questions into simple single-hop sub-questions. In this paper, we propose a novel method to complete multi-hop QA from the perspective of Question Generation (QG). Specifically, we carefully design an end-to-end QG module on the basis of a classical QA module, which could help the model understand the context by asking inherently logical sub-questions, thus inheriting interpretability from the QD-based method and showing superior performance. Experiments on the HotpotQA dataset demonstrate the effectiveness of our proposed QG module, human evaluation further clarifies its interpretability quantitatively, and thorough analysis shows that the QG module could generate better sub-questions than QD methods in terms of fluency, consistency, and diversity.”

pdf bib
Learning on Structured Documents for Conditional Question Answering
Wang Zihan | Qian Hongjin | Dou Zhicheng

“Conditional question answering (CQA) is an important task in natural language processing that involves answering questions that depend on specific conditions. CQA is crucial for domains that require the provision of personalized advice or making context-dependent analyses, such as legal consulting and medical diagnosis. However, existing CQA models struggle with generating multiple conditional answers due to two main challenges: (1) the lack of supervised training data with diverse conditions and corresponding answers, and (2) the difficulty of outputting in a complex format that involves multiple conditions and answers. To address the challenge of limited supervision, we propose LSD (Learning on Structured Documents), a self-supervised learning method on structured documents for CQA. LSD involves a conditional problem generation method and a contrastive learning objective. The model is trained with LSD on massive unlabeled structured documents and is fine-tuned on a labeled CQA dataset afterwards. To overcome the limitation of outputting answers with complex formats in CQA, we propose a pipeline that enables the generation of multiple answers and conditions. Experimental results on the ConditionalQA dataset demonstrate that LSD outperforms previous CQA models in terms of accuracy both in providing answers and conditions.”

pdf bib
Overcoming Language Priors with Counterfactual Inference for Visual Question Answering
Ren Zhibo | Wang Huizhen | Zhu Muhua | Wang Yichao | Xiao Tong | Zhu Jingbo

“Recent years have seen a lot of efforts in attacking the issue of language priors in the field of Visual Question Answering (VQA). Among the extensive efforts, causal inference is regarded as a promising direction to mitigate language bias by weakening the direct causal effect of questions on answers. In this paper, we follow the same direction and attack the issue of language priors by incorporating counterfactual data. Moreover, we propose a two-stage training strategy which is deemed to make better use of counterfactual data. Experiments on the widely used benchmark VQA-CP v2 demonstrate the effectiveness of the proposed approach, which improves the baseline by 21.21% and outperforms most of the previous systems.”

pdf bib
Rethinking Label Smoothing on Multi-hop Question Answering
Yin Zhangyue | Wang Yuxin | Hu Xiannian | Wu Yiguang | Yan Hang | Zhang Xinyu | Cao Zhao | Huang Xuanjing | Qiu Xipeng

“Multi-Hop Question Answering (MHQA) is a significant area in question answering, requiring multiple reasoning components, including document retrieval, supporting sentence prediction, and answer span extraction. In this work, we present the first application of label smoothing to the MHQA task, aiming to enhance generalization capabilities in MHQA systems while mitigating overfitting of answer spans and reasoning paths in the training set. We introduce a novel label smoothing technique, F1 Smoothing, which incorporates uncertainty into the learning process and is specifically tailored for Machine Reading Comprehension (MRC) tasks. Moreover, we employ a Linear Decay Label Smoothing Algorithm (LDLA) in conjunction with curriculum learning to progressively reduce uncertainty throughout the training process. Experiments on the HotpotQA dataset confirm the effectiveness of our approach in improving generalization and achieving significant improvements, leading to new state-of-the-art performance on the HotpotQA leaderboard.”
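A hedged sketch of linear-decay label smoothing in the spirit of LDLA (the exact schedule is assumed, not taken from the paper): the smoothing weight decays linearly toward hard labels as training proceeds.

```python
# Sketch: label-smoothed cross-entropy with a linearly decaying smoothing weight.
import torch
import torch.nn.functional as F

def smoothed_ce(logits, target, eps):
    """Cross-entropy against a target distribution smoothed with weight eps."""
    logp = F.log_softmax(logits, dim=-1)
    nll = -logp.gather(-1, target.unsqueeze(-1)).squeeze(-1)  # gold-label term
    uniform = -logp.mean(dim=-1)                              # uniform smoothing term
    return ((1 - eps) * nll + eps * uniform).mean()

def ldla_eps(step, total_steps, eps0=0.1):
    # Linear decay: full smoothing early, (near-)hard labels late in training
    return eps0 * max(0.0, 1.0 - step / total_steps)

logits = torch.randn(8, 30)                 # stand-in model outputs (batch of 8)
target = torch.randint(0, 30, (8,))
loss = smoothed_ce(logits, target, ldla_eps(step=2500, total_steps=10000))
print(loss.item())
```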

pdf bib
Improving Zero-shot Cross-lingual Dialogue State Tracking via Contrastive Learning
Xiang Yu | Zhang Ting | Di Hui | Huang Hui | Li Chunyou | Ouchi Kazushige | Chen Yufeng | Xu Jinan

“Recent works in dialogue state tracking (DST) focus on a handful of languages, as collecting large-scale manually annotated data in different languages is expensive. Existing models address this issue by code-switched data augmentation or intermediate fine-tuning of multilingual pre-trained models. However, these models can only perform implicit alignment across languages. In this paper, we propose a novel model named Contrastive Learning for Cross-Lingual DST (CLCL-DST) to enhance zero-shot cross-lingual adaptation. Specifically, we use a self-built bilingual dictionary for lexical substitution to construct multilingual views of the same utterance. Then our approach leverages fine-grained contrastive learning to encourage representations of specific slot tokens in different views to be more similar than negative example pairs. By this means, CLCL-DST aligns similar words across languages into a more refined language-invariant space. In addition, CLCL-DST uses a significance-based keyword extraction approach to select task-related words to build the bilingual dictionary for better cross-lingual positive examples. Experiment results on Multilingual WoZ 2.0 and parallel MultiWoZ 2.1 datasets show that our proposed CLCL-DST outperforms existing state-of-the-art methods by a large margin, demonstrating the effectiveness of CLCL-DST.”
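A minimal sketch of building a code-switched view of an utterance with a bilingual dictionary, as in the lexical substitution step above; the dictionary and utterance are toy placeholders.

```python
# Sketch: lexical substitution to produce a multilingual view of an utterance.
import random

bilingual_dict = {"restaurant": "restaurante", "cheap": "barato", "north": "norte"}

def code_switch(utterance: str, p: float = 0.5, seed: int = 0) -> str:
    """Replace dictionary words with their translations with probability p."""
    rng = random.Random(seed)
    out = []
    for tok in utterance.split():
        if tok in bilingual_dict and rng.random() < p:
            out.append(bilingual_dict[tok])
        else:
            out.append(tok)
    return " ".join(out)

src = "i need a cheap restaurant in the north"
view = code_switch(src)
# (src, view) then serve as a positive pair for the contrastive objective,
# pulling representations of the same slot tokens together across languages.
print(view)
```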

pdf bib
Unsupervised Style Transfer in News Headlines via Discrete Style Space
Liu Qianhui | Gao Yang | Yang Yizhe

“The goal of headline style transfer in this paper is to make a headline more attractive while maintaining its meaning. The absence of parallel training data is one of the main problems in this field. In this work, we design a discrete style space for unsupervised headline style transfer, short for D-HST. This model decomposes style-dependent text generation into content-feature extraction and style modelling. Then, the generation decoder receives input from content, style, and their mixing components. In particular, it is considered that the textual style signal is more abstract than the text itself. Therefore, we propose to model the style representation space as a discrete space, in which each discrete point corresponds to a particular category of the styles that can be elicited by syntactic structure. Finally, we provide a new style-transfer dataset, named TechST, which focuses on transferring news headlines into those that are more eye-catching in technical social media. In the experiments, we develop two automatic evaluation metrics — style transfer rate (STR) and style-content trade-off (SCT) — along with a few traditional criteria to assess the overall effectiveness of the style transfer. In addition, human evaluation is thoroughly conducted in terms of assessing the generation quality and creatively mimicking a scenario in which a user clicks on appealing headlines to determine the click-through rate. Our results indicate that D-HST achieves state-of-the-art results in these comprehensive evaluations.”

pdf bib
Lexical Complexity Controlled Sentence Generation for Language Learning
Nie Jinran | Yang Liner | Chen Yun | Kong Cunliang | Zhu Junhui | Yang Erhong

“Language teachers spend a lot of time developing good examples for language learners. For this reason, we define a new task for language learning, lexical complexity controlled sentence generation, which requires precise control over the lexical complexity of keywords-to-example generation together with better fluency and semantic consistency. The challenge of this task is to generate fluent sentences using only words of the given complexity levels. We propose a simple but effective approach for this task based on complexity embedding while controlling sentence length and syntactic complexity at the decoding stage. Compared with potential solutions, our approach fuses the representations of the word complexity levels into the model to get better control of lexical complexity. And we demonstrate the feasibility of the approach both for training models from scratch and for fine-tuning pre-trained models. To facilitate the research, we develop two datasets in English and Chinese respectively, on which extensive experiments are conducted. Experimental results show that our approach provides more precise control over lexical complexity, as well as better fluency and diversity.”

pdf bib
Dynamic-FACT: A Dynamic Framework for Adaptive Context-Aware Translation
Chen Linqing | Wang Weilei

“Document-level neural machine translation (NMT) has garnered considerable attention since the emergence of various context-aware NMT models. However, these static NMT models are trained on fixed parallel datasets, thus lacking awareness of the target document during inference. In order to alleviate this limitation, we propose a dynamic adapter-translator framework for context-aware NMT, which adapts the trained NMT model to the input document prior to translation. Specifically, the document adapter reconstructs the scrambled portion of the original document from a deliberately corrupted version, thereby reducing the performance disparity between training and inference. To achieve this, we employ an adaptation process in both the training and inference stages. Our experimental results on document-level translation benchmarks demonstrate significant enhancements in translation performance, underscoring the necessity of dynamic adaptation for context-aware translation and the efficacy of our methodologies.”

pdf bib
TERL: Transformer Enhanced Reinforcement Learning for Relation Extraction
Wang Yashen | Shi Tuo | Ouyang Xiaoye | Guo Dayu

“The Relation Extraction (RE) task aims to discover the semantic relation that holds between two entities and contributes to many applications such as knowledge graph construction and completion. Reinforcement Learning (RL) has been widely used for the RE task and has achieved SOTA results; such methods are mainly designed with rewards to choose the optimal actions during the training procedure, to improve RE's performance, especially under low-resource conditions. Recent work has shown that offline or online RL can be flexibly formulated as a sequence understanding problem and solved via approaches similar to large-scale pre-trained language modeling. To strengthen the ability to understand the interactions of semantic signals within the given text sequence, this paper leverages the Transformer architecture for RL-based RE methods and proposes a generic framework called Transformer Enhanced RL (TERL) for the RE task. Unlike prior RL-based RE approaches that usually fit value functions or compute policy gradients, TERL only outputs the best actions by utilizing a masked Transformer. Experimental results show that the proposed TERL framework can improve many state-of-the-art RL-based RE methods.”

pdf bib
P-MNER: Cross Modal Correction Fusion Network with Prompt Learning for Multimodal Named Entity Recognition
Wang Zhuang | Zhang Yijia | An Kang | Zhou Xiaoying | Lu Mingyu | Lin Hongfei

“Multimodal Named Entity Recognition (MNER) is a challenging task in social media due to the combination of text and image features. Previous MNER work has focused on predicting entity information after fusing visual and text features. However, pre-trained language models have already acquired vast amounts of knowledge during their pre-training process. To leverage this knowledge, we propose a prompt network for MNER tasks (P-MNER). To minimize the noise generated by irrelevant areas in the image, we design a visual feature extraction model (FRR) based on Faster R-CNN and ResNet, which uses fine-grained visual features to assist MNER tasks. Moreover, we introduce a text correction fusion module (TCFM) into the model to address visual bias during modal fusion. We employ the idea of a residual network to modify the fused features using the original text features. Our experiments on two benchmark datasets demonstrate that our proposed model outperforms existing MNER methods. P-MNER's ability to leverage pre-training knowledge from language models, incorporate fine-grained visual features, and correct for visual bias makes it a promising approach for multimodal named entity recognition in social media posts.”

pdf bib
Self Question-answering: Aspect Sentiment Triplet Extraction via a Multi-MRC Framework based on Rethink Mechanism
Zhang Fuyao | Zhang Yijia | Wang Mengyi | Yang Hong | Lu Mingyu | Yang Liang

“The purpose of Aspect Sentiment Triplet Extraction (ASTE) is to extract a triplet, including the target or aspect, its associated sentiment, and the related opinion terms that explain the underlying cause of the sentiment. Some recent studies fail to capture the strong interdependence between ATE and OTE, while others fail to effectively introduce the relationship between aspects and opinions into sentiment classification tasks. To solve these problems, we construct a multi-round machine reading comprehension framework based on a rethink mechanism to solve ASTE tasks efficiently. The rethink mechanism allows the framework to model complex relationships between entities, and exclusive classifiers and a probability generation algorithm can reduce query conflicts and unilateral drops in probability. Besides, the multi-round structure can fuse explicit semantic information flow between aspect, opinion, and sentiment. Extensive experiments show that the proposed model achieves state-of-the-art performance and can be effectively applied to ASTE tasks.”

pdf bib
Enhancing Ontology Knowledge for Domain-Specific Joint Entity and Relation Extraction
Xiong Xiong | Wang Chen | Liu Yunfei | Li Shengyang

“Pre-trained language models (PLMs) have been widely used in entity and relation extraction methods in recent years. However, due to the semantic gap between the general-domain text used for pre-training and domain-specific text, these methods encounter semantic redundancy and domain semantics insufficiency when it comes to domain-specific tasks. To mitigate this issue, we propose a low-cost and effective knowledge-enhanced method to facilitate domain-specific semantics modeling in joint entity and relation extraction. Precisely, we use ontology and entity type descriptions as domain knowledge sources, which are encoded and incorporated into the downstream entity and relation extraction model to improve its understanding of domain-specific information. We construct a dataset called SSUIE-RE for Chinese entity and relation extraction in the space science and utilization domain of China Manned Space Engineering, which contains a wealth of domain-specific knowledge. The experimental results on SSUIE-RE demonstrate the effectiveness of our method, achieving a 1.4% absolute improvement in relation F1 score over the previous best approach.”

pdf bib
Document Information Extraction via Global Tagging
He Shaojie | Wang Tianshu | Lu Yaojie | Lin Hongyu | Han Xianpei | Sun Yingfei | Sun Le

“Document Information Extraction (DIE) is a crucial task for extracting key information from visually-rich documents. The typical pipeline approach for this task involves Optical Character Recognition (OCR), serializer, Semantic Entity Recognition (SER), and Relation Extraction (RE) modules. However, this pipeline presents significant challenges in real-world scenarios due to issues such as unnatural text order and error propagation between different modules. To address these challenges, we propose a novel tagging-based method, Global TaggeR (GTR), which converts the original sequence labeling task into a token relation classification task. This approach globally links discontinuous semantic entities in complex layouts, and jointly extracts entities and relations from documents. In addition, we design a joint training loss and a joint decoding strategy for the SER and RE tasks based on GTR. Our experiments on multiple datasets demonstrate that GTR not only mitigates the issue of text in the wrong order but also improves RE performance.”

pdf bib
A Distantly-Supervised Relation Extraction Method Based on Selective Gate and Noise Correction
Chen Zhuowei | Tian Yujia | Wang Lianxi | Jiang Shengyi

“Entity relation extraction, as a core task of information extraction, aims to predict the relation of entity pairs identified in text, and its research results are applied to various fields. To address the problem that current distantly supervised relation extraction (DSRE) methods based on large-scale corpus annotation generate a large amount of noisy data, a DSRE method that incorporates a selective gate and a noise correction framework is proposed. The selective gate is used to reasonably select the sentence features in the sentence bag, while the noise correction is used to correct the labels of small-class samples that are misclassified into large classes during the model training process, to reduce the negative impact of noisy data on relation extraction. The results on the English datasets clearly demonstrate that our proposed method outperforms other baseline models. Moreover, the experimental results on the Chinese dataset indicate that our method surpasses other models, providing further evidence that our proposed method is both robust and effective.”

pdf bib
Improving Cascade Decoding with Syntax-aware Aggregator and Contrastive Learning for Event Extraction
Sheng Zeyu | Liang Yuanyuan | Lan Yunshi

“The cascade decoding framework has shown superior performance on event extraction tasks. However, it treats a sentence as a sequence and neglects the potential benefits of the syntactic structure of sentences. In this paper, we improve cascade decoding with a novel module and a self-supervised task. Specifically, we propose a syntax-aware aggregator module to model the syntax of a sentence based on the cascade decoding framework, such that it captures event dependencies as well as syntactic information. Moreover, we design a type discrimination task to learn better syntactic representations of different event types, which could further boost the performance of event extraction. Experimental results on two widely used event extraction datasets demonstrate that our method could improve the original cascade decoding framework by up to 2.2 percentage points of F1 score and outperform a number of competitive baseline methods.”

pdf bib
Learnable Conjunction Enhanced Model for Chinese Sentiment Analysis
Zhao Bingfei | Zan Hongying | Wang Jiajia | Han Yingjie

“Sentiment analysis is a crucial text classification task that aims to extract, process, and analyze opinions, sentiments, and subjectivity within texts. In current research on Chinese text, sentence- and aspect-based sentiment analysis is mainly tackled through well-designed models. However, despite the importance of word order and function words as essential means of semantic expression in Chinese, they are often underutilized. This paper presents a new Chinese sentiment analysis method that utilizes a Learnable Conjunctions Enhanced Model (LCEM). The LCEM adjusts the general structure of the pre-trained language model and incorporates conjunction location information into the model's fine-tuning process. Additionally, we discuss a variant structure of residual connections to construct a residual structure that can learn critical information in the text and optimize it during training. We perform experiments on public datasets and demonstrate that our approach enhances performance on both sentence- and aspect-based sentiment analysis datasets compared to the baseline pre-trained language models. These results confirm the effectiveness of our proposed method.”

pdf bib
Improving Affective Event Classification with Multi-Perspective Knowledge Injection
Yi Wenjia | Zhao Yanyan | Yuan Jianhua | Zhao Weixiang | Qin Bing

“In recent years, many researchers have recognized the importance of associating events with sentiments. Previous approaches focus on generalizing events and extracting sentimental information from a large-scale corpus. However, since context is absent and sentiment is often implicit in the event, these methods are limited in comprehending the semantics of the event and capturing effective sentimental clues. In this work, we propose a novel Multi-perspective Knowledge-injected Interaction Network (MKIN) to fully understand the event and accurately predict its sentiment by injecting multi-perspective knowledge. Specifically, we leverage contexts to provide sufficient semantic information and perform context modeling to capture the semantic relationships between events and contexts. Moreover, we also introduce human emotional feedback and sentiment-related concepts to provide explicit sentimental clues from the perspectives of human emotional state and word meaning, filling the reasoning gap in the sentiment prediction process. Experimental results on the gold standard dataset show that our model achieves better performance over the baseline models.”

pdf bib
Enhancing Implicit Sentiment Learning via the Incorporation of Part-of-Speech for Aspect-based Sentiment Analysis
Wang Junlang | Li Xia | He Junyi | Zheng Yongqiang | Ma Junteng

“Implicit sentiment modeling in aspect-based sentiment analysis is a challenging problem due to complex expressions and the lack of opinion words in sentences. Recent efforts focusing on implicit sentiment in ABSA mostly leverage the dependency between aspects and pretrain on extra annotated corpora. We argue that linguistic knowledge can be incorporated into the model to better learn implicit sentiment knowledge. In this paper, we propose a PLM-based, linguistically enhanced framework that incorporates Part-of-Speech (POS) for aspect-based sentiment analysis. Specifically, we design an input template for PLMs that focuses on both aspect-related contextualized features and POS-based linguistic features. By aligning the representations of the tokens with their POS sequences, the introduced knowledge is expected to guide the model in learning implicit sentiment by capturing sentiment-related information. Moreover, we also design an aspect-specific self-supervised contrastive learning strategy to optimize aspect-based contextualized representation construction and assist PLMs in concentrating on target aspects. Experimental results on public benchmarks show that our model can achieve competitive and state-of-the-art performance without introducing extra annotated corpora.”

pdf bib
Case Retrieval for Legal Judgment Prediction in Legal Artificial Intelligence
Zhang Han | Dou Zhicheng

“Legal judgment prediction (LJP) is a basic task in legal artificial intelligence. It consists of three subtasks, relevant law article prediction, charge prediction, and term of penalty prediction, and gives the judgment results to assist the work of judges. In recent years, many deep learning methods have emerged to improve the performance of the legal judgment prediction task. Previous methods mainly improve performance by integrating law articles and the fact description of a legal case. However, they rarely consider that judges usually look up historical cases before making a judgment in real scenarios. To simulate this scenario, we propose a historical case retrieval framework for the legal judgment prediction task. Specifically, we select some historical cases covering all categories from the training dataset. Then, we retrieve the most similar top-k historical cases for the current legal case and use the vector representations of these top-k historical cases to help predict the judgment results. On two real-world legal datasets, our model achieves better results than several state-of-the-art baseline models.”

pdf bib
SentBench: Comprehensive Evaluation of Self-Supervised Sentence Representation with Benchmark Construction
Liu Xiaoming | Lin Hongyu | Han Xianpei | Sun Le

“Self-supervised learning has been widely used to learn effective sentence representations. Previous evaluations of sentence representations mainly focus on a limited combination of tasks and paradigms, failing to evaluate their effectiveness in a wider range of application scenarios. Such divergences prevent us from understanding the limitations of current sentence representations, as well as the connections between learning approaches and downstream applications. In this paper, we propose SentBench, a new comprehensive benchmark to evaluate sentence representations. SentBench covers 12 kinds of tasks and evaluates sentence representations with three types of different downstream application paradigms. Based on SentBench, we re-evaluate several frequently used self-supervised sentence representation learning approaches. Experiments show that SentBench can effectively evaluate sentence representations from multiple perspectives, and the performance on SentBench leads to some novel findings which enlighten future research.”

pdf bib
Adversarial Network with External Knowledge for Zero-Shot Stance Detection
Wang Chunling | Zhang Yijia | Yu Xingyu | Liu Guantong | Chen Fei | Lin Hongfei

“Zero-shot stance detection intends to detect the stances of previously unseen targets in the testing phase. However, achieving this goal can be difficult, as it requires minimizing the domain transfer between different targets and improving the model's inference and generalization abilities. To address this challenge, we propose an adversarial network with external knowledge (ANEK) model. Specifically, we adopt adversarial learning based on pre-trained models to learn transferable knowledge from the source targets, thereby enabling the model to generalize well to unseen targets. Additionally, we incorporate sentiment information and common sense knowledge into the contextual representation to further enhance the model's understanding. Experimental results on several datasets reveal that our method achieves excellent performance, demonstrating its validity and feasibility.”

pdf bib
The Contextualized Representation of Collocation
Liu Daohuan | Tang Xuri

“The collocate list and the collocation network are two widely used representation methods for collocations, but they have significant weaknesses in representing contextual information. To solve this problem, we propose a new representation method, namely the contextualized representation of collocate (CRC), which highlights the importance of the position of the collocates and pins a collocate as the interaction of two dimensions: association strength and co-occurrence position. With a full image of all the collocates surrounding the node word, CRC carries the contextual information and makes the representation more informative and intuitive. Through three case studies, i.e., synonym distinction, image analysis, and efficiency in lexical use, we demonstrate the advantages of CRC in practical applications. CRC is also a new quantitative tool to measure lexical usage pattern similarities for corpus-based research. It can provide a new representation framework for language researchers and learners.”

pdf bib
Training NLI Models Through Universal Adversarial Attack
Lin Jieyu | Liu Wei | Zou Jiajie | Ding Nai

“Pre-trained language models are sensitive to adversarial attacks, and recent works have demonstrated universal adversarial attacks that can apply input-agnostic perturbations to mislead models. Here, we demonstrate that universal adversarial attacks can also be used to harden NLP models. Based on the NLI task, we propose a simple universal adversarial attack that can mislead models into producing the same output for all premises by replacing the original hypothesis with an irrelevant string of words. To defend against this attack, we propose Training with UNiversal Adversarial Samples (TUNAS), which iteratively generates universal adversarial samples and utilizes them for fine-tuning. The method is tested on two datasets, i.e., MNLI and SNLI. It is demonstrated that TUNAS can reduce the mean success rate of the universal adversarial attack from above 79% to below 5%, while maintaining similar performance on the original datasets. Furthermore, TUNAS models are also more robust to attacks targeting individual samples: when searching for hypotheses that are best entailed by a premise, the hypotheses found by TUNAS models are more compatible with the premise than those found by baseline models. In sum, we use universal adversarial attack to yield more robust models.”
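The attack side of TUNAS can be pictured with a minimal sketch: attach one input-agnostic nonsense hypothesis to every premise. The vocabulary, premises, and the labeling policy noted in the comments are assumptions for illustration.

```python
# Sketch: construct universal adversarial NLI samples (input-agnostic hypothesis).
import random

def universal_adversarial_samples(premises, vocab, length=8, seed=0):
    rng = random.Random(seed)
    hypothesis = " ".join(rng.choice(vocab) for _ in range(length))  # input-agnostic
    # The same nonsense hypothesis is attached to all premises; a brittle
    # model tends to assign such pairs one fixed label (the attack succeeds).
    return [(p, hypothesis) for p in premises]

premises = ["A man is playing a guitar.", "Two dogs run on the beach."]
vocab = ["alike", "notion", "quietly", "seven", "harbor", "else"]
for premise, hyp in universal_adversarial_samples(premises, vocab):
    print(premise, "=>", hyp)

# TUNAS-style hardening would label such pairs (e.g., as non-entailment),
# fine-tune on them, then regenerate samples and repeat.
```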

pdf bib
MCLS: A Large-Scale Multimodal Cross-Lingual Summarization Dataset
Shi Xiaorui

“Multimodal summarization, which aims to generate summaries from multimodal inputs, e.g., text and visual features, has attracted much attention in the research community. However, previous studies only focus on monolingual multimodal summarization and neglect non-native readers who need to understand cross-lingual news in practical applications. This inspires us to present a new task, named Multimodal Cross-Lingual Summarization for news (MCLS), which generates cross-lingual summaries from multi-source information. To this end, we present a large-scale multimodal cross-lingual summarization dataset, which consists of 1.1 million article-summary pairs with 3.4 million images in 44 * 43 language pairs. To generate a summary in any language, we propose a unified framework that jointly trains the multimodal monolingual and cross-lingual summarization tasks, where a bi-directional knowledge distillation approach is designed to transfer knowledge between both tasks. Extensive experiments on many-to-many settings show the effectiveness of the proposed model.”

pdf bib
CHED: A Cross-Historical Dataset with a Logical Event Schema for Classical Chinese Event Detection
Wei Congcong | Feng Zhenbing | Huang Shutan | Li Wei | Shao Yanqiu

“Event detection (ED) is a crucial area of natural language processing that automates the extraction of specific event types from large-scale text, and studying historical ED in classical Chinese texts helps preserve and inherit historical and cultural heritage by extracting valuable information. However, classical Chinese language characteristics, such as ambiguous word classes and complex semantics, have posed challenges and led to a lack of datasets and limited research on event schema construction. In addition, large-scale datasets in English and modern Chinese are not directly applicable to historical ED in classical Chinese. To address these issues, we constructed a logical event schema for classical Chinese historical texts and annotated the resulting dataset, which is called the classical Chinese Historical Event Dataset (CHED). The main challenges in our work on classical Chinese historical ED are accurately identifying and classifying events within cultural and linguistic contexts and addressing the ambiguity resulting from multiple meanings of words in historical texts. Therefore, we have developed a set of annotation guidelines and provided annotators with an objective reference translation. The average Kappa coefficient after multiple cross-validation is 68.49%, indicating high quality and consistency. We conducted various tasks and comparative experiments on established baseline models for historical ED in classical Chinese. The results showed that BERT+CRF had the best performance on the sequence labeling task, with an F1-score of 76.10%, indicating potential for further improvement.”

pdf bib
Revisiting k-NN for Fine-tuning Pre-trained Language Models
Li Lei | Chen Jing | Tian Botzhong | Zhang Ningyu

“Pre-trained Language Models (PLMs), as parametric-based eager learners, have become the de-facto choice for current paradigms of Natural Language Processing (NLP). In contrast, k-Nearest-Neighbor (k-NN) classifiers, as the lazy learning paradigm, tend to mitigate over-fitting and isolated noise. In this paper, we revisit k-NN classifiers for augmenting the PLM-based classifiers. From the methodological level, we propose to adopt k-NN with textual representations of PLMs in two steps: (1) utilize k-NN as prior knowledge to calibrate the training process; (2) linearly interpolate the probability distribution predicted by k-NN with that of the PLM classifier. At the heart of our approach is the implementation of k-NN-calibrated training, which treats predicted results as indicators of easy versus hard examples during the training process. From the perspective of the diversity of application scenarios, we conduct extensive experiments on fine-tuning and prompt-tuning paradigms and zero-shot, few-shot, and fully-supervised settings, respectively, across eight diverse end-tasks. We hope our exploration will encourage the community to revisit the power of classical methods for efficient NLP.”
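Step (2) of the approach admits a compact sketch: interpolate the k-NN neighbors' empirical label distribution with the PLM's softmax output. The probabilities, neighbor labels, and weight lam below are toy values.

```python
# Sketch: linear interpolation of k-NN and PLM-classifier distributions.
import numpy as np

def knn_distribution(neighbor_labels, n_classes):
    """Empirical label distribution over the k retrieved neighbors."""
    counts = np.bincount(neighbor_labels, minlength=n_classes)
    return counts / counts.sum()

plm_probs = np.array([0.70, 0.20, 0.10])      # stand-in PLM softmax output
neighbors = np.array([1, 1, 0, 1, 2])          # labels of k=5 nearest neighbors
knn_probs = knn_distribution(neighbors, n_classes=3)

lam = 0.3                                      # interpolation weight (tunable)
final = lam * knn_probs + (1 - lam) * plm_probs
print(final, final.argmax())                   # k-NN evidence can flip the call
```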

pdf bib
Adder Encoder for Pre-trained Language Model
Ding Jianbang | Zhang Suiyun | Li Linlin

“BERT, a pre-trained language model entirely based on attention, has proven to be highly performant for many natural language understanding tasks. However, pre-trained language models (PLMs) are often computationally expensive and can hardly be implemented with limited resources. To reduce the energy burden, we introduce adder operations into the Transformer encoder and propose a novel AdderBERT with powerful representation capability. Moreover, we adopt mapping-based distillation to further improve its energy efficiency with an assured performance. Empirical results demonstrate that AdderBERT6 achieves highly competitive performance against that of its teacher BERTBASE on the GLUE benchmark while obtaining a 4.9x reduction in energy consumption.”
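A hedged sketch of the adder idea as we read it (AdderNet-style negative L1 distance in place of the dot product for attention scores); the shapes and the scaling choice are illustrative assumptions, not AdderBERT's exact design.

```python
# Sketch: attention scores from additions only (negative L1 distance).
import torch

def adder_attention_scores(q, k):
    """q: (n, d), k: (m, d) -> (n, m) similarity = -sum_d |q_i - k_j|."""
    return -(q.unsqueeze(1) - k.unsqueeze(0)).abs().sum(-1)

q = torch.randn(4, 16)
k = torch.randn(6, 16)
attn = torch.softmax(adder_attention_scores(q, k) / 16 ** 0.5, dim=-1)
v = torch.randn(6, 16)
out = attn @ v   # (the value aggregation here still uses multiplication)
print(out.shape)
```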

pdf bib
FinBART: A Pre-trained Seq2seq Language Model for Chinese Financial Tasks
Dong Hongyuan | Che Wanxiang | He Xiaoyu | Zheng Guidong | Wen Junjie

“Pretrained language models are making a more profound impact on our lives than ever before. They exhibit promising performance on a variety of general-domain Natural Language Processing (NLP) tasks. However, little work focuses on Chinese financial NLP tasks, which comprise a significant portion of social communication. To this end, we propose FinBART, a pretrained seq2seq language model for Chinese financial communication tasks. Experiments show that FinBART outperforms baseline models on a series of downstream tasks including text classification, sequence labeling, and text generation. We further pretrain the model on customer service corpora, and results show that our model outperforms baseline models and achieves promising performance on various real-world customer service text mining tasks.”

pdf bib
Exploring Accurate and Generic Simile Knowledge from Pre-trained Language Models
Zhou Shuhan | Ma Longxuan | Shao Yanqiu

“A simile is an important linguistic phenomenon in daily communication and an important task in natural language processing (NLP). In recent years, pre-trained language models (PLMs) have achieved great success in NLP since they learn generic knowledge from a large corpus. However, PLMs still have hallucination problems in that they can generate unrealistic or context-unrelated information. In this paper, we aim to explore more accurate simile knowledge from PLMs. To this end, we first fine-tune a single model to perform three main simile tasks (recognition, interpretation, and generation). In this way, the model gains a better understanding of simile knowledge. However, this understanding may be limited by the distribution of the training data. To explore more generic simile knowledge from PLMs, we further add semantic dependency features to the three tasks. The semantic dependency feature serves as a global signal and helps the model learn simile knowledge that can be applied to unseen domains. We test on seen and unseen domains after training. Automatic evaluations demonstrate that our method helps the PLMs explore more accurate and generic simile knowledge for downstream tasks. Our method of exploring more accurate knowledge is not only useful for simile study but also useful for other NLP tasks leveraging knowledge from PLMs. Our code and data will be released on GitHub.”