Mingwen Wang (王明文)

Also published as: Ming-Wei Wang, MingWen Wang, 明文王

2024

基于两阶段提示学习的少样本命名实体识别(Two-Stage Prompt Learning for Few-Shot Named Entity Recognition)
Jiaxing Shao (邵佳兴) | Qi Huang (黄琪) | Cong Xiao (肖聪) | Jing Liu (刘璟) | Wenbing Luo (罗文兵) | Mingwen Wang (王明文)
Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 1: Main Conference)

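“Few-shot named entity recognition (NER) aims to identify named entities using only a small amount of labeled data. Inspired by the strong performance of prompt learning in few-shot scenarios in recent years, this paper explores prompt-based methods for few-shot NER. Existing prompt-learning methods recognize entities by enumerating all possible spans, which leads to high computational cost and underuse of entity boundary information. This paper proposes TSP-Few, a two-stage framework based on prompt learning that performs few-shot NER without using source-domain data. The first stage enhances, filters, and expands seed spans: the seed enhancement module lets seed spans capture richer semantic information, the seed filter removes a large number of irrelevant spans, and the seed expansion module makes full use of entity boundary information, providing high-quality candidate entity spans for entity type classification. The second stage uses prompt learning to predict the category of each candidate span. In addition, to mitigate error accumulation from the span detection stage, a negative sampling strategy is introduced in the entity classification stage. Training span detection and entity type classification independently makes it easier to achieve strong performance in few-shot settings. Experiments on three benchmark datasets show that the proposed method further improves performance over advanced methods, and the results also demonstrate the effectiveness of each module of the model.”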

基于文本风格迁移的中文性别歧视文本去毒研究(Research on detoxification of Chinese sexist texts based on text style transfer)
Jian Peng (彭健) | Jiali Zuo (左家莉) | Jingxuan Tan (谭景璇) | Jianyi Wan (万剑怡) | Mingwen Wang (王明文)
Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 1: Main Conference)

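“Online social media platforms contain a certain amount of sexist speech, which hinders the healthy development of the internet and of social civility. Text style transfer can mitigate sexism in text and has been studied considerably for English and other languages. In Chinese, however, related research is scarce because of a lack of datasets. Moreover, because Chinese is semantically rich and diverse in expression, the toxicity of sexist speech takes many forms, and existing methods, which mostly rely on a single text style transfer model, perform poorly. This paper therefore proposes a framework for detoxifying Chinese sexist text based on text style transfer. The framework first classifies texts by the form their toxicity takes and then applies a different treatment to each form; we also use a large language model (LLM) to build a lexicon of discriminatory words. Experiments show that the proposed model effectively handles sexism in Chinese text.”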

PPDAC: A Plug-and-Play Data Augmentation Component for Few-shot Extractive Question Answering
Qi Huang | Han Fu | Wenbin Luo | Mingwen Wang | Kaiwei Luo
Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 1: Main Conference)

“Extractive Question Answering (EQA) in the few-shot learning scenario is one of the most challenging tasks of Machine Reading Comprehension (MRC). Some previous works employ external knowledge for data augmentation to improve the performance of few-shot extractive question answering. However, external knowledge is not always available, nor are the language- and domain-specific NLP tools needed to process it, such as part-of-speech taggers, syntactic parsers, and named-entity recognizers. In this paper, we present a novel Plug-and-Play Data Augmentation Component (PPDAC) for few-shot extractive question answering, which includes a paraphrase generator and a paraphrase selector. Specifically, we generate multiple paraphrases of the question in the (question, passage, answer) triples using the paraphrase generator and then obtain highly similar statements via the paraphrase selector to form more training data for fine-tuning. Extensive experiments on multiple EQA datasets show that our proposed plug-and-play data augmentation component significantly improves question-answering performance, and consistently outperforms state-of-the-art approaches in few-shot settings by a large margin.”

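基于增量预训练与外部知识的古文历史事件检测(Historical Event Detection in Ancient Chinese Texts Based on Incremental Pre-training and External Knowledge)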
Wenjun Kang (康文军) | Jiali Zuo (左家莉) | Yiyu Hu (胡益裕) | Mingwen Wang (王明文)
Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 3: Evaluations)

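“The ancient Chinese historical event detection task aims to identify event triggers and event types in text. To address the cascading error propagation of traditional pipeline methods, and the fact that most event detection methods rely only on sentence-level information, this paper proposes EIGC, a joint extraction model that combines external information with a global correspondence matrix to extract triggers and event types accurately. In addition, this paper compiles a dataset of ancient Chinese literature including the ‘Twenty-Four Histories’ (二十四史), totaling about 970,000 ancient Chinese texts, and uses it for incremental pre-training of BERT-Ancient-Chinese. The proposed model achieves an overall F1 score of 76.2% across the three tasks, verifying the effectiveness of the method.”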

2023

融合词典信息的古籍命名实体识别研究(A Study on the Recognition of Named Entities of Ancient Books Using Lexical Information)
Wenjun Kang (康文军) | Jiali Zuo (左家莉) | Anquan Jie (揭安全) | Wenbin Luo (罗文兵) | Mingwen Wang (王明文)
Proceedings of the 22nd Chinese National Conference on Computational Linguistics

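“Named entity recognition for ancient books is of considerable practical significance for building entity knowledge bases and corpora of ancient books. There is currently little research on this task, mainly because of a lack of sufficient training corpora. Starting from Zizhi Tongjian (《资治通鉴》), this paper manually constructs a named entity recognition dataset for ancient books and uses it to study the task. Because ancient texts mostly express meaning with single characters and contain many omissions, this paper uses pre-trained word vectors as lexicon information to make full use of the lexical information they contain. Experiments show that this approach can effectively handle the recognition of person-name entities in ancient texts.”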

结合全局对应矩阵和相对位置信息的古汉语实体关系联合抽取(Joint Extraction of Ancient Chinese Entity Relations by Combining Global Correspondence Matrix and Relative Position Information)
Yiyu Hu (胡益裕) | Jiali Zuo (左家莉) | Xueqiang Ceng (曾雪强) | Zhongying Wan (万中英) | Mingwen Wang (王明文)
Proceedings of the 22nd Chinese National Conference on Computational Linguistics

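“Entity relation extraction is an important task in information extraction. Current work focuses mainly on English and modern Chinese, while dataset construction and methods for ancient Chinese remain little studied. To address this, after studying the open-source Zizhi Tongjian (《资治通鉴》) corpus, this paper manually constructs an ancient Chinese entity relation dataset and designs a joint entity relation extraction method that combines a global correspondence matrix with relative position information. Experiments on the constructed dataset demonstrate the effectiveness of the method for ancient Chinese entity relation extraction.”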

Rumors spread rapidly through online social microblogs at a relatively low cost, causing substantial economic losses and negative consequences in our daily lives. Existing rumor detection models often neglect the underlying semantic coherence between text and image components in multimodal posts, as well as the challenges posed by incomplete modalities in single modal posts, such as missing text or images. This paper presents CLKD-IMRD, a novel framework for Incomplete Modality Rumor Detection. CLKD-IMRD employs Contrastive Learning and Knowledge Distillation to capture the semantic consistency between text and image pairs, while also enhancing model generalization to incomplete modalities within individual posts. Extensive experimental results demonstrate that our CLKD-IMRD outperforms state-of-the-art methods on two English and two Chinese benchmark datasets for rumor detection in social media.

2021

融合XLM词语表示的神经机器译文自动评价方法(Neural Automatic Evaluation of Machine Translation Method Combined with XLM Word Representation)
Wei Hu (胡纬) | Maoxi Li (李茂西) | Bailian Qiu (裘白莲) | Mingwen Wang (王明文)
Proceedings of the 20th Chinese National Conference on Computational Linguistics

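Automatic evaluation of machine translation plays an important role in promoting the development and application of machine translation; it generally measures the quality of a machine translation by computing its similarity to a human reference translation. This paper uses the cross-lingual pre-trained language model XLM to map the source sentence, the machine translation, and the human reference translation into the same semantic space. It combines hierarchical attention and intra-attention to extract difference features between the source sentence and the machine translation, between the machine translation and the human reference, and between the source sentence and the human reference, and incorporates these features into a Bi-LSTM-based neural automatic evaluation method. Experiments on the WMT’19 automatic evaluation dataset show that the method combining XLM word representations significantly improves correlation with human evaluation.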

基于自动识别的委婉语历时性发展变化与社会共变研究(A Study on the Diachronic Development and Social Covariance of Euphemism Based on Automatic Recognition)
Chenlin Zhang (张辰麟) | Mingwen Wang (王明文) | Yiming Tan (谭亦鸣) | Ming Yin (尹明) | Xinyi Zhang (张心怡)
Proceedings of the 20th Chinese National Conference on Computational Linguistics

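This paper takes Chinese euphemisms as its object of study. Based on a large amount of manual annotation and supervised machine learning classification, it achieves relatively high-accuracy automatic recognition of euphemisms, and on this basis performs a quantitative statistical analysis of diachronic change in euphemisms in People's Daily (《人民日报》) from 1946 to 2017. From the perspective of large-scale data, it examines the diachronic development of euphemisms and the covariation between euphemisms and society, and verifies Gresham's law and the law of renewal in language.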

2020

“细粒度英汉机器翻译错误分析语料库”的构建与思考(Construction of Fine-Grained Error Analysis Corpus of English-Chinese Machine Translation and Its Implications)
Bailian Qiu (裘白莲) | Mingwen Wang (王明文) | Maoxi Li (李茂西) | Cong Chen (陈聪) | Fan Xu (徐凡)
Proceedings of the 19th Chinese National Conference on Computational Linguistics

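Machine translation error analysis aims to identify the errors in machine translation output, including error types and error distributions, and plays an important role in machine translation research and application. This paper combines human post-editing with error analysis by annotating post-editing operations with errors, and uses a combination of automatic and manual annotation to build a fine-grained English-Chinese machine translation error analysis corpus. Each annotated sample includes the source sentence, the machine translation, the human reference translation, the post-edited translation, the word error rate, and the error type annotations. The annotated error types include word addition, word omission, wrong word, word order error, untranslated content, and named entity translation error, among others. Annotation consistency checks indicate the validity of the annotation, and statistical analysis of the annotated corpus can effectively guide both the development of machine translation systems and post-editing by human translators.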