基于词向量的自适应领域术语抽取方法(An Adaptive Domain-Specific Terminology Extraction Approach Based on Word Embedding)

Xi Tang (唐溪), Dongchen Jiang (蒋东辰), Aoyuan Jiang (蒋翱远)


Abstract
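Term distributions are long-tailed. To extract low-frequency terms effectively, this paper proposes an adaptive terminology extraction method based on word embeddings. The method uses a hypothesis-testing-based statistical procedure to determine the filtering threshold adaptively, and obtains candidate terms by progressively merging strongly associated strings in the text, which avoids missing low-frequency terms because of a fixed threshold. Next, a masked language model is used to obtain word embeddings for out-of-vocabulary candidate terms, and a density-based clustering algorithm that incorporates dictionary knowledge assigns each candidate to a domain cluster; candidates that fall in the target-domain cluster are identified as domain terms. Experimental results show that the method not only outperforms the comparison methods in evaluation score but is also more stable across texts of different genres. The method extracts low-frequency terms comprehensively and effectively, achieving high-quality domain terminology extraction.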
Anthology ID:
2023.ccl-1.17
Volume:
Proceedings of the 22nd Chinese National Conference on Computational Linguistics
Month:
August
Year:
2023
Address:
Harbin, China
Editors:
Maosong Sun, Bing Qin, Xipeng Qiu, Jing Jiang, Xianpei Han
Venue:
CCL
Publisher:
Chinese Information Processing Society of China
Pages:
186–195
Language:
Chinese
URL:
https://aclanthology.org/2023.ccl-1.17
Cite (ACL):
Xi Tang, Dongchen Jiang, and Aoyuan Jiang. 2023. 基于词向量的自适应领域术语抽取方法(An Adaptive Domain-Specific Terminology Extraction Approach Based on Word Embedding). In Proceedings of the 22nd Chinese National Conference on Computational Linguistics, pages 186–195, Harbin, China. Chinese Information Processing Society of China.
Cite (Informal):
基于词向量的自适应领域术语抽取方法(An Adaptive Domain-Specific Terminology Extraction Approach Based on Word Embedding) (Tang et al., CCL 2023)
PDF:
https://aclanthology.org/2023.ccl-1.17.pdf