古汉语词义标注语料库的构建及应用研究(The Construction and Application of Ancient Chinese Corpus with Word Sense Annotation)

Lei Shu (舒蕾), Yiluan Guo (郭懿鸾), Huiping Wang (王慧萍), Xuetao Zhang (张学涛), Renfen Hu (胡韧奋)


Abstract
古汉语以单音节词为主,其一词多义现象十分突出,这为现代人理解古文含义带来了一定的挑战。为了更好地实现古汉语词义的分析和判别,本研究基于传统辞书和语料库反映的语言事实,设计了针对古汉语多义词的词义划分原则,并对常用古汉语单音节词进行词义级别的知识整理,据此对包含多义词的语料开展词义标注。现有的语料库包含3.87万条标注数据,规模超过117.6万字,丰富了古代汉语领域的语言资源。实验显示,基于该语料库和BERT语言模型,词义判别算法准确率达到80%左右。进一步地,本文以词义历时演变分析和义族归纳为案例,初步探索了语料库与词义消歧技术在语言本体研究和词典编撰等领域的应用。
Anthology ID:
2021.ccl-1.50
Volume:
Proceedings of the 20th Chinese National Conference on Computational Linguistics
Month:
August
Year:
2021
Address:
Huhhot, China
Editors:
Sheng Li (李生), Maosong Sun (孙茂松), Yang Liu (刘洋), Hua Wu (吴华), Kang Liu (刘康), Wanxiang Che (车万翔), Shizhu He (何世柱), Gaoqi Rao (饶高琦)
Venue:
CCL
SIG:
Publisher:
Chinese Information Processing Society of China
Note:
Pages:
549–563
Language:
Chinese
URL:
https://aclanthology.org/2021.ccl-1.50
DOI:
Bibkey:
Cite (ACL):
Lei Shu, Yiluan Guo, Huiping Wang, Xuetao Zhang, and Renfen Hu. 2021. 古汉语词义标注语料库的构建及应用研究(The Construction and Application of Ancient Chinese Corpus with Word Sense Annotation). In Proceedings of the 20th Chinese National Conference on Computational Linguistics, pages 549–563, Huhhot, China. Chinese Information Processing Society of China.
Cite (Informal):
古汉语词义标注语料库的构建及应用研究(The Construction and Application of Ancient Chinese Corpus with Word Sense Annotation) (Shu et al., CCL 2021)
Copy Citation:
PDF:
https://aclanthology.org/2021.ccl-1.50.pdf