2024
MTA4DPR: Multi-Teaching-Assistants Based Iterative Knowledge Distillation for Dense Passage Retrieval
Qixi Lu | Endong Xun | Gongbo Tang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Although Dense Passage Retrieval (DPR) models have achieved significantly enhanced performance, their widespread application is still hindered by demanding inference-efficiency requirements and high deployment costs. Knowledge distillation is an efficient method for compressing models: it transfers knowledge from strong teacher models to weak student models. Previous studies have demonstrated the effectiveness of knowledge distillation in DPR. However, a significant performance gap often remains between the teacher and the distilled student. To narrow this gap, we propose MTA4DPR, a Multi-Teaching-Assistants based iterative knowledge distillation method for Dense Passage Retrieval, which transfers knowledge from the teacher to the student with the help of multiple assistants in an iterative manner; with each iteration, the student learns from more performant assistants and more difficult data. The experimental results show that our 66M student model achieves state-of-the-art performance among models with the same number of parameters on multiple datasets, and is very competitive when compared with larger, even LLM-based, DPR models.
2021
基于结构树库的状位动词语义分类及搭配库构建(Semantic Classification of Adverbial Verbs Based on Structure Tree Database and Construction of Collocation Database)
Tian Shao (邵田) | Shiquan Zhai (翟世权) | Gaoqi Rao (饶高琦) | Endong Xun (荀恩东)
Proceedings of the 20th Chinese National Conference on Computational Linguistics
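A clause usually contains only one verb, but two verbs can also occur in the same clause. When two verbs appear consecutively in one clause, they may syntactically form adverbial-head, predicate-complement, verb-object, serial-verb, or coordinate structures, and semantically express relations such as modification, government, or coordination. Two consecutive verbs thus form relatively complex structural and semantic relations. Especially when there are no formal markers, automatically identifying the structure of consecutive verbs and the semantic relations they express is a rather difficult problem for syntactic-semantic analysis in practical application. To address this, this paper focuses on verbs that serve directly as adverbials. We extract instances of two consecutive verbs from a large-scale structure treebank, disambiguate the data, extract the verbs that serve as adverbials, classify them into fine-grained semantic classes, and finally build a corresponding semantic collocation database. The work provides a classification reference for theoretical linguistics and additional knowledge for deep syntactic and semantic analysis of Chinese.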
基于结构检索的汉语介动搭配知识库构建(Construction of Preposition-verb Knowledge Base Based on Structure Retrieval)
Chengwen Wang (王诚文) | Gaoqi Rao (饶高琦) | Endong Xun (荀恩东)
Proceedings of the 20th Chinese National Conference on Computational Linguistics
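Previous work on building preposition knowledge bases has emphasized preposition semantics and preposition-object collocations, while few studies have systematically investigated preposition-verb collocations or acquired knowledge about them. Yet Chinese has a rich preposition system and the verb is the center of the sentence, which makes the study of preposition-verb collocations important. Based on structure retrieval technology, and making full use of phrase-structure attributes and structural information, this study extracts 16,033 preposition-verb collocation pairs from a large-scale corpus. It also proposes a measure of the tightness of preposition-verb collocations; preliminary analysis shows that this measure is far better than measuring collocations by absolute frequency.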
2020
基于组块分析的汉语块依存语法(Chinese Chunk-Based Dependency Grammar)
Qingqing Qian (钱青青) | Chengwen Wang (王诚文) | Gaoqi Rao (饶高琦) | Endong Xun (荀恩东)
Proceedings of the 19th Chinese National Conference on Computational Linguistics
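Classic word-based dependency grammar encounters many difficulties in Chinese sentence analysis that stem from characteristics of the Chinese language. This paper therefore proposes a chunk-based dependency grammar for Chinese, which takes the predicate as the core and chunks as the objects of study, finds the chunks governed by each predicate within and across sentences, and builds a syntactic analysis framework at the sentence-group level. This approach does more than raise the linguistic unit of the leaf nodes: it also introduces innovations in analysis methods and rules tailored to the semantic characteristics of Chinese, so that it handles micro-level logical-structure knowledge well and lays the groundwork for meso-level argument knowledge and macro-level discourse knowledge. The paper mainly introduces the concept, representation, analysis method, and characteristics of chunk-based dependency grammar, and briefly describes the construction of the chunk dependency treebank. To date, the treebank contains 1.87 million characters (more than 40,000 complex sentences and 100,000 clauses), of which 67% is news text and 32% is encyclopedia text.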
Proceedings of the 6th Workshop on Natural Language Processing Techniques for Educational Applications
Erhong YANG | Endong XUN | Baolin ZHANG | Gaoqi RAO
Proceedings of the 6th Workshop on Natural Language Processing Techniques for Educational Applications
2018
Distributed Representation of Chinese Collocation
Bo Xia | Endong Xun
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation
Overview of NLPTEA-2018 Share Task Chinese Grammatical Error Diagnosis
Gaoqi Rao | Qi Gong | Baolin Zhang | Endong Xun
Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications
This paper presents the NLPTEA 2018 shared task for Chinese Grammatical Error Diagnosis (CGED), which seeks to identify grammatical error types, their range of occurrence, and recommended corrections within sentences written by learners of Chinese as a foreign language. We describe the task definition, data preparation, performance metrics, and evaluation results. Of the 20 teams registered for this shared task, 13 teams developed systems and submitted a total of 32 runs. Progress in system performance was evident, reaching an F1 of 36.12% at the position level and 25.27% at the correction level. All data sets with gold standards and scoring scripts are made publicly available to researchers.
2017
IJCNLP-2017 Task 1: Chinese Grammatical Error Diagnosis
Gaoqi Rao | Baolin Zhang | Endong Xun | Lung-Hao Lee
Proceedings of the IJCNLP 2017, Shared Tasks
This paper presents the IJCNLP 2017 shared task for Chinese grammatical error diagnosis (CGED), which seeks to identify grammatical error types and their range of occurrence within sentences written by learners of Chinese as a foreign language. We describe the task definition, data preparation, performance metrics, and evaluation results. Of the 13 teams registered for this shared task, 5 teams developed systems and submitted a total of 13 runs. We expect that this evaluation campaign could lead to the development of more advanced NLP techniques for educational applications, especially for Chinese error detection. All data sets with gold standards and scoring scripts are made publicly available to researchers.
2016
Overview of NLP-TEA 2016 Shared Task for Chinese Grammatical Error Diagnosis
Lung-Hao Lee | Gaoqi Rao | Liang-Chih Yu | Endong Xun | Baolin Zhang | Li-Ping Chang
Proceedings of the 3rd Workshop on Natural Language Processing Techniques for Educational Applications (NLPTEA2016)
This paper presents the NLP-TEA 2016 shared task for Chinese grammatical error diagnosis, which seeks to identify grammatical error types and their range of occurrence within sentences written by learners of Chinese as a foreign language. We describe the task definition, data preparation, performance metrics, and evaluation results. Of the 15 teams registered for this shared task, 9 teams developed systems and submitted a total of 36 runs. We expect that this evaluation campaign could lead to the development of more advanced NLP techniques for educational applications, especially for Chinese error detection. All data sets with gold standards and scoring scripts are made publicly available to researchers.
2000
A Unified Statistical Model for the Identification of English BaseNP
Endong Xun | Changning Huang | Ming Zhou
Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics
PENS: A Machine-aided English Writing System for Chinese Users
Ting Liu | Ming Zhou | Jianfeng Gao | Endong Xun | Changning Huang
Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics