Gaoqi Rao

Also published as: 高琦饶, Gaoqi RAO

2024

意合图:中文多层次语义表示方法 (Parataxis Graph: Multi-level Semantic Representation Method for Chinese)
Mengxi Guo (郭梦溪) | Endong Xun (荀恩东) | Meng Li (李梦) | Gaoqi Rao (饶高琦)
Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 1: Main Conference)

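Although parameter-based semantic representation has achieved notable success, symbolic semantic representation still has significance that cannot be ignored. Building on semantics, and taking full account of what symbolic semantic representation needs in order to be deployed in NLP, we propose a multi-level semantic representation method that is both general and extensible: the Parataxis Graph. The Parataxis Graph is event-centered and consists of an event structure and an entity structure. Through its multi-level semantic system design, it improves its ability to combine with application scenarios and aims at a consistent representation of linguistic units at different levels. It achieved good results in resource construction and related analysis experiments. This paper focuses on the design philosophy of the Parataxis Graph and its multi-level semantic system.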

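The Chinese Parataxis Graph is a semantic representation method for Chinese proposed in recent years. This is the first semantic analysis evaluation based on Parataxis Graph theory; it aims to explore semantic computation methods oriented toward the theory and to assess the semantic analysis ability of machines. Fourteen teams registered and seven ultimately submitted results; five of them submitted technical reports and models, all of which were successfully reproduced. Within the evaluation deadline, the best-performing team achieved an F1 score of 72.06% using LoRA fine-tuning of a large language model. Of the five teams that submitted technical reports, four used large language model fine-tuning, which to some extent indicates the current trend in technical development.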

基于意合图语义理论的结构标注体系与资源建设 (System and Resource Construction Based on the Semantic Theory of Chinese-Parataxis-Graph)
Mengxi Guo (郭梦溪) | Meng Li (李梦) | Endong Xun (荀恩东) | Gaoqi Rao (饶高琦) | Zhongyang Yu (于钟洋)
Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 1: Main Conference)

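The Parataxis Graph is an event-centered, multi-level semantic representation method consisting of an event structure and an entity structure; its multi-level semantic system design enables multi-level analysis of events. This paper refines and formulates the Parataxis Graph annotation specification and, using a layered and graded annotation strategy, carries out Parataxis Graph QNP annotation of news corpora and reading corpora for international Chinese education in a self-developed online annotation system. The annotation work verified the soundness and annotatability of the Parataxis Graph system and produced a Parataxis Graph semantic resource library.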

2023

基于结构树库的补语位形容词语义分析及搭配库构建 (Semantic analysis of complementary adjectives and construction of collocation database based on structural tree library)
Siyu Tian (思雨田) | Tian Shao (邵田) | Endong Xun (荀恩东) | Gaoqi Rao (饶高琦)
Proceedings of the 22nd Chinese National Conference on Computational Linguistics

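Bound verb-complement constructions in which an adjective serves as the complement usually take the form of two consecutive predicative components ("adjective + adjective" or "verb + adjective"). Because this construction has no formal marker, it is quite difficult for computers to identify automatically. Moreover, serving as a complement is not the most basic or typical use of adjectives (which is as attributive or predicate), and it has not received enough attention in either linguistics or computational linguistics. This paper therefore takes complement-position adjectives as its object of study. It extracts verb-complement constructions in which an adjective directly serves as the complement from a large-scale syntactic structure treebank, denoises the corpus through programmatic and manual checking, exhaustively retrieves complement-position adjectives to obtain a word list, further subdivides their semantics, and builds a corresponding semantic collocation database. This can improve the accuracy of syntactic segmentation and provide semantic information for deep syntactic-semantic analysis, and it can also serve as a reference for related research in linguistics proper.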

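The Chinese Learner Text Correction evaluation is a technical evaluation held in conjunction with the 22nd China National Conference on Computational Linguistics. Targeting text written by learners of Chinese, it sets up two tracks: multi-dimensional Chinese learner text correction and Chinese grammatical error detection. Given the continuing progress of artificial intelligence, each track has an open task and a closed task; the open task allows the use of large models. The evaluation dataset is built on YACLC, a multi-dimensionally annotated corpus of Chinese learner text. An evaluation standard based on multiple reference answers is established and a baseline evaluation framework is constructed, to further advance research on Chinese learner text correction. A total of 38 teams registered, of which 5 achieved excellent results and submitted technical reports.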

2022

基于《同义词词林》的中文语体分类资源构建(Construction of Chinese register classification resources based on “Tongyici Cilin”)
Guojing Huang (黄国敬) | Liwei Zhou (周立炜) | Gaoqi Rao (饶高琦) | Jiaojiao Zang (臧娇娇)
Proceedings of the 21st Chinese National Conference on Computational Linguistics

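Register words are words used specifically in a given register; they are the linguistic elements and formal markers of that register. Register-word resources can serve NLP applications closely tied to real-world scenarios, but such resources are currently scarce. To address this, and based on 《大词林》, this paper completes three tasks ("register-word annotation", "register (word) chain annotation", and "parallel construction annotation") and builds a register classification resource founded on register words. The resource contains 55,710 words, 5,017 register chains, and 433 groups of parallel constructions. On this basis, the paper analyzes the general distribution of Chinese register words, their morphological differences, and the distribution of their word senses and parts of speech.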

2021

基于结构树库的状位动词语义分类及搭配库构建(Semantic Classification of Adverbial Verbs Based on Structure Tree Database and Construction of Collocation Database)
Tian Shao (邵田) | Shiquan Zhai (翟世权) | Gaoqi Rao (饶高琦) | Endong Xun (荀恩东)
Proceedings of the 20th Chinese National Conference on Computational Linguistics

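In general, a clause contains only one verb, but two verbs can also occur in the same clause. When two verbs appear consecutively in one clause, they may syntactically form adverbial-head, verb-complement, verb-object, serial-verb, or coordinate structures, and may semantically express modification, government, coordination, and other relations. Two consecutive verbs thus form relatively complex structural and semantic relations. Especially in the absence of formal markers, automatically identifying the structure of consecutive verbs and the semantic relation they express is a rather difficult problem for syntactic-semantic analysis in practical deployment. This paper therefore focuses on verbs that directly serve as adverbials. It extracts cases of two consecutive verbs from a large-scale structure treebank, disambiguates the corpus, extracts the verbs serving as adverbials, further subdivides them semantically, and finally builds the corresponding semantic collocation database. This provides a classification reference for linguistics proper and supplies more knowledge for deep syntactic-semantic analysis of Chinese.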

汉语语体特征的计量与分类研究(A study on the measurement and classification of Chinese stylistic features)
Qinqing Tai (邰沁清) | Gaoqi Rao (饶高琦)
Proceedings of the 20th Chinese National Conference on Computational Linguistics

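This paper uses corpus and statistical methods to quantitatively study the features of Chinese registers and goes on to perform automatic classification. First, one-way analysis of variance is used to describe the role and function of register features in distinguishing registers. Second, discriminative linguistic elements are selected to fit a logistic regression model, quantifying register expression forms and showing the importance of each feature to register composition; a categorical classification system of registers is then obtained through clustering. Finally, representative machine learning models are used as classifiers to explore how different feature combinations affect automatic register classification. The best classification results were obtained with the combined feature set "word 2n + part-of-speech 2n + punctuation 2n + linguistic features", with the random forest model reaching 97.25% accuracy.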

基于结构检索的汉语介动搭配知识库构建(Construction of Preposition-verb Knowledge Base Based on Structure Retrieval)
Chengwen Wang (王诚文) | Gaoqi Rao (饶高琦) | Endong Xun (荀恩东)
Proceedings of the 20th Chinese National Conference on Computational Linguistics

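Previous work on building preposition knowledge bases has emphasized preposition semantics and preposition-object collocations; few studies have systematically investigated preposition-verb collocations or acquired knowledge about them. Yet the richness of Chinese prepositions and the verb's role as the center of the sentence make preposition-verb collocation research important. Based on structure retrieval technology, and making full use of phrase-structure attributes and structural information, this study extracts 16,033 preposition-verb collocation pairs from a large-scale corpus. It also proposes a measure of preposition-verb collocation tightness, which preliminary analysis shows to be far superior to collocation measures that rely on absolute frequency.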

2020

中文问句的形式分类和资源建设(Formal classification and resource construction of Chinese questions)
Jiangtao Li (黎江涛) | Gaoqi Rao (饶高琦)
Proceedings of the 19th Chinese National Conference on Computational Linguistics

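This paper summarizes the role of question form in filtering question corpora and explores the formal features necessary for question classification. It builds a Chinese question classification corpus through manual annotation and, on that basis, conducts rule-based and statistical classification experiments, iteratively optimizing feature combinations over multiple rounds to form a feature rule set that provides a formal classification basis for current question answering. In the experiments, a finite-state automaton based on the optimized feature rule set achieves a macro-averaged F1 of 0.94. Among statistical machine learning models, the random forest performs well, with a macro-averaged F1 of 0.98, indicating that formal classification of questions is quite feasible and accurate.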

A Corpus Linguistic Perspective on the Appropriateness of Pop Songs for Teaching Chinese as a Second Language
Xiangyu Chi | Gaoqi Rao
Proceedings of the 6th Workshop on Natural Language Processing Techniques for Educational Applications

Language and music are closely related. Given their richness of linguistic features, pop songs are probably suitable for use as extracurricular materials in language teaching. To support this point, this paper presents the Contemporary Chinese Pop Lyrics (CCPL) corpus. Based on it, we investigated and evaluated the appropriateness of pop songs for Teaching Chinese as a Second Language (TCSL) with the assistance of Natural Language Processing methods, from the perspectives of Chinese character coverage, lexical coverage, and topic similarity. Some suggestions for teaching Chinese with the aid of pop lyrics are provided.

Overview of NLPTEA-2020 Shared Task for Chinese Grammatical Error Diagnosis
Gaoqi Rao | Erhong Yang | Baolin Zhang
Proceedings of the 6th Workshop on Natural Language Processing Techniques for Educational Applications

This paper presents the NLPTEA 2020 shared task for Chinese Grammatical Error Diagnosis (CGED), which seeks to identify grammatical error types, their range of occurrence, and recommended corrections within sentences written by learners of Chinese as a foreign language. We describe the task definition, data preparation, performance metrics, and evaluation results. Of the 30 teams registered for this shared task, 17 teams developed systems and submitted a total of 43 runs. System performance made significant progress, reaching F1 of 91% at the detection level, 40% at the position level, and 28% at the correction level. All data sets with gold standards and scoring scripts are made publicly available to researchers.

基于组块分析的汉语块依存语法(Chinese Chunk-Based Dependency Grammar)
Qingqing Qian (钱青青) | Chengwen Wang (王诚文) | Gaoqi Rao (饶高琦) | Endong Xun (荀恩东)
Proceedings of the 19th Chinese National Conference on Computational Linguistics

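Classical word-based dependency grammar runs into many difficulties caused by the properties of Chinese when it is applied to Chinese sentence analysis. To address this, this paper proposes a chunk-based dependency grammar for Chinese. Taking the predicate as the core and chunks as the object of study, it looks for the chunks governed by the predicate both within and across sentences, building a syntactic analysis framework at the sentence-group level. This does more than raise the linguistic unit of the leaf nodes: it also innovates in analysis method and analysis rules to suit the semantic characteristics of Chinese, handles micro-level logical-structure knowledge well, and lays the groundwork for meso-level argument knowledge and macro-level discourse knowledge. The paper mainly introduces the concept, representation, analysis method, and characteristics of chunk-based dependency grammar, and briefly describes the construction of the chunk dependency treebank. To date, the treebank contains 1.87 million characters (over 40,000 complex sentences and 100,000 clauses), of which 67% is news text and 32% is encyclopedia text.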

Proceedings of the 6th Workshop on Natural Language Processing Techniques for Educational Applications
Erhong YANG | Endong XUN | Baolin ZHANG | Gaoqi RAO
Proceedings of the 6th Workshop on Natural Language Processing Techniques for Educational Applications

2018

Overview of NLPTEA-2018 Share Task Chinese Grammatical Error Diagnosis
Gaoqi Rao | Qi Gong | Baolin Zhang | Endong Xun
Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications

This paper presents the NLPTEA 2018 shared task for Chinese Grammatical Error Diagnosis (CGED), which seeks to identify grammatical error types, their range of occurrence, and recommended corrections within sentences written by learners of Chinese as a foreign language. We describe the task definition, data preparation, performance metrics, and evaluation results. Of the 20 teams registered for this shared task, 13 teams developed systems and submitted a total of 32 runs. Progress in system performance was evident, reaching F1 of 36.12% at the position level and 25.27% at the correction level. All data sets with gold standards and scoring scripts are made publicly available to researchers.

2017

IJCNLP-2017 Task 1: Chinese Grammatical Error Diagnosis
Gaoqi Rao | Baolin Zhang | Endong Xun | Lung-Hao Lee
Proceedings of the IJCNLP 2017, Shared Tasks

This paper presents the IJCNLP 2017 shared task for Chinese grammatical error diagnosis (CGED), which seeks to identify grammatical error types and their range of occurrence within sentences written by learners of Chinese as a foreign language. We describe the task definition, data preparation, performance metrics, and evaluation results. Of the 13 teams registered for this shared task, 5 teams developed systems and submitted a total of 13 runs. We expect this evaluation campaign to lead to the development of more advanced NLP techniques for educational applications, especially for Chinese error detection. All data sets with gold standards and scoring scripts are made publicly available to researchers.

2016

Overview of NLP-TEA 2016 Shared Task for Chinese Grammatical Error Diagnosis
Lung-Hao Lee | Gaoqi Rao | Liang-Chih Yu | Endong Xun | Baolin Zhang | Li-Ping Chang
Proceedings of the 3rd Workshop on Natural Language Processing Techniques for Educational Applications (NLPTEA2016)

This paper presents the NLP-TEA 2016 shared task for Chinese grammatical error diagnosis, which seeks to identify grammatical error types and their range of occurrence within sentences written by learners of Chinese as a foreign language. We describe the task definition, data preparation, performance metrics, and evaluation results. Of the 15 teams registered for this shared task, 9 teams developed systems and submitted a total of 36 runs. We expect this evaluation campaign to lead to the development of more advanced NLP techniques for educational applications, especially for Chinese error detection. All data sets with gold standards and scoring scripts are made publicly available to researchers.
