Chen Gong


2024

pdf bib
Chinese Spoken Named Entity Recognition in Real-world Scenarios: Dataset and Approaches
Shilin Zhou | Zhenghua Li | Chen Gong | Lei Zhang | Yu Hong | Min Zhang
Findings of the Association for Computational Linguistics: ACL 2024

Spoken Named Entity Recognition (NER) aims to extract entities from speech. The extracted entities can help voice assistants better understand user’s questions and instructions. However, current Chinese Spoken NER datasets are laboratory-controlled data that are collected by reading existing texts in quiet environments, rather than natural spoken data, and the texts used for reading are also limited in topics. These limitations obstruct the development of Spoken NER in more natural and common real-world scenarios. To address this gap, we introduce a real-world Chinese Spoken NER dataset (RWCS-NER), encompassing open-domain daily conversations and task-oriented intelligent cockpit instructions. We compare several mainstream pipeline approaches on RWCS-NER. The results indicate that the current methods, affected by Automatic Speech Recognition (ASR) errors, do not perform satisfactorily in real settings. Aiming to enhance Spoken NER in real-world scenarios, we propose two approaches: self-training-asr and mapping then distilling (MDistilling). Experiments show that both approaches can achieve significant improvements, particularly MDistilling. Even compared with GPT4.0, MDistilling still reaches better results. We believe that our work will advance the field of Spoken NER in real-world settings.

pdf bib
MODDP: A Multi-modal Open-domain Chinese Dataset for Dialogue Discourse Parsing
Chen Gong | DeXin Kong | Suxian Zhao | Xingyu Li | Guohong Fu
Findings of the Association for Computational Linguistics: ACL 2024

Dialogue discourse parsing (DDP) aims to capture the relations between utterances in the dialogue. In everyday real-world scenarios, dialogues are typically multi-modal and cover open-domain topics. However, most existing widely used benchmark datasets for DDP contain only textual modality and are domain-specific. This makes it challenging to accurately and comprehensively understand the dialogue without multi-modal clues, and prevents them from capturing the discourse structures of the more prevalent daily conversations. This paper proposes MODDP, the first multi-modal Chinese discourse parsing dataset derived from open-domain daily dialogues, consisting 864 dialogues and 18,114 utterances, accompanied by 12.7 hours of video clips. We present a simple yet effective benchmark approach for multi-modal DDP. Through extensive experiments, we present several benchmark results based on MODDP. The significant improvement in performance from introducing multi-modalities into the original textual unimodal DDP model demonstrates the necessity of integrating multi-modalities into DDP.

pdf bib
Improving Chinese Named Entity Recognition with Multi-grained Words and Part-of-Speech Tags via Joint Modeling
Chenhui Dou | Chen Gong | Zhenghua Li | Zhefeng Wang | Baoxing Huai | Min Zhang
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Nowadays, character-based sequence labeling becomes the mainstream Chinese named entity recognition (CNER) approach, instead of word-based methods, since the latter degrades performance due to propagation of word segmentation (WS) errors. To make use of WS information, previous studies usually learn CNER and WS simultaneously with multi-task learning (MTL) framework, or treat WS information as extra guide features for CNER model, in which the utilization of WS information is indirect and shallow. In light of the complementary information inside multi-grained words, and the close connection between named entities and part-of-speech (POS) tags, this work proposes a tree parsing approach for joint modeling CNER, multi-grained word segmentation (MWS) and POS tagging tasks simultaneously. Specifically, we first propose a unified tree representation for MWS, POS tagging, and CNER.Then, we automatically construct the MWS-POS-NER data based on the unified tree representation for model training. Finally, we present a two-stage joint tree parsing framework. Experimental results on OntoNotes4 and OntoNotes5 show that our proposed approach of jointly modeling CNER with MWS and POS tagging achieves better or comparable performance with latest methods.

2022

pdf bib
MuCPAD: A Multi-Domain Chinese Predicate-Argument Dataset
Yahui Liu | Haoping Yang | Chen Gong | Qingrong Xia | Zhenghua Li | Min Zhang
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

During the past decade, neural network models have made tremendous progress on in-domain semantic role labeling (SRL). However, performance drops dramatically under the out-of-domain setting. In order to facilitate research on cross-domain SRL, this paper presents MuCPAD, a multi-domain Chinese predicate-argument dataset, which consists of 30,897 sentences and 92,051 predicates from six different domains. MuCPAD exhibits three important features. 1) Based on a frame-free annotation methodology, we avoid writing complex frames for new predicates. 2) We explicitly annotate omitted core arguments to recover more complete semantic structure, considering that omission of content words is ubiquitous in multi-domain Chinese texts. 3) We compile 53 pages of annotation guidelines and adopt strict double annotation for improving data quality. This paper describes in detail the annotation methodology and annotation process of MuCPAD, and presents in-depth data analysis. We also give benchmark results on cross-domain SRL based on MuCPAD.

2021

pdf bib
An In-depth Study on Internal Structure of Chinese Words
Chen Gong | Saihao Huang | Houquan Zhou | Zhenghua Li | Min Zhang | Zhefeng Wang | Baoxing Huai | Nicholas Jing Yuan
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Unlike English letters, Chinese characters have rich and specific meanings. Usually, the meaning of a word can be derived from its constituent characters in some way. Several previous works on syntactic parsing propose to annotate shallow word-internal structures for better utilizing character-level information. This work proposes to model the deep internal structures of Chinese words as dependency trees with 11 labels for distinguishing syntactic relationships. First, based on newly compiled annotation guidelines, we manually annotate a word-internal structure treebank (WIST) consisting of over 30K multi-char words from Chinese Penn Treebank. To guarantee quality, each word is independently annotated by two annotators and inconsistencies are handled by a third senior annotator. Second, we present detailed and interesting analysis on WIST to reveal insights on Chinese word formation. Third, we propose word-internal structure parsing as a new task, and conduct benchmark experiments using a competitive dependency parser. Finally, we present two simple ways to encode word-internal structures, leading to promising gains on the sentence-level syntactic parsing task.

pdf bib
数据标注方法比较研究:以依存句法树标注为例(Comparison Study on Data Annotation Approaches: Dependency Tree Annotation as Case Study)
Mingyue Zhou (周明月) | Chen Gong (龚晨) | Zhenghua Li (李正华) | Min Zhang (张民)
Proceedings of the 20th Chinese National Conference on Computational Linguistics

数据标注最重要的考虑因素是数据的质量和标注代价。我们调研发现自然语言处理领域的数据标注工作通常采用机标人校的标注方法以降低代价;同时,很少有工作严格对比不同标注方法,以探讨标注方法对标注质量和代价的影响。该文借助一个成熟的标注团队,以依存句法数据标注为案例,实验对比了机标人校、双人独立标注、及本文通过融合前两种方法所新提出的人机独立标注方法,得到了一些初步的结论。

2020

pdf bib
Multi-grained Chinese Word Segmentation with Weakly Labeled Data
Chen Gong | Zhenghua Li | Bowei Zou | Min Zhang
Proceedings of the 28th International Conference on Computational Linguistics

In contrast with the traditional single-grained word segmentation (SWS), where a sentence corresponds to a single word sequence, multi-grained Chinese word segmentation (MWS) aims to segment a sentence into multiple word sequences to preserve all words of different granularities. Due to the lack of manually annotated MWS data, previous work train and tune MWS models only on automatically generated pseudo MWS data. In this work, we further take advantage of the rich word boundary information in existing SWS data and naturally annotated data from dictionary example (DictEx) sentences, to advance the state-of-the-art MWS model based on the idea of weak supervision. Particularly, we propose to accommodate two types of weakly labeled data for MWS, i.e., SWS data and DictEx data by employing a simple yet competitive graph-based parser with local loss. Besides, we manually annotate a high-quality MWS dataset according to our newly compiled annotation guideline, consisting of over 9,000 sentences from two types of texts, i.e., canonical newswire (NEWS) and non-canonical web (BAIKE) data for better evaluation. Detailed evaluation shows that our proposed model with weakly labeled data significantly outperforms the state-of-the-art MWS model by 1.12 and 5.97 on NEWS and BAIKE data in F1.

2017

pdf bib
Multi-Grained Chinese Word Segmentation
Chen Gong | Zhenghua Li | Min Zhang | Xinzhou Jiang
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Traditionally, word segmentation (WS) adopts the single-grained formalism, where a sentence corresponds to a single word sequence. However, Sproat et al. (1997) show that the inter-native-speaker consistency ratio over Chinese word boundaries is only 76%, indicating single-grained WS (SWS) imposes unnecessary challenges on both manual annotation and statistical modeling. Moreover, WS results of different granularities can be complementary and beneficial for high-level applications. This work proposes and addresses multi-grained WS (MWS). We build a large-scale pseudo MWS dataset for model training and tuning by leveraging the annotation heterogeneity of three SWS datasets. Then we manually annotate 1,500 test sentences with true MWS annotations. Finally, we propose three benchmark approaches by casting MWS as constituent parsing and sequence labeling. Experiments and analysis lead to many interesting findings.