Hua Zheng


Morpheme Sense Disambiguation: A New Task Aiming for Understanding the Language at Character Level
Yue Wang | Hua Zheng | Yaqi Yin | Hansi Wang | Qiliang Liang | Yang Liu
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Morphemes serve as a strong linguistic feature for capturing lexical semantics: they offer higher coverage than words and are more natural than sememes. However, due to the lack of morpheme-informed resources and the expense of manual annotation, morpheme-enhanced methods remain largely unexplored in Computational Linguistics. To address this issue, we propose the task of Morpheme Sense Disambiguation (MSD), with two subtasks, in-text and in-word, which are similar to Word Sense Disambiguation (WSD) and Sememe Prediction (SP), to generalize morpheme features to more tasks. We first build the MorDis resource for Chinese, including MorInv as a morpheme inventory, and MorTxt and MorWrd as two types of morpheme-annotated datasets. Next, we provide two baselines in each evaluation; the best model yields a promising precision of 77.66% on in-text MSD and 88.19% on in-word MSD, indicating its comparability with WSD and superiority over SP. Finally, we demonstrate that predicted morphemes achieve performance comparable to the ground-truth ones on a downstream application, Definition Generation (DG). This validates the feasibility and applicability of our proposed tasks. The resources and workflow of MSD will provide new insights and solutions for downstream tasks, including DG, WSD, and the training of pre-trained models, among others.


Decompose, Fuse and Generate: A Formation-Informed Method for Chinese Definition Generation
Hua Zheng | Damai Dai | Lei Li | Tianyu Liu | Zhifang Sui | Baobao Chang | Yang Liu
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

In this paper, we tackle the task of Definition Generation (DG) in Chinese, which aims at automatically generating a definition for a word. Most existing methods take the source word as an indecomposable semantic unit. However, in parataxis languages like Chinese, word meanings can be composed using the word formation process, where a word (“桃花”, peach-blossom) is formed by formation components (“桃”, peach; “花”, flower) using a formation rule (Modifier-Head). Inspired by this process, we propose to enhance DG with word formation features. We build a formation-informed dataset, and propose a model DeFT, which Decomposes words into formation features, dynamically Fuses different features through a gating mechanism, and generaTes word definitions. Experimental results show that our method is both effective and robust.
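
The abstract mentions fusing different features "through a gating mechanism" without giving details. As a rough illustration only, the following sketch shows one common form of gated fusion between two vectors. It is not DeFT's actual architecture: the function name `gated_fuse`, the parameters `W` and `b`, and the choice of a sigmoid gate over the concatenated inputs are all assumptions made for this example.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def gated_fuse(word_vec, formation_vec, W, b):
    """Fuse two feature vectors with a learned element-wise gate.

    Hypothetical illustration, not the DeFT implementation.
    word_vec, formation_vec: arrays of shape (d,)
    W: gate weights of shape (d, 2 * d); b: gate bias of shape (d,)
    """
    # The gate looks at both inputs and outputs a value in (0, 1) per dimension.
    gate = sigmoid(W @ np.concatenate([word_vec, formation_vec]) + b)
    # Each dimension is a convex combination of the two inputs.
    return gate * word_vec + (1.0 - gate) * formation_vec


if __name__ == "__main__":
    d = 2
    word_vec = np.array([2.0, 0.0])
    formation_vec = np.array([0.0, 4.0])
    # With zero weights and bias the gate is 0.5 everywhere,
    # so the fusion reduces to a plain average of the two vectors.
    print(gated_fuse(word_vec, formation_vec, np.zeros((d, 2 * d)), np.zeros(d)))
```

Because the gate is computed from the inputs themselves, the mix between the two sources can change per example, which is what "dynamically" fusing features usually refers to.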

Inductively Representing Out-of-Knowledge-Graph Entities by Optimal Estimation Under Translational Assumptions
Damai Dai | Hua Zheng | Fuli Luo | Pengcheng Yang | Tianyu Liu | Zhifang Sui | Baobao Chang
Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021)

Conventional Knowledge Graph Completion (KGC) assumes that all test entities appear during training. However, in real-world scenarios, Knowledge Graphs (KGs) evolve fast, with out-of-knowledge-graph (OOKG) entities added frequently, and these entities need to be represented efficiently. Most existing Knowledge Graph Embedding (KGE) methods cannot represent OOKG entities without costly retraining on the whole KG. To enhance efficiency, we propose a simple and effective method that inductively represents OOKG entities by their optimal estimation under translational assumptions. Moreover, given pretrained embeddings of the in-knowledge-graph (IKG) entities, our method needs no additional learning at all. Experimental results on two KGC tasks with OOKG entities show that our method outperforms the previous methods by a large margin with higher efficiency.
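
To make "optimal estimation under translational assumptions" concrete, here is a minimal sketch under stated assumptions, not the paper's implementation. It assumes a TransE-style model in which a triple (h, r, t) satisfies h + r ≈ t, and a squared L2 error, for which the least-squares estimate of an unseen entity is the mean of the estimates implied by each of its triples. The function name `estimate_ookg_embedding` and the `(role, relation_id, neighbor_id)` triple format are hypothetical.

```python
import numpy as np


def estimate_ookg_embedding(triples, entity_emb, relation_emb):
    """Estimate an embedding for an unseen entity from pretrained embeddings.

    Hypothetical sketch assuming TransE (h + r ≈ t) and squared L2 error.
    triples: list of (role, relation_id, neighbor_id), where role says whether
             the unseen entity is the "head" or the "tail" of the triple.
    entity_emb: array (num_entities, d) of pretrained IKG entity embeddings.
    relation_emb: array (num_relations, d) of pretrained relation embeddings.
    """
    estimates = []
    for role, rel, nbr in triples:
        if role == "head":
            # e + r ≈ t  implies  e ≈ t - r
            estimates.append(entity_emb[nbr] - relation_emb[rel])
        elif role == "tail":
            # h + r ≈ e  implies  e ≈ h + r
            estimates.append(entity_emb[nbr] + relation_emb[rel])
        else:
            raise ValueError(f"unknown role: {role!r}")
    if not estimates:
        raise ValueError("at least one triple is required")
    # The mean minimizes the sum of squared distances to the per-triple estimates.
    return np.mean(estimates, axis=0)


if __name__ == "__main__":
    entity_emb = np.array([[1.0, 0.0], [0.0, 2.0]])
    relation_emb = np.array([[0.5, 0.5]])
    triples = [("head", 0, 0), ("tail", 0, 1)]
    print(estimate_ookg_embedding(triples, entity_emb, relation_emb))
```

The sketch involves no gradient steps: it only reads pretrained embeddings, which is consistent with the abstract's claim that no additional learning is needed once IKG embeddings are available.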

基于词信息嵌入的汉语构词结构识别研究(Chinese Word-Formation Prediction based on Representations of Word-Related Features)
Hua Zheng (郑婳) | Yaqi Yin (殷雅琦) | Yue Wang (王悦) | Damai Dai (代达劢) | Yang Liu (刘扬)
Proceedings of the 20th Chinese National Conference on Computational Linguistics


阅读分级相关研究综述(A Survey of Leveled Reading)
Simin Rao (饶思敏) | Hua Zheng (郑婳) | Sujian Li (李素建)
Proceedings of the 20th Chinese National Conference on Computational Linguistics


Leveraging Word-Formation Knowledge for Chinese Word Sense Disambiguation
Hua Zheng | Lei Li | Damai Dai | Deli Chen | Tianyu Liu | Xu Sun | Yang Liu
Findings of the Association for Computational Linguistics: EMNLP 2021

In parataxis languages like Chinese, word meanings are constructed using specific word-formations, which can help to disambiguate word senses. However, such knowledge is rarely explored in previous word sense disambiguation (WSD) methods. In this paper, we propose to leverage word-formation knowledge to enhance Chinese WSD. We first construct a large-scale Chinese lexical sample WSD dataset with word-formations. Then, we propose a model FormBERT to explicitly incorporate word-formations into sense disambiguation. To further enhance generalizability, we design a word-formation predictor module in case word-formation annotations are unavailable. Experimental results show that our method brings substantial performance improvement over strong baselines.

Cross-Lingual Leveled Reading Based on Language-Invariant Features
Simin Rao | Hua Zheng | Sujian Li
Findings of the Association for Computational Linguistics: EMNLP 2021

Leveled reading (LR) aims to automatically classify texts by the cognitive levels of readers, which is fundamental to providing appropriate reading materials for different reading abilities. However, most state-of-the-art LR methods rely on the availability of copious annotated resources, which prevents their adaptation to low-resource languages like Chinese. In our work, to tackle LR in Chinese, we explore how different language transfer methods perform on English-Chinese LR. Specifically, we focus on adversarial training and cross-lingual pre-training as methods to transfer the LR knowledge learned from annotated data in the resource-rich English language to Chinese. For evaluation, we first introduce an age-based standard to align datasets with different leveling standards. Then we conduct experiments in both zero-shot and few-shot settings. Comparing these two methods, quantitative and qualitative evaluations show that the cross-lingual pre-training method effectively captures the language-invariant features between English and Chinese. We also conduct an analysis and propose further improvements for cross-lingual LR.