Due to the high complexity of Discourse Dependency Parsing (DDP), existing annotation resources are relatively scarce compared with those of other NLP tasks, and different DDP tasks also differ significantly in annotation schema. These issues have left DDP in a low-resource dilemma. Thanks to the powerful cross-task learning capabilities of Large Language Models (LLMs), we can use LLMs to model dependency parsing under different annotation schemas in a unified manner, thereby alleviating the low-resource dilemma of DDP tasks. However, enabling LLMs to deeply comprehend dependency parsing remains an underexplored challenge. Inspired by the application of code-based methods to complex tasks, we propose a code-based unified dependency parsing method. We treat dependency parsing as a search over dependency paths and use code to represent this search process. Furthermore, we use a curriculum-learning-based instruction-tuning strategy for the joint training of multiple dependency parsing tasks. The experimental results show that our proposed code-based DDP system achieves good performance on two Chinese DDP tasks, with especially significant improvement on the DDP task with relatively little training data.
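A minimal sketch of how dependency parsing might be cast as a code-represented path search, in the spirit of the abstract above; the class and method names (DependencyPath, attach) and the relation labels are hypothetical illustrations, not the authors' actual implementation.

```python
# Illustrative sketch only: casting discourse dependency parsing as code that
# spells out a search over dependency paths. All names here are assumptions.
from dataclasses import dataclass, field

@dataclass
class Dependency:
    head: int        # index of the head EDU (0 = virtual root)
    dependent: int   # index of the dependent EDU
    relation: str    # discourse relation label under a given annotation schema

@dataclass
class DependencyPath:
    """A parse is built as a search over dependency paths from the root."""
    edus: list[str]
    arcs: list[Dependency] = field(default_factory=list)

    def attach(self, head: int, dependent: int, relation: str) -> "DependencyPath":
        # Each step attaches one EDU to an already-reachable head, so the
        # generated "program" encodes the path from the root to every EDU.
        self.arcs.append(Dependency(head, dependent, relation))
        return self

# An LLM tuned on such code would be asked to emit the attach(...) calls for a
# document; executing them reconstructs the dependency tree.
edus = ["[ROOT]", "EDU1: ...", "EDU2: ...", "EDU3: ..."]
parse = (DependencyPath(edus)
         .attach(0, 1, "root")
         .attach(1, 2, "elaboration")
         .attach(1, 3, "evaluation"))
print([(a.head, a.dependent, a.relation) for a in parse.arcs])
```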
Traditional Chinese characters are an important carrier of Chinese culture and are still actively used in many regions. Automatic conversion between traditional and simplified Chinese characters can help modern readers understand traditional culture and facilitate communication among different regions. Previous conversion methods rely on rule-based mapping or shallow feature-based machine learning models, which struggle with simplified characters that have multiple traditional origins, and constructing training data for them is costly. In this study, we propose an unsupervised adaptive context-aware conversion model that learns to convert between simplified and traditional Chinese characters under a denoising auto-encoder framework, requiring no labeled data. Our model includes a Latent Generative Adversarial Encoder that transforms vectors into a latent space with a generative adversarial network, which adds noise as an inevitable side effect; a Context-aware Semantic Reconstruction Decoder then restores the original input while considering a broader range of context with a pretrained language model. Additionally, we apply an early-exit mechanism during inference to reduce computational complexity and improve generalization. To test the effectiveness of our model, we construct a high-quality test dataset of simplified-traditional Chinese character text pairs. Experimental results and extensive analysis demonstrate that our model outperforms strong unsupervised baselines and yields better conversion results for one-to-many cases.
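A minimal PyTorch sketch of the encoder-decoder pipeline with early exit described above; module names (LatentGANEncoder, ContextDecoder), layer counts, dimensions, and the confidence threshold are illustrative assumptions, not the paper's exact architecture.

```python
# Hedged sketch under simplifying assumptions; not the authors' implementation.
import torch
import torch.nn as nn

class LatentGANEncoder(nn.Module):
    """Maps character vectors into a shared latent space; the discriminator is
    meant to be trained adversarially so simplified/traditional codes become
    indistinguishable, which introduces noise into the latent codes."""
    def __init__(self, dim=128):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.discriminator = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):
        return self.proj(x)

class ContextDecoder(nn.Module):
    """Reconstructs the original characters from noisy latent codes; each layer
    has its own output head so inference can exit early once confident."""
    def __init__(self, dim=128, vocab=6000, n_layers=4, threshold=0.9):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            for _ in range(n_layers))
        self.heads = nn.ModuleList(nn.Linear(dim, vocab) for _ in range(n_layers))
        self.threshold = threshold

    def forward(self, z):
        for layer, head in zip(self.layers, self.heads):
            z = layer(z)
            probs = head(z).softmax(-1)
            # Early exit: stop once every position's top prediction is confident.
            if probs.max(-1).values.min() > self.threshold:
                break
        return probs.argmax(-1)

enc, dec = LatentGANEncoder(), ContextDecoder()
chars = torch.randn(1, 10, 128)     # stand-in for character embeddings
print(dec(enc(chars)).shape)        # -> torch.Size([1, 10])
```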
Word segmentation is a fundamental step in understanding the Chinese language. Previous neural approaches to unsupervised Chinese Word Segmentation (CWS) exploit only shallow semantic information and can therefore miss important context. Large-scale pre-trained language models (PLMs) have achieved great success in many areas because of their ability to capture deep contextual semantic relations. In this paper, we propose to take advantage of the deep semantic information embedded in a PLM (e.g., BERT) in a self-training manner, which iteratively probes and transforms the semantic information in the PLM into explicit word segmentation ability. Extensive experimental results show that our proposed approach achieves state-of-the-art F1 scores on two CWS benchmark datasets.
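A hedged sketch of one probing step in such a self-training loop; the boundary heuristic (cosine similarity between adjacent characters' BERT hidden states), the threshold, and the assumption that the Chinese BERT tokenizer splits text into single characters are all illustrative, not the authors' actual method.

```python
# Illustrative probing step for self-training; names and heuristic are assumptions.
import torch
from transformers import BertTokenizerFast, BertModel

def probe_boundaries(text, tokenizer, model, threshold=0.7):
    """Cut a word boundary between adjacent characters whose contextual
    representations are dissimilar (low cosine similarity). Assumes the
    tokenizer splits Chinese text into one token per character."""
    enc = tokenizer(text, return_tensors="pt", add_special_tokens=True)
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0, 1:-1]   # drop [CLS]/[SEP]
    sims = torch.cosine_similarity(hidden[:-1], hidden[1:], dim=-1)
    segments, start = [], 0
    for i, sim in enumerate(sims):
        if sim < threshold:            # weak cohesion -> segment boundary
            segments.append(text[start:i + 1])
            start = i + 1
    segments.append(text[start:])
    return segments

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")
print(probe_boundaries("今天天气很好", tokenizer, model))
# A full self-training run would fine-tune a segmenter on these pseudo-labels
# and iterate, re-probing with the updated model each round.
```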
Deep learning-based Chinese zero pronoun resolution models have achieved better performance than traditional machine learning-based models. However, existing work on Chinese zero pronoun resolution has not yet integrated linguistic information well into deep learning-based models. This paper adopts a pre-training-based approach and integrates the semantic representations of a pre-trained Chinese semantic dependency graph parser into a Chinese zero pronoun resolution model. The experimental results on the OntoNotes-5.0 dataset show that our proposed model with the pre-trained Chinese semantic dependency parser improves the F-score by 0.4% over our baseline model and obtains better results than other deep learning-based Chinese zero pronoun resolution models. In addition, integrating BERT representations into our model further improves performance by 0.7% over the baseline.
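A minimal sketch of one simple way to fuse representations from a pre-trained semantic dependency parser and BERT into a zero pronoun resolution scorer; the class name, feature dimensions, and scoring layout are hypothetical, not the model described above.

```python
# Hedged fusion sketch; all names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class ZPResolutionScorer(nn.Module):
    def __init__(self, base_dim=256, sdp_dim=128, bert_dim=768):
        super().__init__()
        fused = base_dim + sdp_dim + bert_dim
        self.mlp = nn.Sequential(nn.Linear(2 * fused, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, zp_feats, cand_feats):
        """Score each candidate antecedent against the zero pronoun slot.
        Each feature vector is the concatenation of the baseline encoder
        output, the semantic-dependency-parser state, and the BERT embedding."""
        pair = torch.cat([zp_feats.expand_as(cand_feats), cand_feats], dim=-1)
        return self.mlp(pair).squeeze(-1)

scorer = ZPResolutionScorer()
zp = torch.randn(1, 256 + 128 + 768)      # features at the zero pronoun gap
cands = torch.randn(5, 256 + 128 + 768)   # five candidate antecedents
print(scorer(zp, cands).softmax(-1))      # distribution over candidates
```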