Yang Hou

2025

Dynamic Head Selection for Neural Lexicalized Constituency Parsing
Yang Hou | Zhenghua Li
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Lexicalized parsing, which associates constituent nodes with lexical heads, has historically played a crucial role in constituency parsing by bridging constituency and dependency structures. Nevertheless, with the advent of neural networks, lexicalized structures have generally been neglected in favor of unlexicalized, span-based methods. In this paper, we revisit lexicalized parsing and propose a novel latent lexicalization framework that dynamically infers lexical heads during training without relying on predefined head-finding rules. Our method enables the model to learn lexical dependencies directly from data, offering greater adaptability across languages and datasets. Experiments on multiple treebanks demonstrate state-of-the-art or comparable performance. We also analyze the learned dependency structures, headword preferences, and linguistic biases.

Inspired by early research on exploiting naturally annotated data for Chinese Word Segmentation (CWS), and by recent research on integrating speech and text processing, this work is the first to propose explicitly mining word boundaries from parallel speech-text data. We employ the Montreal Forced Aligner (MFA) toolkit to perform character-level alignment on speech-text data, which yields pauses as candidate word boundaries. Based on a detailed analysis of the collected pauses, we propose an effective probability-based strategy for filtering unreliable word boundaries. To utilize word boundaries more effectively as extra training data, we also propose a robust complete-then-train (CTT) strategy. We conduct cross-domain CWS experiments on two target domains, i.e., ZX and AISHELL2. We have annotated about 1K sentences as the evaluation data of AISHELL2. Experiments demonstrate the effectiveness of our proposed approach.

Data Augmentation for Cross-domain Parsing via Lightweight LLM Generation and Tree Hybridization
Ziyan Zhang | Yang Hou | Chen Gong | Zhenghua Li
Proceedings of the 31st International Conference on Computational Linguistics

Cross-domain constituency parsing remains a challenging task due to the lack of high-quality out-of-domain data. In this paper, we propose a data augmentation method via lightweight large language model (LLM) generation and tree hybridization. We use an LLM to generate phrase structures (subtrees) for the target domain by incorporating grammar rules and lexical head information into the prompt. To better leverage LLM-generated target-domain subtrees, we hybridize them with existing source-domain subtrees to efficiently produce a large number of structurally diverse instances. Experimental results demonstrate that our method achieves significant improvements on five target domains with a lightweight LLM generation cost.

A Probabilistic Toolkit for Multi-grained Word Segmentation in Chinese
Xi Ma | Yang Hou | Xuebin Wang | Zhenghua Li
Proceedings of the 31st International Conference on Computational Linguistics: System Demonstrations

It is practically useful to provide consistent and reliable word segmentation results from different criteria at the same time, which is formulated as the multi-grained word segmentation (MWS) task. This paper describes a probabilistic toolkit for MWS in Chinese. We propose a new MWS approach based on the standard MTL framework. We adopt semi-Markov CRF for single-grained word segmentation (SWS), which can produce marginal probabilities of words during inference. For sentences that contain conflicts among SWS results, we employ the CKY decoding algorithm to resolve conflicts. Our resulting MWS tree can provide the criteria information of words, along with the probabilities. Moreover, following prior work on SWS, we propose a simple strategy to exploit naturally annotated data for MWS, leading to substantial improvement of MWS performance in the cross-domain scenario.

Span-based Semantic Role Labeling as Lexicalized Constituency Tree Parsing
Yang Hou | Zhenghua Li
Findings of the Association for Computational Linguistics: ACL 2025

Semantic Role Labeling (SRL) is a critical task that focuses on identifying predicate-argument structures in sentences. Span-based SRL, a prominent paradigm, is often tackled using BIO-based or graph-based methods. However, these approaches often fail to capture the inherent relationship between syntax and semantics. While syntax-aware models have been proposed to address this limitation, they heavily rely on pre-existing syntactic resources, limiting their general applicability. In this work, we propose a lexicalized tree representation for span-based SRL, which integrates constituency and dependency parsing to explicitly model predicate-argument structures. By structurally representing predicates as roots and arguments as subtrees directly linked to the predicate, our approach bridges the gap between syntactic and semantic representations. Experiments on standard English benchmarks (CoNLL05 and CoNLL12) demonstrate that our model achieves competitive performance, with particular improvement in predicate-given settings.

Self-Correction Makes LLMs Better Parsers
Ziyan Zhang | Yang Hou | Chen Gong | Zhenghua Li
Findings of the Association for Computational Linguistics: EMNLP 2025

Large language models (LLMs) have achieved remarkable success across various natural language processing (NLP) tasks. However, recent studies suggest that they still face challenges in performing fundamental NLP tasks essential for deep language understanding, particularly syntactic parsing. In this paper, we conduct an in-depth analysis of LLM parsing capabilities, delving into the underlying causes of why LLMs struggle with this task and the specific shortcomings they exhibit. We find that LLMs may be limited in their ability to fully leverage grammar rules from existing treebanks, restricting their capability to generate syntactic structures. To help LLMs acquire knowledge without additional training, we propose a self-correction method that leverages grammar rules from existing treebanks to guide LLMs in correcting previous errors. Specifically, we automatically detect potential errors and dynamically search for relevant rules, offering hints and examples to guide LLMs in making corrections themselves. Experimental results on three datasets using various LLMs demonstrate that our method significantly improves performance in both in-domain and cross-domain settings.

2024

Character-Level Chinese Dependency Parsing via Modeling Latent Intra-Word Structure
Yang Hou | Zhenghua Li
Findings of the Association for Computational Linguistics: ACL 2024

Revealing the syntactic structure of sentences in Chinese poses significant challenges for word-level parsers due to the absence of clear word boundaries. To facilitate a transition from word-level to character-level Chinese dependency parsing, this paper proposes modeling latent internal structures within words. In this way, each word-level dependency tree is interpreted as a forest of character-level trees. A constrained Eisner algorithm is implemented to ensure the compatibility of character-level trees, guaranteeing a single root for intra-word structures and establishing inter-word dependencies between these roots. Experiments on Chinese treebanks demonstrate the superiority of our method over both the pipeline framework and previous joint models. A detailed analysis reveals that a coarse-to-fine parsing strategy empowers the model to predict more linguistically plausible intra-word structures.

High-order Joint Constituency and Dependency Parsing
Yanggan Gu | Yang Hou | Zhefeng Wang | Xinyu Duan | Zhenghua Li
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

This work revisits the topic of jointly parsing constituency and dependency trees, i.e., to produce compatible constituency and dependency trees simultaneously for input sentences, which is attractive considering that the two types of trees are complementary in representing syntax. The original work of Zhou and Zhao (2019) performs joint parsing only at the inference phase. They train two separate parsers under the multi-task learning framework (i.e., one shared encoder and two independent decoders). They design an ad-hoc dynamic programming-based decoding algorithm of O(n⁵) time complexity for finding optimal compatible tree pairs. Compared to their work, we make progress in three aspects: (1) adopting a much more efficient decoding algorithm of O(n⁴) time complexity, (2) exploring joint modeling at the training phase, instead of only at the inference phase, (3) proposing high-order scoring components to promote constituent-dependency interaction. We conduct experiments and analysis on seven languages, covering both rich-resource and low-resource scenarios. Results and analysis show that joint modeling leads to a modest overall performance boost over separate modeling, but substantially improves the complete matching ratio of whole trees, thanks to the explicit modeling of tree compatibility.

2021

The most straightforward approach to joint word segmentation (WS), part-of-speech (POS) tagging, and constituent parsing is converting a word-level tree into a char-level tree, which, however, leads to two severe challenges. First, a larger label set (e.g., ≥ 600) and longer inputs both increase computational costs. Second, it is difficult to rule out illegal trees containing conflicting production rules, which is important for reliable model evaluation. If a POS tag (like VV) is above a phrase tag (like VP) in the output tree, it becomes quite complex to decide word boundaries. To deal with both challenges, this work proposes a two-stage coarse-to-fine labeling framework for joint WS-POS-PAR. In the coarse labeling stage, the joint model outputs a bracketed tree, in which each node corresponds to one of four labels (i.e., phrase, subphrase, word, subword). The tree is guaranteed to be legal via constrained CKY decoding. In the fine labeling stage, the model expands each coarse label into a final label (such as VP, VP*, VV, VV*). Experiments on Chinese Penn Treebank 5.1 and 7.0 show that our joint model consistently outperforms the pipeline approach both with and without BERT, and achieves new state-of-the-art performance.
