Unsupervised constituency parsing focuses on identifying word sequences that form a syntactic unit (i.e., constituents) in target sentences. Linguists identify the constituent by evaluating a set of Predicate-Argument Structure (PAS) equivalent sentences where we find the constituent appears more frequently than non-constituents (i.e., the constituent corresponds to a frequent word sequence within the sentence set). However, such frequency information is unavailable in previous parsing methods that identify the constituent by observing sentences with diverse PAS. In this study, we empirically show that constituents correspond to frequent word sequences in the PAS-equivalent sentence set. We propose a frequency-based parser, span-overlap, that (1) computes the span-overlap score as the word sequence’s frequency in the PAS-equivalent sentence set and (2) identifies the constituent structure by finding a constituent tree with the maximum span-overlap score. The parser achieves state-of-the-art level parsing accuracy, outperforming existing unsupervised parsers in eight out of ten languages. Additionally, we discover a multilingual phenomenon: participant-denoting constituents tend to have higher span-overlap scores than equal-length event-denoting constituents, meaning that the former tend to appear more frequently in the PAS-equivalent sentence set than the latter. The phenomenon indicates a statistical difference between the two constituent types, laying the foundation for future labeled unsupervised parsing research.
Previous methods based on Large Language Models (LLM) perform unsupervised dependency parsing by maximizing bi-lexical dependence scores. However, these previous methods adopt dependence scores that are difficult to interpret. These methods cannot incorporate grammatical constraints that previous grammar-based parsing research has shown beneficial to improving parsing performance. In this work, we apply Conditional Mutual Information (CMI), an interpretable metric, to measure the bi-lexical dependence and incorporate grammatical constraints into LLM-based unsupervised parsing. We incorporate Part-Of-Speech information as a grammatical constraint at the CMI estimation stage and integrate two additional grammatical constraints at the subsequent tree decoding stage. We find that the CMI score positively correlates with syntactic dependencies and has a stronger correlation with the syntactic dependencies than baseline scores. Our experiment confirms the benefits and applicability of the proposed grammatical constraints across five languages and eight datasets. The CMI parsing model outperforms state-of-the-art LLM-based models and similarly constrained grammar-based models. Our analysis reveals that the CMI model is strong in retrieving dependency relations with rich lexical interactions but is weak in retrieving relations with sparse lexical interactions, indicating a potential limitation in CMI-based unsupervised parsing methods.
In this paper, we propose a mixture model-based end-to-end method to model the syntactic-semantic dependency correlation in Semantic Role Labeling (SRL). Semantic dependencies in SRL are modeled as a distribution over semantic dependency labels conditioned on a predicate and an argument word. The semantic label distribution varies depending on Shortest Syntactic Dependency Path (SSDP) hop patterns. We target the variation of semantic label distributions using a mixture model, separately estimating semantic label distributions for different hop patterns and probabilistically clustering hop patterns with similar semantic label distributions. Experiments show that the proposed method successfully learns a cluster assignment reflecting the variation of semantic label distributions. Modeling the variation improves performance in predicting short distance semantic dependencies, in addition to the improvement on long distance semantic dependencies that previous syntax-aware methods have achieved. The proposed method achieves a small but statistically significant improvement over baseline methods in English, German, and Spanish and obtains competitive performance with state-of-the-art methods in English.