Rule Augmented Unsupervised Constituency Parsing

Recently, unsupervised parsing of syntactic trees has gained considerable attention. A prototypical approach to such unsupervised parsing employs reinforcement learning and auto-encoders. However, no mechanism ensures that the learnt model leverages the well-understood grammar of the language. We propose an approach that utilizes very generic linguistic knowledge of the language, present in the form of syntactic rules, to induce better syntactic structures. We introduce a novel formulation that takes advantage of syntactic grammar rules and is independent of the base system. We achieve new state-of-the-art results on two benchmark datasets, MNLI and WSJ. The source code of the paper is available at https://github.com/anshuln/Diora_with_rules.


Introduction
Syntactic parse trees have demonstrated their importance in several downstream NLP applications such as machine translation (Eriguchi et al., 2017; Zaremoodi and Haffari, 2018), natural language inference (NLI) (Choi et al., 2018), relation extraction (Gamallo et al., 2012), and text classification (Tai et al., 2015). Based on linguistic theories that have promoted the usefulness of tree-based representations of natural language text, tree-based models such as the Tree-LSTM have been proposed to learn sentence representations (Socher et al., 2011). Inspired by Tree-LSTM based models, many approaches have been proposed that do not require parse tree supervision (Yogatama et al., 2017; Choi et al., 2018; Maillard et al., 2019; Drozdov et al., 2019). However, Williams et al. (2018) and Sahay et al. (2021) have shown that these methods cannot learn meaningful structure (not even simple grammar), though they perform well on NLI tasks. Recently, there has been a surge in approaches using weak supervision in the form of rules for various tasks such as sequence classification (Safranchik et al., 2020) and text classification (Chatterjee et al., 2020; Maheshwari et al., 2020). These approaches have demonstrated the importance of external knowledge in both unsupervised and supervised setups. To the best of our knowledge, previous work on syntactic parse trees has not utilized such external information. In this paper, we propose an approach that leverages linguistic (and potentially domain agnostic) knowledge in the form of explicit syntactic grammar rules while building upon a state-of-the-art, deep, unsupervised inside-outside recursive autoencoder (DIORA; Drozdov et al., 2019). DIORA is an unsupervised model that uses inside-outside dynamic programming to compose latent representations from all possible binary trees.
We extend DIORA and propose a framework that harnesses grammar rules to learn constituency parse trees. We use context free grammar (CFG) productions for the English language (such as NP → VP NP, PP → IN NP, etc.) as rules. Note that the construction of such a rule set is a one-time effort, and our method is independent of any underlying dataset. The rule sets used are available in our GitHub repository. In summary, our main contributions are: (a) a framework (cf., Section 3) that uses (potentially domain agnostic), off-the-shelf CFG rules to learn to produce constituency parse trees; (b) two rule-aware loss functions (cf., Section 3.1) that maximize a form of agreement between the unsupervised model and the rule-based model; and (c) experimental analysis (cf., Section 4) demonstrating improvements on unsupervised constituency parsing over the previous state-of-the-art by over 3% on two benchmark datasets.

Background and Related Work
A brief survey of latent tree learning models is covered in Williams et al. (2018). Prior work has induced constituency trees (Brill et al., 1990; Ando and Lee, 2000) using dependency parsers (Klein and Manning, 2004) and the inside-outside parsing algorithm (Drozdov et al., 2019). Recently, Drozdov et al. (2019) proposed an unsupervised latent chart tree parsing algorithm, viz., DIORA, that uses the inside-outside algorithm for parsing and has an autoencoder-based neural network trained to reconstruct the input sentence. DIORA is trained end to end as a masked language model via word prediction. As of date, DIORA is the state-of-the-art approach to unsupervised constituency parsing.

[Figure 1: For the input 'The cat sat', DIORA computes a compatibility score e(i, j) for each pair of neighboring constituents. l(i, j) is computed using the triggered rules for each span and interacts with the compatibility score in our loss function, as explained in Section 3.1.]
Exploiting additional semantic and syntactic information as a source of guidance, rather than as the primary objective, has been discussed since the 1990s (Sun et al., 1993). Recently, Kim et al. (2019b) proposed to learn CFG rules and their probabilities by parameterizing terminal and non-terminal symbols with neural networks. In contrast, our approach leverages predefined language CFG rules and provides a mechanism for augmenting an existing (state-of-the-art) inside-outside algorithm with such external knowledge.
More specifically, we augment DIORA (Drozdov et al., 2019) with CFG rules to reconstruct the input by exploiting syntactic information of the language. We next provide some technical details of the inside-outside algorithm of DIORA.

DIORA
DIORA learns constituency trees from raw input text using an unsupervised training procedure that operates like a masked language model or denoising autoencoder. It encodes the entire input sequence into a single vector, analogous to the encoding step of an autoencoder. Thereafter, the decoder is trained to reconstruct each input word. We next describe the inside and the outside pass of DIORA, respectively.

Inside Pass: Given an input sentence with T tokens x_0, x_1, . . . , x_{T-1}, DIORA computes a compatibility score e and a composition vector a for each span. For a span k, it composes the vector a(k) by weighing over all possible pairs of constituents (i, j) that make up k:

a(k) = Σ_{(i,j) ∈ pairs(k)} ê(i, j) · compose(a(i), a(j))

That is, the composition vector a(k) is a weighted sum over all possible constituent pairs of k. Here, ê(i, j) is the normalized compatibility score, computed with a bilinear function of the vectors a(i) and a(j) from the neighboring spans. The composition function is a TreeLSTM or a multi-layer perceptron (MLP).
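The inside pass above can be sketched as a chart-filling loop. This is a minimal illustration, not the authors' implementation: the `compose` and `score` callables stand in for DIORA's learned TreeLSTM/MLP composition and bilinear compatibility functions.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def inside_pass(token_vecs, compose, score):
    """Sketch of an inside pass over all binary bracketings.
    token_vecs: (T, d) array of leaf vectors.
    compose(ai, aj): combines two child vectors into a parent vector.
    score(ai, aj): scalar compatibility for a pair of children.
    Returns dicts a, e keyed by inclusive span (start, end)."""
    T = len(token_vecs)
    a, e = {}, {}
    for i in range(T):
        a[(i, i)] = token_vecs[i]
        e[(i, i)] = 0.0
    for length in range(2, T + 1):
        for i in range(T - length + 1):
            j = i + length - 1
            pairs = [((i, m), (m + 1, j)) for m in range(i, j)]
            # the net score of a split also accumulates the children's scores
            raw = np.array([score(a[l], a[r]) + e[l] + e[r] for l, r in pairs])
            w = softmax(raw)  # e_hat: normalized over the splits of this cell
            a[(i, j)] = sum(w[n] * compose(a[l], a[r])
                            for n, (l, r) in enumerate(pairs))
            e[(i, j)] = float((w * raw).sum())
    return a, e
```

For instance, with `compose` as vector averaging and `score` as a dot product, the chart fills bottom-up exactly as described, producing one vector and one score per span.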
Outside Pass: The outside pass of DIORA computes an outside vector b(k) representing the constituents outside the span x_{i:j}. It computes the value for a target span (i, j) recursively from its sibling span (j + 1, k) and the outside spans (0, i − 1) and (k + 1, T − 1).
Training and Inference: DIORA is trained end to end as a masked language model via word prediction. A masked token x_i is predicted from its outside vector b(i). The training objective uses a reconstruction-based max-margin loss to predict the original input x_i:

L_rec = Σ_i Σ_{i*} max(0, 1 − b(i) · x_i + b(i) · x_{i*})

where x_{i*} ranges over negative sample tokens. The chart filling procedure of DIORA is used to extract binary unlabeled parse trees. It uses the CYK algorithm to find the maximal scoring tree in a greedy manner. For each cell of the parse table, the algorithm computes the span (i, j) with the maximal net compatibility score, computed recursively by summing the maximum compatibility score e(a, b) for each constituent of the span.
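The greedy CYK-style extraction can be sketched as follows. This is an illustrative dynamic program, not DIORA's code: `pair_score` stands in for the learned compatibility e between two adjacent spans.

```python
def extract_tree(T, pair_score):
    """Sketch of maximal-scoring binary tree extraction.
    pair_score(l, r): compatibility of merging spans l and r, each an
    inclusive (start, end) tuple. Returns the best bracketing as nested
    tuples, with leaves given by their token index."""
    best, back = {}, {}
    for i in range(T):
        best[(i, i)] = 0.0
    for length in range(2, T + 1):
        for i in range(T - length + 1):
            j = i + length - 1
            # pick the split whose net score (merge score plus the best
            # scores of the two children) is maximal
            scores = [(pair_score((i, m), (m + 1, j))
                       + best[(i, m)] + best[(m + 1, j)], m)
                      for m in range(i, j)]
            best[(i, j)], back[(i, j)] = max(scores)

    def build(i, j):
        if i == j:
            return i
        m = back[(i, j)]
        return (build(i, m), build(m + 1, j))

    return build(0, T - 1)
```

A score function that rewards merging tokens 1 and 2 first yields the bracketing `(0, (1, 2))` for a three-token input, mirroring how the relative order of compatibility scores alone determines the output tree.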

Our Approach to Rule Augmentation
Our goal is to learn to produce constituency parse trees using input sentences alone, in the absence of ground truth parse trees. We introduce a rule-augmented unsupervised model that leverages generic (potentially domain agnostic) production rules of the language grammar to infer constituency trees. Since most grammar rules for constituency parsing are generic, designing them can be a one-time effort, while their benefits can be leveraged across domains as background knowledge (as we will see in our experiments in Section 4). As described in Section 2.1, the induction of latent trees in DIORA is based on a CYK-like parsing algorithm that uses the compatibility scores e(i, j) at each cell to merge two constituents in the final tree. We impart supervision through the production rules of the English language grammar.
For each sentence, we associate CFG production rules with constituent pairs (i, j) in a CYK parse table. We curate a set of domain-agnostic rules of the form X → Y Z and a dictionary of the form X → x, where X, Y, Z are non-terminals and x is a terminal. Concretely, X represents constituent tags such as S, NP, VP, etc., while x represents words in the vocabulary. Running the CYK parsing algorithm with our rule set over each sentence, we first determine which rules are triggered at each cell. Whenever a rule r is triggered for a span (i, j), we assign the weak label δ_(i,j)(r) = 1, and 0 otherwise. We use these weak labels to guide the rule scores l(i, j) for the constituents. The rule score for the span (i, j) is defined as:

l(i, j) ∝ Σ_{p=1}^{P} r_p · δ_(i,j)(r_p)

where the r_p are learned weights associated with each of the production rules and P is the total number of rules. The score is normalized to sum to 1 over all spans belonging to a particular cell in the CYK parse table. Intuitively, we aim to align the e(i, j) and l(i, j) scores to maximize the agreement between the model and the rules. We note that we use rules only to augment the training objective; our inference procedure is identical to that of DIORA.
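The rule-triggering step can be sketched as a CYK recognizer over the CNF rule set. This is a simplified illustration under assumed data structures (a word-to-tag lexicon and a list of binary productions), not the paper's implementation:

```python
from collections import defaultdict

def triggered_rules(tokens, lexicon, rules):
    """Sketch of CYK recognition to find which CNF rules fire per span.
    lexicon: word -> set of POS tags (the X -> x dictionary).
    rules: list of (X, Y, Z) productions, i.e. X -> Y Z.
    Returns labels[(i, j)] = set of rules triggered on span (i, j)
    (the spans with delta = 1) and nonterms[(i, j)] = nonterminals
    derivable over that span."""
    T = len(tokens)
    nonterms = defaultdict(set)
    labels = defaultdict(set)
    for i, w in enumerate(tokens):
        nonterms[(i, i)] = set(lexicon.get(w, ()))
    for length in range(2, T + 1):
        for i in range(T - length + 1):
            j = i + length - 1
            for m in range(i, j):  # every split point of the span
                for (X, Y, Z) in rules:
                    if Y in nonterms[(i, m)] and Z in nonterms[(m + 1, j)]:
                        nonterms[(i, j)].add(X)
                        labels[(i, j)].add((X, Y, Z))
    return labels, nonterms
```

The weak labels δ_(i,j)(r) are then 1 exactly for the rules in `labels[(i, j)]`; combining them with the learned rule weights and normalizing within each cell yields the l(i, j) scores described above.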

Training Objective
We learn a model that minimizes the overall loss L, a composition of the reconstruction loss L_rec and the rule-based agreement loss L_rule: L = L_rec + λ L_rule. We propose two alternatives for the loss function L_rule.

Cross entropy (CE) - For each cell k in the CYK parse table, this loss tries to match the distribution of the scores e(i, j) induced by DIORA with the distribution l(i, j) induced by the background knowledge:

L_rule = − Σ_k Σ_{(i,j) ∈ k} l(i, j) log ê(i, j)

Ranking Loss (RL) - We recall from Section 2.1 that the CYK algorithm finds the maximal scoring tree in a greedy manner based on the highest compatibility score e(i, j) among all spans. Since the final parse tree output by DIORA relies only on the relative order of the e(i, j) to decide which span to merge, we propose an alternative rule-based loss that aims to match the relative order induced by the compatibility scores e(i, j) of DIORA at each cell with the order induced by the rule scores l(i, j) at that cell. We achieve this through a pairwise ranking loss:

L_rule = Σ_{k ∈ {k_trig}} Σ_{(i,j),(i′,j′) ∈ k : l(i,j) > l(i′,j′)} max(0, γ − (e(i, j) − e(i′, j′)))

The set {k_trig} consists of all cells that have at least one rule triggered. In cases where our rule set is not extensive enough, we would like our model's compatibility score to rely more on the reconstruction loss, and restricting the loss to {k_trig} ensures that a sparse rule set does not degrade performance.
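The two alternatives can be sketched per cell as follows. This is a plain-NumPy illustration of one plausible reading of the losses above (the real model would compute these over learned tensors):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def ce_rule_loss(e_cell, l_cell):
    """Cross-entropy between the rule distribution l and the model's
    softmax-normalized compatibility distribution over one chart cell."""
    p = softmax(np.asarray(e_cell, dtype=float))
    l = np.asarray(l_cell, dtype=float)
    return float(-(l * np.log(p + 1e-12)).sum())

def ranking_rule_loss(e_cell, l_cell, margin=1.0):
    """Pairwise ranking sketch: whenever the rules score span u above
    span v (l_u > l_v), penalize the model unless e_u exceeds e_v by
    at least the margin."""
    loss = 0.0
    for u in range(len(l_cell)):
        for v in range(len(l_cell)):
            if l_cell[u] > l_cell[v]:
                loss += max(0.0, margin - (e_cell[u] - e_cell[v]))
    return loss
```

Note that the ranking loss is zero whenever the model already orders the spans as the rules do (with sufficient margin), which matches the intuition that only the relative order of e(i, j) matters for the extracted tree.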

Experiments
We evaluate our rule augmented model and compare it against baselines on the tasks of unsupervised parsing, unsupervised segment recall, and phrase similarity.

Data
We evaluate our model on two datasets: the Wall Street Journal (WSJ) and MultiNLI. WSJ is a section of the Penn Treebank (Marcus et al., 2002) containing human-annotated constituency parse trees. MultiNLI uses Stanford-parser-generated parse trees (Manning et al., 2014) as ground truth. MultiNLI was originally designed for evaluating NLI, but is often also utilized to evaluate constituency parse trees. We train on the complete NLI dataset, which is a combination of the MultiNLI and SNLI train sets. We evaluate model performance on the MultiNLI dev set and the WSJ test set (section 23), following the experimental setting and evaluation metrics of Drozdov et al. (2019). Further details are provided in the appendix. We initialize our model with the trained weights of DIORA and evaluate on unsupervised constituency parsing and segment recall. We also perform post-processing (PP) of the generated trees by attaching trailing punctuation to the root node, exactly as done by Drozdov et al. (2019).

Rule Set
We consider two rule sets: (i) a set of Handcrafted Rules (HR), consisting of 2500 human-created CNF production rules; (ii) to assess the robustness of the rule-augmented method to the preciseness of the rule set, we present a comparison using a set of Automated Rules (AR), consisting of the 2500 most frequently occurring CNF production rules extracted from the trees of the automatically parsed (using the Stanford CoreNLP parser) SNLI corpus. Further details about these rule sets can be found in the appendix. We also use a train-set specific dictionary containing the POS (part-of-speech) tags of words in the training vocabulary for the terminal CFG productions used in CYK parsing.

Unsupervised Parsing
In Tables 1 and 2, we present a comparison between different approaches on the MultiNLI dev set and the WSJ test set. We observe that our rule augmented approach outperforms the state of the art with respect to the max-F1 score, registering maximum increases of 3.4 and 3.1 F1 points over DIORA, respectively. The HR-trained models outperform DIORA on both datasets, demonstrating that rule creation is indeed a one-time process and independent of domain. We also report the parsing scores of SPINN, a fully supervised model, for reference.

Constituency Segment Recall
In Table 3, we present the breakdown of constituent recall across the 6 most common constituent types. Our approach achieves the highest recall across all types and is the only model to perform effectively on SBAR and NP. Unlike other approaches, ours is consistently close to, or achieves, the best recall score. We observe that rule augmentation using HR is more beneficial than AR with respect to precise evaluation measures such as constituency segment recall and phrase recall, but yields smaller improvements than AR with respect to looser evaluation measures such as the max F1 of unsupervised parsing. This can possibly be attributed to our observation that the (most frequent) rules extracted from SNLI have higher coverage (around 25% more) on the training set than HR, but appear to be semantically less precise.

Phrase Similarity
We also employ the phrase similarity evaluation followed by Drozdov et al. (2019). Phrase similarity scores measure a model's capability to learn meaningful representations for spans of text. Generally, most models focus on generating token representations and then use ad-hoc arithmetic operations to generate representations for larger spans of text, thus losing the essence of the context that ties together the words of the span.
To evaluate on the phrase similarity task, we consider two datasets of labeled phrases: 1) CoNLL 2000 (Tjong Kim Sang and Buchholz, 2000), a shallow parsing dataset containing spans of verb phrases, noun phrases, prepositional phrases, etc., and 2) CoNLL 2012 (Pradhan et al., 2012), a named entity dataset containing 19 different entity types. For the evaluation routine, we first generate the phrase representation of each labeled span whose length is greater than one. Cosine similarity is then used to obtain its similarity score with respect to all other labeled spans. We then check whether the label of the query span matches the labels of each of its K most similar spans in the dataset. In Table 4 we report precision@K for both datasets and various values of K. The baseline numbers are computed using the weights of DIORA provided by the authors.
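The evaluation routine above can be sketched as follows; this is a generic precision@K implementation under the stated protocol, not the authors' evaluation script:

```python
import numpy as np

def precision_at_k(reps, labels, K):
    """Sketch of the precision@K evaluation: for each labeled span,
    retrieve the K most cosine-similar other spans and count how many
    share the query's label.
    reps: (N, d) array of phrase representations; labels: length-N list."""
    X = np.asarray(reps, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit vectors
    sim = X @ X.T                       # pairwise cosine similarities
    np.fill_diagonal(sim, -np.inf)      # exclude the query span itself
    hits, total = 0, 0
    for q in range(len(labels)):
        topk = np.argsort(-sim[q])[:K]  # K nearest neighbors of the query
        hits += sum(labels[t] == labels[q] for t in topk)
        total += K
    return hits / total
```

A score of 1.0 means every retrieved neighbor carries the same label as its query, i.e. same-type phrases cluster tightly in the representation space.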

Conclusion
In this work, we leverage linguistically grounded, domain agnostic CFG rules to induce parse trees and representations of constituent spans. We show that our approach, augmented with generic, linguistically grounded grammar rules, outperforms previous methods on constituency parsing and obtains higher segment recall.