StructFormer: Joint Unsupervised Induction of Dependency and Constituency Structure from Masked Language Modeling

There are two major classes of natural language grammar: the dependency grammar, which models one-to-one correspondences between words, and the constituency grammar, which models the grouping of one or more words into nested constituents. While previous unsupervised parsing methods mostly focus on inducing only one class of grammar, we introduce a novel model, StructFormer, that can simultaneously induce both dependency and constituency structure. To achieve this, we propose a new parsing framework that can jointly generate a constituency tree and a dependency graph. We then integrate the induced dependency relations into the transformer, in a differentiable manner, through a novel dependency-constrained self-attention mechanism. Experimental results show that our model can achieve strong results on unsupervised constituency parsing, unsupervised dependency parsing, and masked language modeling at the same time.


Introduction
Human languages have a rich latent structure. This structure is multifaceted, with dependency and constituency structures being the two major classes of grammar. There has been an exciting breadth of recent work targeted at learning this structure in a data-driven, unsupervised fashion (Klein and Manning, 2002; Klein, 2005; Le and Zuidema, 2015; Shen et al., 2018c; Kim et al., 2019a). The core principle behind recent methods that induce structure from data is simple: provide an inductive bias that is conducive for structure to emerge as a byproduct of some self-supervised training, e.g., language modeling. To this end, a wide range of models have been proposed that are able to successfully learn grammar structures (Shen et al., 2018a,c; Wang et al., 2019; Kim et al., 2019b,a). However, most of these works focus on inducing either constituency or dependency structures alone.
In this paper, we make two important technical contributions. First, we introduce a new neural model, StructFormer, that is able to simultaneously induce both dependency structure and constituency structure. Specifically, our approach aims to unify latent structure induction of different types of grammar within the same framework. Second, StructFormer is able to induce dependency structures from raw data in an end-to-end unsupervised fashion. Most existing approaches induce dependency structures from other syntactic information, like gold POS tags (Klein and Manning, 2004; Cohen and Smith, 2009; Jiang et al., 2016). Previous works that train from words alone often require additional information, like pre-trained word clustering (Spitkovsky et al., 2011), pre-trained word embeddings (He et al., 2018), acoustic cues (Pate and Goldwater, 2013), or annotated data from related languages (Cohen et al., 2011).
We introduce a new inductive bias that enables Transformer models to induce a directed dependency graph in a fully unsupervised manner. To avoid the need for grammar labels during training, we use a distance-based parsing mechanism, which predicts a sequence of syntactic distances T (Shen et al., 2018b) and a sequence of syntactic heights ∆ (Luo et al., 2019) that together represent the dependency graph and the constituency tree. Examples of ∆ and T are illustrated in Figure 1a. Based on the syntactic distances T and syntactic heights ∆, we provide a new dependency-constrained self-attention layer to replace the multi-head self-attention layer in the standard transformer model. More specifically, each new attention head can only attend to its parent (to avoid confusion with the self-attention head, we use "parent" to denote the "head" of a dependency relation) or its dependents in the predicted dependency structure, through a weighted sum of the relations shown in Figure 1b. In this way, we replace the complete graph in the standard transformer model with a differentiable directed dependency graph. During training on a downstream task (e.g., masked language modeling), the model gradually converges to a reasonable dependency graph via gradient descent.

Figure 1: An example of our parsing mechanism and dependency-constrained self-attention mechanism. (a) Syntactic distances T (grey bars) and syntactic heights ∆ (white bars); in this example, like is the parent (head) of the constituents (like cats) and (I like cats). (b) Two types of dependency relations: the parent distribution allows each token to attend to its parent, and the dependent distribution allows each token to attend to its dependents. For example, the parent of cats is like, while cats and I are dependents of like. Each attention head receives a different weighted sum of these relations. The parsing network first predicts T and ∆ to represent the latent structure of the input sentence I like cats; the parent and dependent relations are then computed from T and ∆ in a differentiable manner.
Incorporating the new parsing mechanism, the dependency-constrained self-attention, and the Transformer architecture, we introduce a new model named StructFormer. The proposed model can perform unsupervised dependency and constituency parsing at the same time, and can leverage the parsing results to achieve strong performance on masked language model tasks.

Related Work
Previous works on unsupervised dependency parsing are primarily based on the dependency model with valence (DMV) (Klein and Manning, 2004) and its extensions (Daumé III, 2009; Gillenwater et al., 2010). To effectively learn the DMV model for better parsing accuracy, a variety of inductive biases and handcrafted features, such as correlations between parameters of grammar rules involving different part-of-speech (POS) tags, have been proposed to incorporate prior information into learning. The most recent progress is the neural DMV model (Jiang et al., 2016), which uses a neural network to predict grammar rule probabilities based on distributed representations of POS tags. However, most existing unsupervised dependency parsing algorithms require gold POS tags to be provided as input. These gold POS tags are labeled by humans and can be potentially difficult (or prohibitively expensive) to obtain for large corpora. Spitkovsky et al. (2011) proposed to overcome this problem with unsupervised word clustering that can dynamically assign tags to each word given its context. He et al. (2018) overcame the problem by combining the DMV model with an invertible neural network to jointly model discrete syntactic structure and continuous word representations.
Unsupervised constituency parsing has recently received more attention. PRPN (Shen et al., 2018a) and ON-LSTM (Shen et al., 2018c) induce tree structure by introducing an inductive bias to recurrent neural networks. PRPN proposes a parsing network to compute the syntactic distance of all word pairs, while a reading network uses the syntactic structure to attend to relevant memories. ON-LSTM allows hidden neurons to learn long-term or short-term information through a novel gating mechanism and activation function. In URNNG (Kim et al., 2019b), amortized variational inference is applied between a recurrent neural network grammar (RNNG) (Dyer et al., 2016) decoder and a tree structure inference network, which encourages the decoder to generate reasonable tree structures. DIORA (Drozdov et al., 2019) proposes using inside-outside dynamic programming to compose latent representations from all possible binary trees; the representations of the inside and outside passes over the same sentence are optimized to be close to each other. The compound PCFG (Kim et al., 2019a) achieves grammar induction by maximizing the marginal likelihood of sentences generated by a probabilistic context-free grammar (PCFG). Tree Transformer (Wang et al., 2019) adds extra locality constraints to the Transformer encoder's self-attention to encourage the attention heads to follow a tree structure, such that each token can only attend to nearby neighbors in lower layers and gradually extends its attention field to farther tokens in higher layers. Neural L-PCFG (Zhu et al., 2020) demonstrated that PCFGs can benefit from modeling lexical dependencies. Similar to StructFormer, the Neural L-PCFG induces both constituents and dependencies within a single model.
Though large-scale pre-trained models have dominated most natural language processing tasks, some recent work indicates that neural network models can see accuracy gains by leveraging syntactic information rather than ignoring it (Marcheggiani and Titov, 2017; Strubell et al., 2018). Strubell et al. (2018) introduce syntactically-informed self-attention that forces one attention head to attend to the syntactic governor of the input token. Omote et al. (2019) and Deguchi et al. (2019) argue that dependency-informed self-attention can improve the Transformer's performance on machine translation. Kuncoro et al. (2020) show that syntactic biases help large-scale pre-trained models, like BERT, achieve better language understanding.

Syntactic Distance and Height
In this section, we first reintroduce the concepts of syntactic distance and height, then discuss their relations in the context of StructFormer.

Syntactic Distance
Syntactic distance is proposed in Shen et al. (2018b) to quantify the process of splitting sentences into smaller constituents.
Definition 3.1. Let T be a constituency tree for sentence (w_1, ..., w_n), and let τ̃_i denote the height of the lowest common ancestor of consecutive words w_i and w_{i+1}. The syntactic distances T = (τ_1, ..., τ_{n−1}) of T are any sequence of n − 1 real scalars that share the same rank as (τ̃_1, ..., τ̃_{n−1}).
In other words, each syntactic distance τ_i is associated with a split point (i, i + 1) and specifies the relative order in which the sentence is split into smaller components. Thus, any sequence of n − 1 real values maps unambiguously to an unlabeled binary constituency tree with n leaves through Algorithm 1 (Shen et al., 2018b). As Shen et al. (2018a,c) and Wang et al. (2019) pointed out, syntactic distance reflects the information communication between constituents. More concretely, a large syntactic distance τ_i indicates that short-term or local information should not be communicated between (x_{≤i}) and (x_{>i}). Combined with an appropriate neural network architecture, we can leverage this property to build unsupervised dependency parsing models.
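As a concrete illustration, the mapping from distances to an unlabeled binary tree (Algorithm 1 of Shen et al., 2018b) amounts to recursively splitting at the largest distance; the function name and tuple representation below are ours, for illustration only:

```python
def distances_to_tree(words, dists):
    """Build an unlabeled binary tree from syntactic distances.

    words: list of n tokens; dists: list of n - 1 distances, where
    dists[k] scores the split point between words[k] and words[k + 1].
    The sentence is split recursively at the largest distance.
    """
    if len(words) == 1:
        return words[0]
    # split at the largest syntactic distance
    k = max(range(len(dists)), key=lambda j: dists[j])
    left = distances_to_tree(words[:k + 1], dists[:k])
    right = distances_to_tree(words[k + 1:], dists[k + 1:])
    return (left, right)
```

For example, distances (2, 1) for "I like cats" split first between I and like, yielding the tree (I (like cats)).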

Syntactic Height
Syntactic height is proposed in Luo et al. (2019), where it is used to capture the centrality of a token in the dependency graph: a word with large syntactic height is close to the root node. In this paper, to match the definition of syntactic distance, we redefine syntactic height as: Definition 3.2. Let D be a dependency graph for sentence (w_1, ..., w_n), and let δ̃_i denote the height of token w_i in D. The syntactic heights of D are any sequence of n real scalars ∆ = (δ_1, ..., δ_n) that share the same rank as (δ̃_1, ..., δ̃_n).
Although syntactic height is defined based on the dependency structure, we cannot rebuild the original dependency structure from syntactic heights alone, since they carry no information about whether a token attaches to its left or to its right. However, given an unlabeled constituency tree, we can convert it into a dependency graph with the help of syntactic distance. The conversion process is similar to the standard process of converting a constituency treebank to a dependency treebank (Gelbukh et al., 2005). Instead of using constituent labels and POS tags to identify the parent of each constituent, we simply take the token with the largest syntactic height as the parent of each constituent. The conversion algorithm is described in Algorithm 2. In Appendix A.1, we also propose a joint algorithm that takes T and ∆ as inputs and jointly outputs a constituency tree and a dependency graph.
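A minimal sketch of this conversion, in the spirit of Algorithm 2 and the joint algorithm of Appendix A.1 (names and details below are ours): recursively split at the largest distance and, for each constituent, attach the head of the lower-height side to the head of the higher-height side:

```python
def induce_dependencies(dists, heights):
    """Jointly derive dependency arcs from distances and heights.

    dists: n - 1 syntactic distances; heights: n syntactic heights.
    Returns (root index, list of (dependent, parent) arcs).
    """
    n = len(heights)
    arcs = []

    def head_of(lo, hi):  # inclusive token span [lo, hi]
        if lo == hi:
            return lo
        # split the span at its largest internal distance
        k = max(range(lo, hi), key=lambda j: dists[j])
        lh, rh = head_of(lo, k), head_of(k + 1, hi)
        # the token with the larger height heads the constituent
        head, dep = (lh, rh) if heights[lh] >= heights[rh] else (rh, lh)
        arcs.append((dep, head))
        return head

    return head_of(0, n - 1), arcs
```

For "I like cats" with T = (2, 1) and ∆ = (1, 3, 2), like (index 1) heads both constituents, so I and cats both attach to it and like is the root.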

The relation between Syntactic Distance and Height
As discussed previously, the syntactic distance controls information communication between the two sides of a split point, while the syntactic height quantifies the centrality of each token in the dependency graph. A token with large syntactic height tends to have more long-term dependency relations connecting different parts of the sentence. In StructFormer, we quantify syntactic distance and height on the same scale: given a split point (i, i + 1) and its syntactic distance τ_i, only tokens x_j with δ_j > τ_i can attend across the split point (i, i + 1). Thus tokens with small syntactic height are limited to attending to nearby tokens. Figure 2 provides an example of T, ∆ and the respective dependency graph D.

Figure 3: The architecture of StructFormer. (a) Model architecture; (b) parsing network. The parser takes the shared word embeddings as input and outputs syntactic distances T, syntactic heights ∆, and dependency distributions between tokens. The transformer layers take the word embeddings and dependency distributions as input and output contextualized embeddings for the input words.
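The hard-threshold version of this rule determines the span each token may attend over; a small sketch (function name ours), where a token crosses split point (k, k + 1) only if its height exceeds that split's distance:

```python
def attention_range(heights, dists, i):
    """Inclusive span (lo, hi) of tokens that x_i may reach: x_i
    crosses split point (k, k + 1) only if heights[i] > dists[k]."""
    n = len(heights)
    lo = i
    while lo > 0 and heights[i] > dists[lo - 1]:
        lo -= 1
    hi = i
    while hi < n - 1 and heights[i] > dists[hi]:
        hi += 1
    return lo, hi
```

With ∆ = (1, 3, 2) and T = (0.5, 1.5) for "I like cats", the low-height token I only reaches the span (0, 1), while like, with the largest height, reaches the whole sentence.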
However, if the syntactic distances at the left and right boundaries of a constituent [l, r] are too large, all words in [l, r] will be forced to attend only to other words in [l, r], and their contextual embeddings will not be able to encode the full context. To avoid this phenomenon, we propose calibrating T according to ∆ in Appendix A.2.

StructFormer
In this section, we present the StructFormer model. Figure 3a shows the architecture of StructFormer, which includes a parsing network and a Transformer module. The parsing network predicts T and ∆, then passes them through a set of differentiable functions to generate dependency distributions. The Transformer module takes these distributions and the sentence as input and computes a contextualized embedding for each position. StructFormer can be trained in an end-to-end fashion on a masked language modeling task; in this setting, the gradient back-propagates through the relation distributions into the parsing network.

Parsing Network
As shown in Figure 3b, the parsing network takes word embeddings as input and feeds them into a stack of convolution layers:

s_{l,i} = tanh(Conv(s_{l−1,i−W}, ..., s_{l−1,i+W}))    (1)

where s_{l,i} is the output of the l-th layer at the i-th position, s_{0,i} is the input embedding of token w_i, and 2W + 1 is the convolution kernel size.
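The convolution above can be sketched with a plain sliding-window implementation (NumPy; shapes and names are ours for illustration):

```python
import numpy as np

def conv_layer(s_prev, weight, bias, w):
    """One parsing-network layer: each output position applies tanh to
    a linear map of the 2w + 1 neighbouring input vectors (zero-padded
    at the sentence boundaries).

    s_prev: (n, d_in); weight: ((2w + 1) * d_in, d_out); bias: (d_out,)
    """
    n, d_in = s_prev.shape
    padded = np.pad(s_prev, ((w, w), (0, 0)))
    # gather the flattened window around each position i
    windows = np.stack([padded[i:i + 2 * w + 1].reshape(-1)
                        for i in range(n)])
    return np.tanh(windows @ weight + bias)
```

Stacking three such layers with w = 1 reproduces the L_p = 3 convolution stack used in the experiments, up to the exact padding convention.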
Given the output of the convolution stack s_{N,i}, we parameterize the syntactic distance T with a feedforward network over each pair of neighbouring positions:

τ_i = W_2 tanh(W_1 [s_{N,i}; s_{N,i+1}] + b_1) + b_2    (2)

where τ_i is the contextualized distance for the i-th split point, between tokens w_i and w_{i+1}. The syntactic height ∆ is parameterized in a similar way:

δ_i = W_4 tanh(W_3 s_{N,i} + b_3) + b_4    (3)
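One plausible realization of this parameterization can be sketched as follows; the two-layer feedforward form and all names below are our assumptions for illustration:

```python
import numpy as np

def predict_distance_height(s, W1, b1, W2, b2, W3, b3, W4, b4):
    """Sketch: predict distances from pairs of neighbouring hidden
    states and heights from single hidden states, each with a small
    two-layer feedforward network (an assumed form).

    s: (n, d) output of the convolution stack.
    Returns tau of shape (n - 1,) and delta of shape (n,).
    """
    pairs = np.concatenate([s[:-1], s[1:]], axis=-1)   # (n - 1, 2d)
    tau = np.tanh(pairs @ W1 + b1) @ W2 + b2           # (n - 1, 1)
    delta = np.tanh(s @ W3 + b3) @ W4 + b4             # (n, 1)
    return tau.squeeze(-1), delta.squeeze(-1)
```

A sentence of n tokens thus yields n − 1 split-point distances and n token heights, matching the definitions in Section 3.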

Estimate the Dependency Distribution
Given T and ∆, we now explain how to estimate the probability p(x_j | x_i) that the j-th token is the parent of the i-th token. The first step is identifying the smallest legal constituent C(x_i) that contains x_i and of which x_i is not the parent. The second step is identifying the parent of that constituent, x_j = Pr(C(x_i)). Given the discussion in Section 3.2, the parent of C(x_i) must be the parent of x_i. Thus, the two stages of identifying the parent of x_i can be formulated as:

Pr(x_i) = Pr(C(x_i))    (4)

In StructFormer, C(x_i) is represented as a constituent [l, r], where l (l ≤ i) is the starting index of C(x_i) and r (r ≥ i) is its ending index.
In a dependency graph, x_i is connected only to its parent and its dependents. This means that x_i has no direct connection to tokens outside C(x_i). In other words, C(x_i) = [l, r] is the smallest constituent that satisfies:

l ≤ i ≤ r,   τ_{l−1} > δ_i,   τ_r > δ_i    (5)

where τ_{l−1} is the first distance before i (scanning backward) that is larger than δ_i, and τ_r is the first distance at or after i (scanning forward) that is larger than δ_i. For example, in Figure 2, δ_4 = 3.5, τ_3 = 4 > δ_4 and τ_8 = ∞ > δ_4, thus C(x_4) = [4, 8]. To make this process differentiable, we treat τ_k as a real value and δ_i as a probability distribution p(δ_i). For simplicity and efficiency of computation, we directly parameterize the cumulative distribution function p(δ_i > τ_k) with the sigmoid function:

p(δ_i > τ_k) = σ((δ̄_i − τ_k) / µ_1)    (6)

where σ is the sigmoid function, δ̄_i is the mean of the distribution p(δ_i), and µ_1 is a learnable temperature term. Thus the probability that the l-th token (l < i) is inside C(x_i) is equal to the probability that δ_i is larger than the maximum distance τ between l and i:

p(l ∈ C(x_i)) = p(δ_i > max(τ_l, ..., τ_{i−1})) = σ((δ̄_i − max(τ_l, ..., τ_{i−1})) / µ_1)    (7)

Then we can compute the probability distribution for the left boundary l:

p(l | i) = p(l ∈ C(x_i)) − p(l − 1 ∈ C(x_i))    (8)

Similarly, we can compute the probability distribution for the right boundary r:

p(r | i) = p(r ∈ C(x_i)) − p(r + 1 ∈ C(x_i))    (9)

The probability distribution for [l, r] = C(x_i) can then be computed as:

p([l, r] | i) = p(l | i) p(r | i)    (10)

Given the probabilities p(j | [l, r]) and p([l, r] | i), we can compute the probability that x_j is the parent of x_i:

p(x_j | x_i) = Σ_{l ≤ i ≤ r} p(j | [l, r]) p([l, r] | i)    (11)

where p(j | [l, r]) softly selects the token with the largest syntactic height inside [l, r], e.g., via a softmax over (δ̄_l, ..., δ̄_r).
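The pipeline above can be sketched in NumPy. The softmax head-selection inside p(j | [l, r]) and the temperature values are illustrative assumptions, and this sketch does not exclude the degenerate self-loop for the root token:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def parent_distribution(tau, delta, mu1=1.0, mu2=1.0):
    """Differentiable estimate of p(x_j | x_i), the probability that
    token j is the parent of token i.

    tau: (n - 1,) syntactic distances; delta: (n,) mean heights.
    p(j | [l, r]) is taken as a softmax over the heights inside
    [l, r] (an illustrative choice, softly picking the highest token).
    """
    n = len(delta)
    P = np.zeros((n, n))
    for i in range(n):
        # CDF-style membership: p(l in C(x_i)) and p(r in C(x_i))
        in_left = np.zeros(n)
        in_right = np.zeros(n)
        in_left[i] = in_right[i] = 1.0
        for l in range(i - 1, -1, -1):
            in_left[l] = sigmoid((delta[i] - tau[l:i].max()) / mu1)
        for r in range(i + 1, n):
            in_right[r] = sigmoid((delta[i] - tau[i:r].max()) / mu1)
        # difference the CDFs to get p(l | i) and p(r | i)
        p_l = np.zeros(n)
        p_r = np.zeros(n)
        for l in range(i + 1):
            p_l[l] = in_left[l] - (in_left[l - 1] if l > 0 else 0.0)
        for r in range(i, n):
            p_r[r] = in_right[r] - (in_right[r + 1] if r < n - 1 else 0.0)
        # p(j | i) = sum over spans of p(j | [l, r]) p(l | i) p(r | i)
        for l in range(i + 1):
            for r in range(i, n):
                w_span = p_l[l] * p_r[r]
                if w_span <= 0.0:
                    continue
                logits = delta[l:r + 1] / mu2
                head = np.exp(logits - logits.max())
                P[i, l:r + 1] += w_span * head / head.sum()
    return P
```

For T = (0.5, 1.5) and ∆ = (1, 3, 2) ("I like cats" with like highest) and small temperatures, both I and cats place most of their parent mass on like, and each row of the returned matrix sums to one.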

Dependency-Constrained Multi-head Self-Attention
The multi-head self-attention in the transformer can be seen as an information propagation mechanism on the complete graph G = (X, E), where the vertex set X contains all n tokens in the sentence and the edge set E contains all possible token pairs (x_i, x_j). StructFormer replaces the complete graph G with a soft dependency graph D = (X, A), where A is an n × n matrix of probabilities: A_{ij} = p_D(j | i) is the probability that the j-th token is the parent of the i-th token. We call these edges directed because each attention head is only allowed to propagate information either from parent to dependent or from dependent to parent. To do so, StructFormer associates each attention head with a probability distribution over the parent and dependent relations:
p_parent, p_dep = softmax(w_parent, w_dep)    (12)

where w_parent and w_dep are learnable parameters associated with each attention head, and p_parent is the probability that this head propagates information from parent to dependent, and vice versa for p_dep. The model learns this association from the downstream task via gradient descent. Then we can compute the probability that information is propagated from node j to node i via this head:

p_{i,j} = p_parent · p_D(j | i) + p_dep · p_D(i | j)    (13)

and gate it with a content-based term:

q_{i,j} = σ(Q_i K_j^T / √d_k)    (14)

where Q and K are the query and key matrices of a standard transformer model and d_k is the dimension of the attention head. This equation is inspired by the scaled dot-product attention in the transformer; we replace the original softmax function with a sigmoid function, so q_{i,j} becomes an independent probability indicating whether x_i should attend to x_j through the current attention head. In the end, we propose to replace the transformer's scaled dot-product attention with our dependency-constrained self-attention, whose weights are proportional to the product of the two terms:

attn_{i,j} ∝ q_{i,j} · p_{i,j}    (15)
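A sketch of one dependency-constrained head in NumPy; the explicit renormalization over positions in the last step is our assumption for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dependency_constrained_head(Q, K, V, P, w_parent, w_dep):
    """One attention head constrained by a soft dependency graph.

    Q, K, V: (n, d_k) query/key/value matrices; P: (n, n) matrix with
    P[i, j] = p_D(j | i), the probability that j is the parent of i.
    """
    # mixture weights over the two relation types for this head
    logits = np.array([w_parent, w_dep])
    exp = np.exp(logits - logits.max())
    p_parent, p_dep = exp / exp.sum()
    # p_{i,j}: probability that this head may propagate j -> i
    prob = p_parent * P + p_dep * P.T
    # content-based gate with sigmoid instead of softmax
    d_k = Q.shape[-1]
    gate = sigmoid(Q @ K.T / np.sqrt(d_k))
    # combine and renormalize over j (an assumed choice)
    attn = gate * prob
    attn = attn / np.maximum(attn.sum(axis=-1, keepdims=True), 1e-9)
    return attn @ V
```

Replacing P with a uniform matrix recovers something close to ordinary (sigmoid-gated) attention, which makes explicit that the dependency distributions act as a learned, differentiable attention mask.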

Experiments
We evaluate the proposed model on three tasks: Masked Language Modeling, Unsupervised Constituency Parsing and Unsupervised Dependency Parsing.
Our implementation of StructFormer is close to the original Transformer encoder (Vaswani et al., 2017), except that we put the layer normalization in front of each layer, similar to the T5 model (Raffel et al., 2019). We found that this modification allows the model to converge faster. For all experiments, we set the number of layers L = 8, the embedding and hidden size d_model = 512, the number of self-attention heads h = 8, the feed-forward size d_ff = 2048, the dropout rate to 0.1, and the number of convolution layers in the parsing network L_p = 3.

Masked Language Model
Masked Language Modeling (MLM) has been widely used as a pre-training objective for large-scale pre-training models. In BERT (Devlin et al., 2018) and RoBERTa, the authors found that MLM perplexities on a held-out evaluation set have a positive correlation with end-task performance. We trained and evaluated our model on two different datasets: the Penn TreeBank (PTB) and BLLIP. In our MLM experiments, each token, including the <unk> token, has an independent chance of being replaced by the mask token <mask>. The training and evaluation objective for masked language modeling is to predict the replaced tokens, and MLM performance is evaluated by measuring perplexity on the masked words.
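The masking scheme described above can be sketched as follows (the function name is ours; a real implementation would additionally batch sequences and index into a vocabulary):

```python
import random

def mask_tokens(tokens, rate, seed=0):
    """Independently replace each token (including <unk>) with <mask>
    with probability `rate`; targets record the original tokens that
    the model must predict at the masked positions."""
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < rate:
            masked.append("<mask>")
            targets.append(tok)   # predicted
        else:
            masked.append(tok)
            targets.append(None)  # not predicted
    return masked, targets
```

The loss (and the reported perplexity) is then computed only over the positions whose target is not None.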
PTB is a standard dataset for language modeling (Mikolov et al., 2012) and unsupervised constituency parsing (Shen et al., 2018c; Kim et al., 2019a). Following the setting proposed in Shen et al. (2018c), we use Mikolov et al. (2012)'s preprocessing, which removes all punctuation and replaces low-frequency tokens with <unk>. The preprocessing results in a vocabulary size of 10001 (including <unk>, <pad> and <mask>). For PTB, we use a 30% mask rate.
BLLIP is a large Penn Treebank-style parsed corpus of approximately 24 million sentences. We train and evaluate StructFormer on three splits of BLLIP: BLLIP-XS (40k sentences, 1M tokens), BLLIP-SM (200k sentences, 5M tokens), and BLLIP-MD (600k sentences, 14M tokens), obtained by randomly sampling sections from the BLLIP 1987-89 Corpus Release 1. All models are tested on a shared held-out test set (20k sentences, 500k tokens). Following the settings provided in Hu et al. (2020), we use the subword-level vocabulary extracted from the GPT-2 pre-trained model rather than one built from the BLLIP training corpora. For BLLIP, we use a 15% mask rate.

The masked language modeling results are shown in Table 1. StructFormer consistently outperforms our Transformer baseline. This result aligns with previous observations that linguistically informed self-attention can help Transformers achieve stronger performance. We also observe that StructFormer converges much faster than the standard Transformer model.

Unsupervised Constituency Parsing
The unsupervised constituency parsing task compares the latent tree structure induced by the model with trees annotated by human experts. We use Algorithm 1 to predict constituency trees from the T predicted by StructFormer. Following the experiment settings proposed in Shen et al. (2018c), we take the model trained on the PTB dataset and evaluate it on the WSJ test set, i.e., section 23 of the WSJ corpus, which contains 2416 sentences labeled by human experts. Punctuation is ignored during evaluation.

Table 2: Unsupervised constituency parsing F1 on the WSJ test set (standard deviations in parentheses):

  ON-LSTM (Shen et al., 2018c)        47.7 (1.5)
  Tree-T (Wang et al., 2019)          49.5
  URNNG (Kim et al., 2019b)           52.4
  C-PCFG (Kim et al., 2019a)          55.2
  Neural L-PCFGs (Zhu et al., 2020)   55.31
  StructFormer                        54.0 (0.3)

While the C-PCFG (Kim et al., 2019a) achieves stronger parsing performance thanks to its strong linguistic constraints (e.g., a finite set of production rules), StructFormer may have a broader domain of application. For example, it can replace the standard transformer encoder in most popular large-scale pre-trained language models (e.g., BERT and RoBERTa) and in transformer-based machine translation models. Unlike the transformer-based Tree-T (Wang et al., 2019), we do not directly use constituents to restrict the self-attention receptive field, yet StructFormer achieves stronger constituency parsing performance. This result may suggest that dependency relations are more suitable for grammar induction in transformer-based models. Table 3, which reports the fraction of ground-truth constituents that were predicted as constituents by the models, broken down by label (i.e., label recall), shows that our model achieves strong accuracy when predicting Noun Phrases (NP), Prepositional Phrases (PP), Adjective Phrases (ADJP), and Adverb Phrases (ADVP).

Unsupervised Dependency Parsing
The unsupervised dependency parsing evaluation compares the induced dependency relations with those in a reference dependency graph. The most common metric is the Unlabeled Attachment Score (UAS), which measures the percentage of tokens correctly attached to their parent in the reference tree.

Table 5: Unsupervised dependency parsing UAS on the WSJ test set (standard deviations in parentheses):

w/o gold POS tags
  (Tu and Honavar, 2012)               46.1
  CS* (Spitkovsky et al., 2013)        64.4*
  Neural E-DMV (Jiang et al., 2016)    42.7
  Gaussian DMV (He et al., 2018)       43.1 (1.2)
  INP (He et al., 2018)                47.9 (1.2)
  Neural L-PCFGs (Zhu et al., 2020)    40.5 (2.9)
  StructFormer                         46.2 (0.4)
w/ gold POS tags (for reference only)
  DMV (Klein and Manning, 2004)        39.7
  UR-A E-DMV (Tu and Honavar, 2012)    57.0
  MaxEnc (Le and Zuidema, 2015)        65.8
  Neural E-DMV (Jiang et al., 2016)    57.6
  CRFAE (Cai et al., 2017)             55.7
  L-NDMV † (Han et al., 2017)          63.2

Comparing with previous unsupervised dependency parsing papers, we noticed that our model sometimes outputs dependency structures that are closer to the CoNLL dependencies. Therefore, we report UAS and UUAS for both Stanford and CoNLL dependencies. Following the setting of previous papers (Jiang et al., 2016), we ignore punctuation during evaluation. To obtain dependency relations from our model, we take the argmax of the dependency distribution:

k = argmax_j p(x_j | x_i)    (16)

and assign the k-th token as the parent of the i-th token. Table 5 shows that our model achieves competitive dependency parsing performance compared to other models that do not require gold POS tags. While most of the baseline models still rely on some kind of latent POS tags or pre-trained word embeddings, StructFormer can be seen as an easy-to-use alternative that works in an end-to-end fashion. Table 4 shows that our model recovers 61.6% of undirected dependency relations. Given the strong performance on both dependency parsing and masked language modeling, we believe that the dependency graph schema could be a viable substitute for the complete graph schema used in the standard transformer. Appendix A.4 provides examples of parent distributions.
Since our model uses a mixture of relation probability distributions for each self-attention head, we also studied how different combinations of relations affect its performance. Table 4 shows that the model achieves the best performance when using both parent and dependent relations. The model suffers more on dependency parsing if the parent relation is removed, and more on constituency parsing if the dependent relation is removed. Appendix A.3 shows the weights for parent and dependent relations learnt from the MLM task. It is interesting to observe that StructFormer tends to focus on the parent relation in the first layer and starts to use both relations from the second layer onward.

Conclusion
In this paper, we introduce a novel joint dependency and constituency parsing framework. Based on this framework, we propose StructFormer, a new unsupervised parsing algorithm that performs unsupervised dependency and constituency parsing at the same time. We also introduce a novel dependency-constrained self-attention mechanism that allows each attention head to focus on a specific mixture of dependency relations, bringing Transformers closer to modeling a directed dependency graph. The experiments show promising results: StructFormer can induce meaningful dependency and constituency structures and achieve better performance on masked language modeling tasks. This research provides a new path toward building more linguistic bias into pre-trained language models.

Table 6: The performance of StructFormer on the PTB dataset with different mask rates. Dependency parsing is especially affected by the masks; a mask rate of 0.3 provides the best and most stable performance.