Improved Latent Tree Induction with Distant Supervision via Span Constraints

For over thirty years, researchers have developed and analyzed methods for latent tree induction as an approach for unsupervised syntactic parsing. Nonetheless, modern systems still do not perform well enough compared to their supervised counterparts to have any practical use as structural annotation of text. In this work, we present a technique that uses distant supervision in the form of span constraints (i.e. phrase bracketing) to improve performance in unsupervised constituency parsing. Using a relatively small number of span constraints we can substantially improve the output from DIORA, an already competitive unsupervised parsing system. Compared with full parse tree annotation, span constraints can be acquired with minimal effort, such as with a lexicon derived from Wikipedia, to find exact text matches. Our experiments show span constraints based on entities improve constituency parsing on the English WSJ Penn Treebank by more than 5 F1. Furthermore, our method extends to any domain where span constraints are easily attainable, and as a case study we demonstrate its effectiveness by parsing biomedical text from the CRAFT dataset.


Introduction
Syntactic parse trees are helpful for various downstream tasks such as speech recognition (Moore et al., 1995), machine translation (Akoury et al., 2019), paraphrase generation (Iyyer et al., 2018), semantic parsing (Xu et al., 2020), and information extraction (Naradowsky, 2014). While supervised syntactic parsers are state-of-the-art models for creating these parse trees, their performance does not transfer well across domains. Moreover, new syntactic annotations are prohibitively expensive; the original Penn Treebank required eight years of annotation (Taylor et al., 2003), and expanding PTB annotation to a new domain can be a large endeavor. For example, the 20k sentences of biomedical treebanking in the CRAFT corpus required 80 annotator hours per week for 2.5 years, including 6 months of annotator training (Verspoor et al., 2011). However, although many domains and many languages lack full treebanks, they often have access to other annotated resources, such as NER, whose spans might provide some partial syntactic supervision.

* Equal contribution.

Figure 1: An example sentence and parse to illustrate distant supervision via span constraints. Top: The unsupervised parser predicts a parse tree, but due to natural ambiguity in the text the prediction crosses with a known constraint. Bottom: By incorporating the span constraint, the prediction improves and, as a result, recovers the ground truth parse tree. In our experiments, we both inject span constraints directly into parse tree decoding and separately use the constraints only for distant supervision at training time. We find the latter approach is typically more effective.
We explore whether unsupervised parsing methods can be enhanced with distant supervision from such spans to enable the types of benefits afforded by supervised syntactic parsers without the need for expensive syntactic annotations.
We aim to "bridge the gap" between supervised and unsupervised parsing with distant supervision through span constraints. These span constraints indicate that a certain sequence of words in a sentence forms a constituent span in its parse tree, and we obtain these partial ground-truths without explicit user annotation. We take inspiration from previous work incorporating distant supervision into parsing (Haghighi and Klein, 2006; Finkel and Manning, 2009; Ganchev et al., 2010; Cao et al., 2020), and design a novel fully neural system that improves a competitive neural unsupervised parser (DIORA; Drozdov et al. 2019) using span constraints defined on a portion of the training data. In the large majority of cases, the number of span constraints per sentence is much lower than the number specified by a full parse tree. We find that entity spans are effective as constraints, and can readily be acquired from existing data or derived from a gazetteer.
In our experiments, we use DIORA as our baseline and improve upon it by injecting these span constraints as a source of distant supervision. We introduce a new method for training DIORA that leverages the structured SVM loss often used in supervised constituency parsing (Stern et al., 2017;Kitaev and Klein, 2018), but only depends on partial structure. We refer to this method as partially structured SVM (PS-SVM). Our experiments indicate PS-SVM improves upon unsupervised parsing performance as the model adjusts its prediction to incorporate span constraints (depicted in Figure 1). Using ground-truth entities from Ontonotes (Pradhan et al., 2012) as constraints, we achieve more than 5 F1 improvement over DIORA when parsing English WSJ Penn Treebank (Marcus et al., 1993). Using automatically extracted span constraints from an entity-based lexicon (i.e. gazetteer) is an easy alternative to ground truth annotation and gives 2 F1 improvement over DIORA. Importantly, training DIORA with PS-SVM is more effective than simply injecting available constraints into parse tree decoding at test time. We also conduct experiments with different types of span constraints. Our detailed analysis shows that entity-based constraints are similarly useful as the same number of ground truth NP constituent constraints. Finally, we show that DIORA and PS-SVM are helpful for parsing biomedical text, a domain where full parse tree annotation is particularly expensive.

Background: DIORA
The Deep Inside-Outside Recursive Autoencoder (DIORA; Drozdov et al., 2019) is an extension of tree recursive neural networks (TreeRNNs) that does not require pre-defined tree structure. It depends on two primitives, $\mathrm{Compose}: \mathbb{R}^{2D} \to \mathbb{R}^{D}$ and $\mathrm{Score}: \mathbb{R}^{2D} \to \mathbb{R}$. DIORA is bi-directional: the inside pass builds phrase vectors and the outside pass builds context vectors. DIORA is trained by predicting words from their context vectors, and has been effective as an unsupervised parser by extracting parse trees from the values computed during the inside pass.
Inside-Outside Typically, a TreeRNN would follow a parse tree to continually compose words or phrases until the entire sentence is represented as a vector, but this requires knowing the tree structure or using some trivial structure such as a balanced binary tree. Instead of using a single structure, DIORA encodes all possible binary trees using a soft weighting determined by the output of the score function. There are a combinatorial number of valid parse trees for a given sentence, so it would be infeasible to encode each of them separately. Instead, DIORA decomposes the problem of representing all valid parse trees by encoding all subtrees over a span into a single phrase vector. Each phrase vector is computed in the inside pass by aggregating over split points $k$:

$\bar{h}_{i,j}^{k} = \mathrm{Compose}(h_{i,k}, h_{k,j}), \quad \bar{s}_{i,j}^{k} = \mathrm{Score}(h_{i,k}, h_{k,j}) + s_{i,k} + s_{k,j}$

$h_{i,j} = \sum_{k} e_{i,j}^{k} \, \bar{h}_{i,j}^{k}, \quad s_{i,j} = \sum_{k} e_{i,j}^{k} \, \bar{s}_{i,j}^{k}, \quad e_{i,j}^{k} = \mathrm{softmax}_{k}\big(\bar{s}_{i,j}^{k}\big)$

The outside pass is computed in a similar way, combining each span's parent outside vector with its sibling inside vector, where an indicator $\mathbb{1}_{j<k}$ is 1 when the sibling span is on the right and 0 otherwise (see Figure 2 in Drozdov et al., 2020 for a helpful visualization of the inside and outside pass).
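The soft-weighted inside pass can be sketched concretely. The following is an illustrative simplification, not the authors' implementation: `compose` and `score` are hypothetical stand-ins for DIORA's learned Compose and Score primitives, passed in as plain functions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def inside_pass(leaf_vecs, compose, score):
    """Soft-weighted inside pass over all binary trees.

    h[(i, j)] is the inside vector for span (i, j); s[(i, j)] its inside
    score. Each span aggregates over all split points k, weighted by a
    softmax over split scores, so every binary tree contributes to the
    chart without being enumerated explicitly."""
    n = len(leaf_vecs)
    h, s = {}, {}
    for i in range(n):
        h[(i, i + 1)] = leaf_vecs[i]
        s[(i, i + 1)] = 0.0
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            splits = list(range(i + 1, j))
            # candidate vector and score for each way of splitting (i, j)
            cand_h = [compose(h[(i, k)], h[(k, j)]) for k in splits]
            cand_s = [score(h[(i, k)], h[(k, j)]) + s[(i, k)] + s[(k, j)]
                      for k in splits]
            w = softmax(np.array(cand_s))
            h[(i, j)] = sum(wk * hk for wk, hk in zip(w, cand_h))
            s[(i, j)] = float(np.dot(w, cand_s))
    return h, s
```

With toy primitives (e.g. `compose` as an average and `score` as a dot product), the chart fills in $O(n^3)$ calls, mirroring the inside recursion above.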
Training DIORA is trained end-to-end directly from raw text and without any parse tree supervision. In our work, we use the same reconstruction objective as in Drozdov et al. (2019). For a sentence $x$, we optimize the probability of the $i$-th word $x_i$ given its context $x_{\setminus i}$:

$J_{rec}(x) = -\sum_{i} \log P(x_i \mid x_{\setminus i})$

where $P(\cdot)$ is computed using a softmax layer over a fixed vocabulary with the outside vector $h^{out}_{i,i}$ as input.
Parsing DIORA has primarily been used as an unsupervised parser. This requires defining a new primitive, $\mathrm{TreeScore}: S(y) = \sum_{(i,j,k) \in y} s^{in}_{i,j,k}$. A tree $y^* = \arg\max_y S(y)$ can be extracted from DIORA by solving this search problem, which can be done efficiently with the CKY algorithm (Kasami, 1965; Younger, 1967).
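The CKY extraction step can be illustrated with a sketch over span-level scores. This is a simplification: DIORA's chart scores also depend on the split point $k$, which the hypothetical `span_scores` dict here flattens to one score per span.

```python
def cky(span_scores, n):
    """Extract the highest-scoring binary tree over n leaves.

    span_scores[(i, j)] plays the role of DIORA's inside score for span
    (i, j); the returned tree is the list of internal spans maximizing
    S(y), found by dynamic programming over split points."""
    best, split = {}, {}
    for i in range(n):
        best[(i, i + 1)] = span_scores.get((i, i + 1), 0.0)
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            k_best, s_best = None, float("-inf")
            for k in range(i + 1, j):
                s = best[(i, k)] + best[(k, j)]
                if s > s_best:
                    k_best, s_best = k, s
            best[(i, j)] = span_scores.get((i, j), 0.0) + s_best
            split[(i, j)] = k_best
    def spans(i, j):
        if j - i == 1:
            return []
        k = split[(i, j)]
        return [(i, j)] + spans(i, k) + spans(k, j)
    return spans(0, n), best[(0, n)]
```

For example, with scores favoring span (0, 2) over (1, 3) in a three-word sentence, the recovered tree brackets the first two words together.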

Injecting Span Constraints to DIORA
In this section, we present a method to improve parsing performance by training DIORA such that trees extracted through CKY are more likely to contain known span constraints.

Test-time injection: Constrained CKY
One option to improve upon CKY is to simply find span constraints and then use a constrained version of CKY (CCKY):

$y^* = \arg\max_y \big[ S(y) + \epsilon \cdot g(y, z) \big]$

where $z$ is a set of known span constraints for $x$, $g(y, z)$ measures how well the span constraints are satisfied in $y$, i.e. $g(y, z) = \sum_{i=0}^{|z|-1} \mathbb{1}(z_i \in y)$, and $\epsilon$ is an importance weight for the span constraints that guarantees the highest scoring trees are the ones that satisfy the most constraints.¹ Using CCKY rather than CKY typically gives a small boost to parsing performance, but has several downsides described in the remainder of this subsection.

¹ To save space, we exclude $\epsilon$ hereafter.
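CCKY amounts to adding a large bonus to every constrained span before running ordinary CKY. The sketch below is illustrative, not the authors' implementation; `eps` stands in for the importance weight $\epsilon$, and 1000.0 is an arbitrary value chosen to dominate the span scores.

```python
def constrained_cky(span_scores, n, constraints, eps=1000.0):
    """Constrained CKY (CCKY): argmax_y [S(y) + eps * g(y, z)].

    Adding eps to each constrained span's score makes the dynamic program
    prefer trees satisfying the most constraints, since eps outweighs any
    difference in ordinary span scores."""
    adj = dict(span_scores)
    for span in constraints:
        adj[span] = adj.get(span, 0.0) + eps
    best, split = {}, {}
    for i in range(n):
        best[(i, i + 1)] = adj.get((i, i + 1), 0.0)
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            k_best, s_best = None, float("-inf")
            for k in range(i + 1, j):
                s = best[(i, k)] + best[(k, j)]
                if s > s_best:
                    k_best, s_best = k, s
            best[(i, j)] = adj.get((i, j), 0.0) + s_best
            split[(i, j)] = k_best
    def spans(i, j):
        if j - i == 1:
            return []
        k = split[(i, j)]
        return [(i, j)] + spans(i, k) + spans(k, j)
    return spans(0, n)
```

Even when the model's scores prefer a different bracketing, a single constraint is enough to flip the decoded tree, which previews the overfitting concern discussed next.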
Can overfit to constraints DIORA learns to assign weights to the trees that are most helpful for word prediction. For this reason, it is logical to use the weights to find the highest scoring tree. With CCKY, we can find the highest scoring tree that also satisfies the constraints, but this tree could be very different from the original output. Ideally, we would like a method that can incorporate span constraints in a productive way that is not detrimental to the rest of the structure.
Only benefits sentences with constraints If we are dependent on constraints for CCKY, then only sentences that have said constraints will receive any benefit. Ideally, we would like an approach where even sentences without constraints could receive some improvement.
Constraints are required at test time If we are dependent on constraints for CCKY, then we need to find constraints for every sentence at test time. Ideally, we would like an approach where constraints are only needed at the time of training.
Noisy constraints Occasionally a constraint disagrees with a comparable constituency parse tree. In these cases, we would like to have an approach where the model can choose to include only the most beneficial constraints.

Distant Supervision: Partially Structured SVM
To address the weaknesses of CCKY we present a new training method for DIORA called Partially Structured SVM (PS-SVM). This training objective incorporates constraints during training to improve parsing and addresses the aforementioned weaknesses of constrained CKY. PS-SVM follows these steps:

1. Find a negative tree $y^-$, such as the highest scoring tree predicted by the model: $y^- = \arg\max_y S(y)$.

2. Find a positive tree $y^+$, such as the highest scoring tree that satisfies known constraints: $y^+ = \arg\max_y [S(y) + g(y, z)]$.

3. Use the structured SVM with fixed margin to learn to include constraints in the output: $J_{PS} = \alpha \cdot \max(0,\ 1 + S(y^-) - S(y^+))$.

Table 1: Multiple variants of the Partially Structured SVM (PS-SVM) loss, $J_{PS} = \alpha \cdot \max(0, 1 + S(y^-) - S(y^+))$, where $z$ denotes constraint spans and $g(y, z) = \sum_{i=0}^{|z|-1} \mathbb{1}(z_i \in y)$.

NCBL: $\alpha = 1$; $y^- = \arg\max_y S(y)$; $y^+ = \arg\max_y [S(y) + g(y, z)]$
MIN DIFFERENCE: $\alpha = 1$; $y^- = \arg\max_y S(y)$; $y^+ = \arg\max_y [S(y) + g(y, z) + g(y, y^-)]$
RESCALE: $\alpha = g(y^+, y^-)$; $y^- = \arg\max_y S(y)$; $y^+ = \arg\max_y [S(y) + g(y, z)]$
STRUCTURED RAMP: $\alpha = 1$; $y^- = \arg\max_y [S(y) - g(y, z)]$; $y^+ = \arg\max_y [S(y) + g(y, z)]$

Variants of Partially Structured SVM
The most straightforward application of PS-SVM assigns $y^+$ to be the highest scoring tree that also incorporates known constraints, and we call this NAIVE CONSTRAINT-BASED LEARNING (NCBL). The shortcomings of NCBL are similar to those of CCKY: $y^+$ may be drastically different from the initial prediction $y^-$ and the model may overfit to the constraints. With this in mind, an alternative to NCBL is to find a $y^+$ that is high scoring, satisfies the constraints, and has the minimal number of differences with respect to $y^-$. We refer to this approach as MIN DIFFERENCE.
The MIN DIFFERENCE approach gives substantial weight to the initial prediction $y^-$, which may be helpful for avoiding overfitting to the constraints, but simultaneously is very restrictive on the region of positive trees. In other constraint-based objectives for structured prediction, such as gradient-based inference (Lee et al., 2019), the agreement with constraints is incorporated as a scaling penalty on the gradient step size rather than as an explicit restriction on the search space of positive examples. Inspired by this, we define another alternative to NCBL called RESCALE that scales the step size based on the difference between $y^+$ and $y^-$. If the structures are very different, then only a small step is taken, in order to both prevent overfitting to the constraints and allow for sufficient exploration.
For margin-based learning, a technique known as loss-augmented inference stabilizes optimization by assigning $y^-$ to be the highest scoring and most offending example with respect to the ground truth. When a full structure is not available, an alternative is to assign $y^+$ the highest scoring prediction that satisfies the provided partial structure. This approach is called STRUCTURED RAMP loss (Chapelle et al., 2009; Gimpel and Smith, 2012; Shi et al., 2021).
In Table 1 we define the four variants of PS-SVM. Variants that do not use loss-augmented inference have zero gradient when $y^-$ contains all constraints.
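The four variants can be made concrete with a brute-force sketch over toy sentences. This is illustrative only, not the authors' implementation: the real system obtains $y^-$ and $y^+$ with CKY and constrained CKY over DIORA's inside chart, while here we enumerate all binary trees (feasible only for tiny $n$) and use a hypothetical per-span score table for $S$.

```python
def all_trees(i, j):
    """Enumerate every binary tree over span (i, j) as a set of internal spans."""
    if j - i == 1:
        return [frozenset()]
    trees = []
    for k in range(i + 1, j):
        for left in all_trees(i, k):
            for right in all_trees(k, j):
                trees.append(frozenset({(i, j)}) | left | right)
    return trees

def g(y, z):
    """Number of constraint spans in z that appear in tree y."""
    return sum(1 for span in z if span in y)

def ps_svm_loss(span_scores, n, z, variant="NCBL"):
    """J_PS = alpha * max(0, 1 + S(y-) - S(y+)) for the Table 1 variants."""
    S = lambda y: sum(span_scores.get(span, 0.0) for span in y)
    trees = all_trees(0, n)
    if variant == "STRUCTURED_RAMP":
        y_neg = max(trees, key=lambda y: S(y) - g(y, z))
    else:
        y_neg = max(trees, key=S)
    if variant == "MIN_DIFFERENCE":
        y_pos = max(trees, key=lambda y: S(y) + g(y, z) + g(y, y_neg))
    else:
        y_pos = max(trees, key=lambda y: S(y) + g(y, z))
    alpha = g(y_pos, y_neg) if variant == "RESCALE" else 1.0
    return alpha * max(0.0, 1.0 + S(y_neg) - S(y_pos))
```

When the constraint set disagrees with the model's preferred bracketing, the hinge is active and the loss pushes $S(y^+)$ above $S(y^-)$; when $y^-$ already satisfies every constraint, $y^+ = y^-$ for the non-loss-augmented variants and the gradient is zero, as noted above.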

Experimental Setup
In this section, we provide details on data pre-processing, running experiments, and evaluating model predictions. In addition, code to reproduce our experiments and the model checkpoints are available on GitHub.

Training Data and Pre-processing
We train our system in various settings to verify the effectiveness of PS-SVM with span constraints. In all cases, we require access to a text corpus with span constraints.

Ontonotes (CoNLL 2012; Pradhan et al. 2012) consists of ground truth named entity and constituency parse tree labels. In our main experiment (see Table 2), we use the 57,757 ground truth entities from the training data as span constraints.
WSJ Penn Treebank (Marcus et al., 1993) consists of ground truth constituency parse tree labels. It is an often-used benchmark for both supervised and unsupervised constituency parsing in English. We also derive synthetic constraints using the ground truth constituents from this data.
MedMentions (Mohan and Li, 2019) is a collection of PubMed abstracts annotated with UMLS concepts, which is helpful as training data for the biomedical domain. For training we use only the raw text, to assist with domain adaptation. We tokenize the text using scispacy.
The Colorado Richly Annotated Full Text (CRAFT) corpus (Cohen et al., 2017) consists of biomedical journal articles that have been annotated with both entity and constituency parse labels. We use CRAFT both for training (with 18,448 entity spans) and for evaluating our model's performance in the biomedical domain. We sample 3k sentences of the training data to use for validation.

Automatically extracted constraints
We experiment with two settings where span constraints are automatically extracted from the training corpus using dictionary lookup in a lexicon. These settings simulate a real world setting where full parse tree annotation is not available, but partial span constraints are readily available.

PMI Constraints
We use the phrases defined in the vocab from Mikolov et al. (2013) as a lexicon, treating exact matches found in Ontonotes as constraints. The phrases are learned through word statistics by applying pointwise mutual information (PMI) to find relevant bi-grams, then replacing these bi-grams with a new special token representing the phrase; applied multiple times, this technique finds arbitrarily long phrases.
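The iterative merging procedure can be sketched as follows. This is a minimal illustration in the spirit of Mikolov et al. (2013), not their released code; the `threshold`, `delta`, and `passes` values are arbitrary stand-ins, and merged tokens are joined with an underscore.

```python
from collections import Counter

def find_phrases(sentences, threshold=0.1, delta=1, passes=2):
    """Merge high-scoring bigrams into phrase tokens, repeatedly.

    score(a, b) = (count(a, b) - delta) / (count(a) * count(b)), a
    discounted PMI-style statistic; bigrams above the threshold are
    merged greedily left-to-right, and each pass can extend phrases
    found in earlier passes into longer ones."""
    for _ in range(passes):
        unigrams = Counter(w for s in sentences for w in s)
        bigrams = Counter((a, b) for s in sentences for a, b in zip(s, s[1:]))
        def score(a, b):
            return (bigrams[(a, b)] - delta) / (unigrams[a] * unigrams[b])
        merged = []
        for s in sentences:
            out, i = [], 0
            while i < len(s):
                if i + 1 < len(s) and score(s[i], s[i + 1]) > threshold:
                    out.append(s[i] + "_" + s[i + 1])
                    i += 2
                else:
                    out.append(s[i])
                    i += 1
            merged.append(out)
        sentences = merged
    return sentences
```

Token spans that end up inside a merged phrase token are then treated as span constraints wherever the phrase occurs in the training corpus.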
Gazetteer A lexicon containing entity names is often called a gazetteer. We use a list of 1.5 million entity names automatically extracted from Wikipedia (Ratinov and Roth, 2009), which has been effective for supervised entity-centric tasks with both log-linear and neural models (Liu et al., 2019a). We derive constraints by finding exact matches in the Ontonotes corpus that are in the gazetteer.
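Exact-match constraint extraction can be sketched as below. This is an illustrative helper under simplifying assumptions (the gazetteer is pre-tokenized, matching is case-sensitive, and longer matches suppress spans nested inside them), not the paper's exact matching code.

```python
def gazetteer_constraints(tokens, gazetteer):
    """Derive (i, j) token-span constraints by exact gazetteer match.

    Scans longest entries first and skips any candidate span already
    nested inside a matched span, so overlapping matches yield the
    maximal bracketing."""
    entries = {tuple(e) for e in gazetteer}
    max_len = max((len(e) for e in entries), default=0)
    spans = []
    for length in range(max_len, 0, -1):
        for i in range(len(tokens) - length + 1):
            j = i + length
            if tuple(tokens[i:j]) in entries and \
               not any(a <= i and j <= b for a, b in spans):
                spans.append((i, j))
    return sorted(spans)
```

A production version would likely add lowercasing and approximate matching, which, as discussed later, we suspect would yield more hits.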

Training Details
In all cases, we initialize our model's parameters from pre-trained DIORA (Drozdov et al., 2019). We then continue training using a combination of the reconstruction and PS-SVM losses. Given sentence $x$ and constraints $z$, the instance loss is:

$J(x, z) = J_{rec}(x) + J_{PS}(x, z)$

For the newswire domain, we train for a maximum of 40 epochs on Ontonotes using 6 random seeds and grid search, taking the best model in each setting according to parsing F1 on the PTB validation set. For biomedical text, since it is a shift in domain from the DIORA pre-training, we first train for 20 epochs using a concatenation of MedMentions and CRAFT data with only the reconstruction loss⁵ (called DIORA_ft for "fine-tune"). Then, we train for 40 epochs as above, using performance on a subset of 3k random sentences from the CRAFT training data for early stopping. Hyperparameters are in Appendix A.2.

⁵ Training jointly with MedMentions and CRAFT is a special case of "intermediate fine-tuning" (Phang et al., 2018).

Evaluation
In all cases, we report parsing F1 aggregated at the sentence level: F1 is computed separately for each sentence and then averaged across the dataset. To be consistent with prior work, punctuation is removed prior to evaluation and F1 is computed using the eval script provided by Shen et al. (2018). In Tables 2, 3, and 4 we average performance across random seeds and report the standard deviation.
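Sentence-level F1 can be sketched over unlabeled constituent spans. This is a minimal illustration of the metric as described above, not the Shen et al. (2018) script; it assumes the caller has already removed punctuation and any trivial spans excluded by convention.

```python
def sentence_f1(pred_spans, gold_spans):
    """Unlabeled precision/recall F1 over constituent spans for one sentence."""
    pred, gold = set(pred_spans), set(gold_spans)
    if not pred and not gold:
        return 1.0
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r > 0 else 0.0

def corpus_f1(all_pred, all_gold):
    """Sentence-level aggregation: score each sentence, then average."""
    scores = [sentence_f1(p, g) for p, g in zip(all_pred, all_gold)]
    return sum(scores) / len(scores)
```

Note that averaging per-sentence F1 weights short and long sentences equally, unlike corpus-level F1 computed from pooled counts.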
Baselines In Table 2, we compare parsing F1 with four general purpose unsupervised parsing models that are trained directly from raw text. We also compare with Cao et al. (2020) that uses a small amount of supervision to generate constituency tests used for training -their model has substantially more parameters than our other baselines and is based on RoBERTa (Liu et al., 2019b).

Results and Discussion
In our experiments and analysis we aim to address several research questions about incorporating span constraints for the task of unsupervised parsing.

Is Constrained CKY sufficient?
A natural idea is to constrain the output of DIORA to contain any span constraints (§3.1). We expect this type of hard constraint to be ineffective for various reasons: 1) the model is not trained to include constraints, so any predictions that force their inclusion are inherently noisy; 2) similar to (1), some constraints are not informative and may disagree with the desired downstream task and the model's reconstruction loss; and 3) constraints are required at test time, and only sentences with constraints can benefit. We address these weaknesses by training our model to include the span constraints in its output using PS-SVM. This can be considered a soft way to include the constraints, and it has other benefits, including the following: 1) the model implicitly learns to ignore constraints that are not useful; 2) constraints are not necessary at test time; and 3) the model improves performance even on sentences that did not have constraints.
The effectiveness of our approach is visible in Table 2 where we use ground truth entity boundaries as constraints. CCKY slightly improves upon DIORA, but our PS-SVM approach has a more substantial impact. We experiment with four variants of PS-SVM (described in §3.3) -RESCALE is most effective, and throughout this text this is the variant of PS-SVM used unless otherwise specified.

Real world example with low effort constraint collection
Our previous experiments indicate that span constraints are an effective way to improve unsupervised parsing. How can we leverage this method to improve unsupervised parsing in a real world setting? We explore two methods for easily finding span constraints (see Table 3). We find that PMI is effective as a lexicon, but not as effective as the gazetteer. PMI provides more constraints than the gazetteer, but the constraints disagree more frequently with the ground truth structure and a smaller percentage of spans align exactly with the ground truth. The gazetteer approach is better than using CCKY with ground truth entity spans, despite using less than half as many constraints, which align exactly with the ground truth only about half the time. We use the gazetteer in only the most naive way, via exact string matching, so we suspect that a more sophisticated yet still high-precision approach (e.g. approximate string matching) would have more hits and provide more benefit. For both PMI and Gazetteer, we found that NCBL gave the best performance.

Table 3: Parsing F1 on PTB. The max F1 across random seeds is measured on the test set. The corresponding span recall is shown on the Ontonotes train and test data before (R_pre) and after (R_post) training. The first row shows DIORA performance. Following rows show performance using distant supervision. EM: Exact Match (percent of span constraints that are also constituents); C: Crossing (percent of span constraints that cross a constituent); n_z: number of span constraints. The constraint-based metrics are not applicable to DIORA and are left blank.

Impact on consistent convergence
We find that using constraints with PS-SVM considerably decreases the variance in performance compared with previous baselines. This is not surprising given that latent tree learning (i.e. unsupervised parsing) can converge to many equally viable parsing strategies. By using constraints, we are guiding optimization to converge to a point more aligned with the desired downstream task.

Are entity spans sufficient as constraints?
Given that DIORA already captures a large percentage of the span constraints represented by entities, it is somewhat surprising that including them gives any F1 improvement. That being said, it is difficult to know a priori which span constraints are most beneficial and how much improvement to expect. To help understand the benefits of different types of span constraints, we derived synthetic constraints using the most frequent constituent types from ground truth parse trees in Ontonotes (see Figure 2). The constraints extracted this way look very different from the entity constraints in that they are often nested and in general much more frequent. To make a fairer comparison, we prevent nesting and downsample to match the frequency of the entity constraints (see Figure 2d). From these experiments, we see that NP or VP combined with other constraints usually leads to the best parsing performance (Figure 2c). This is the case even if DIORA had relatively low span recall on a different constraint type (Figure 2b). A reasonable hypothesis is that simply having more constraints leads to better performance, which mirrors the result that the settings with the most constraints perform better overall (Figure 2a). When filtered to match the shape and frequency of entity constraints, performance based on NP constraints is nearly the same as with entities (Figure 2d). This suggests that entity spans are effective as constraints relative to other types, but that in general we should aim to gather as many constraints as possible.
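The filtering used for the synthetic constraints can be sketched as follows. This is a hypothetical helper under stated assumptions, not the authors' extraction code: gold constituents arrive as `(label, i, j)` triples, nesting is prevented by keeping larger spans first, and `max_per_sent` is an illustrative stand-in for matching entity-constraint frequency.

```python
import random

def synthetic_constraints(gold_spans, label, max_per_sent=1, seed=0):
    """Derive non-nested constraints of one constituent type (e.g. 'NP').

    Keeps spans largest-first, drops any span nested with an already-kept
    span, then downsamples to at most max_per_sent spans to mimic the
    shape and frequency of entity constraints."""
    spans = sorted(((i, j) for l, i, j in gold_spans if l == label),
                   key=lambda s: s[1] - s[0], reverse=True)
    kept = []
    for i, j in spans:
        nested = any((a <= i and j <= b) or (i <= a and b <= j)
                     for a, b in kept)
        if not nested:
            kept.append((i, j))
    rng = random.Random(seed)
    return sorted(rng.sample(kept, min(max_per_sent, len(kept))))
```

Applying this per sentence with different labels and sampling rates reproduces the comparison conditions of Figure 2.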

Case Study: Parsing Biomedical Text
The most impactful domain for our method would be unsupervised parsing in a domain where full constituency tree annotation is very expensive, and span constraints are relatively easy to acquire. For this reason, we run experiments using the CRAFT corpus (Verspoor et al., 2011), which contains text from biomedical research. The results are summarized in Tables 4 and 5.
5.5.1 Domain Adaptation: Fine-tuning through Word Prediction

Although CRAFT and PTB are both in English, the text in biomedical research is considerably different from text in the newswire domain. When we evaluate the pre-trained DIORA model on the CRAFT test set, we find it achieves 50.7 F1. By simply fine-tuning the DIORA model on biomedical research text using only the word-prediction objective ($J_{rec}$) we can improve this performance to 55.8 F1 (+5.1 F1; DIORA_ft in Table 4). This observation highlights a beneficial property of unsupervised parsing models like DIORA: for domain adaptation, one can simply continue training on data from the target domain, which is possible because the word-prediction objective does not require label collection, unlike supervised models.

Incorporating Span Constraints
We use the ground truth entity annotation in the CRAFT training data as a source of distant supervision and continue training DIORA using the PS-SVM objective. By incorporating span constraints this way, we see that parsing performance on the test set improves from 55.8 to 56.8 (+1 F1).
For CRAFT, we used grid search over a small set of hyperparameters, including loss variants, and found that STRUCTURED RAMP performed best.

Performance by Sentence Type In Table 5 we report parsing results bucketed by sentence type, determined by the top-most constituent label. In general, across almost all sentence types, simply constraining the DIORA output to incorporate known spans boosts F1 performance. Training with the PS-SVM objective usually improves F1 further, although the amount depends on the sentence type.
Challenging NP-type Sentences We observe especially low span-recall for sentences with NP as the top-most constituent (Table 5). These are short sentences that exhibit domain-specific structure.
Here is a typical sentence and ground truth parse for that case: ((HIF - 1α) KO) - ((skeletal - muscle) (HIF - 1α) knockout mouse). Various properties of this sentence make it difficult to parse. For instance, the sentence construction lacks syntactic cues and there is no verb in the sentence. There is also substantial ambiguity with respect to hyphenation, and the second hyphen is acting as a colon. These properties make it difficult to capture the spans (skeletal - muscle) or the second (HIF - 1α), despite their being constraints.

Parsing of PTB vs. CRAFT
As mentioned in §5.5.1, there is considerable difference in the text between PTB and CRAFT. It follows that there would be a difference in difficulty when parsing these two types of data. After running the parser from Kitaev and Klein (2018) on each dataset, it appears CRAFT is more difficult to parse than PTB. For CRAFT, the unlabeled parsing F1 is 81.3 and the span recall for entities is 37.6. For PTB, the unlabeled parsing F1 is 95.

Related Work
Learning from Partially Labeled Corpora Pereira and Schabes (1992) modify the inside-outside algorithm to respect span constraints. Similar methods have been explored for training CRFs (Culotta and McCallum, 2004; Bellare and McCallum, 2007). Rather than modify the weight assignment in DIORA, which is inspired by the inside-outside algorithm, we supervise the tree predicted from the inside pass. Concurrent work to ours in distant supervision trains RoBERTa for constituency parsing using answer spans from question-answering datasets and Wikipedia hyperlinks (Shi et al., 2021). Although effective, their approach depends entirely on the set of constraints. In contrast, PS-SVM enhances DIORA, which is a model that outputs a parse tree without any supervision.
The span constraints in this work are derived from external resources, and do not necessarily match the parse tree. Constraints may conflict with the parse, which is why CCKY can achieve less than 100% span recall in Table 4. This approach to model training is often called "distant supervision" (Mintz et al., 2009; Shi et al., 2021). In contrast, "partial supervision" implies gold partial labels are available, which we explore with synthetic data (§5.4), but we do not make this assumption in general.
Joint Supervision An implicit way to incorporate constraints is through multi-task learning (MTL; Caruana, 1997). Even when relations between the tasks are not modeled explicitly, MTL has shown promise throughout a range of text processing tasks with neural models (Collobert and Weston, 2008; Swayamdipta et al., 2018; Kuncoro et al., 2020). Preliminary experiments with joint NER did not improve parsing results. This is in line with DIORA's relative weakness in representing fine-grained entity types. Modifications of DIORA to improve its semantic representation may make joint NER more viable.

Constraint Injection Methods
There exists a rich literature on constraint injection (Ganchev et al., 2010; Chang et al., 2012). Both of these methods are based on the Expectation Maximization (EM) algorithm (Dempster et al., 1977), where the constraint is injected in the E-step when calculating the posterior distribution (Samdani et al., 2012). Another line of work focuses on injecting constraints in the M-step (Lee et al., 2019; Mehta et al., 2018) by reflecting the degree of constraint satisfaction of the prediction in the weight of the gradient. Our approach is similar to Chang et al. (2012) in that we select the highest scoring output that satisfies constraints and learn from it. PS-SVM RESCALE is based on Lee et al. (2019).
The aforementioned constraint injection methods were usually used as a loss added to a supervised loss function. In this work, we show that distant supervision through constraint injection is beneficial in the unsupervised setting as well.

Structural SVM with Latent Variables
The PS-SVM loss we introduce in this work can be loosely thought of as an application-specific instantiation of Structural SVM with Latent Variables (Yu and Joachims, 2009). Various works have extended Structural SVM with Latent Variables to incorporate constraints for tasks such as sequence labeling (Yu, 2012) and co-reference resolution (Chang et al., 2013), although none we have seen focus on unsupervised constituency parsing. Perhaps a clearer distinction is that Yu and Joachims (2009) focus on latent variables within supervised tasks, whereas PS-SVM is meant to improve convergence of an unsupervised learning algorithm (i.e., DIORA).
Additional Related Work In Appendix A.3 we list additional work in unsupervised parsing not already mentioned.

Conclusion
In this work, we present a method for enhancing DIORA with distant supervision from span constraints. We call this approach Partially Structured SVM (PS-SVM). We find that span constraints based on entities are effective at improving parsing performance of DIORA on English newswire data (+5.1 F1 using ground truth entities, or +2 F1 using a gazetteer). Furthermore, we show PS-SVM is also effective in the domain of biomedical text (+1 F1 using ground truth entities). Our detailed analysis shows that entities are effective as span constraints, giving equivalent benefit as a similar amount of NP-based constraints. We hope our findings will help "bridge the gap" between supervised and unsupervised parsing.

Broader Impact
We hope our work will increase the availability of parse tree annotation for low-resource domains, generated in an unsupervised manner. Compared with full parse tree annotation, span constraints can be acquired at reduced cost or even automatically extracted.
The gazetteer used in our experiments is automatically extracted from Wikipedia, and our experiments are only for English, which is the language with by far the most Wikipedia entries. Although similarly sized gazetteers may be difficult to attain in other languages, Mikheev et al. (1999) point out that larger gazetteers do not necessarily boost performance, and gazetteers have already proven effective in low-resource domains (Rijhwani et al., 2020). In any case, we use gazetteers in the most naive way, by finding exact text matches. When extending our approach to other languages, an entity recognition model may be a suitable replacement for the gazetteer.

A.1 Constraint Statistics
Here we report a detailed breakdown of span constraints and the associated constituent types. Compared with Shi et al. (2021), span constraints based on entities are less diverse with respect to constituent type. In future work, we plan to use their data combined with DIORA and PS-SVM training. We also hypothesize that RoBERTa would be effective for data augmentation, to easily find new constraints.

A.2 Hyperparameters
We run a small grid search with multiple random seeds.

A.2.5 Why fine-tune?
To be resource efficient, we use the pre-trained DIORA checkpoint from Drozdov et al. (2019) and fine-tune it for parsing biomedical text. DIORA was trained for 1M gradient updates on nearly 2M sentences from NLI data, taking 3 days using 4 GPUs. MedMentions has ~40k training sentences, CRAFT has only ~40k, and our PS-SVM experiments run in less than 1 day using a single GPU.

A.3 Additional Related Work
In the main text, we mention the most closely related work for training DIORA with our PS-SVM objective. Here we cover other work not discussed. Unsupervised parsing has a long and dense history, and we hope this section provides context to the state of the field, our contribution in this paper, and can serve as a guide for the interested researcher.
History of unsupervised parsing over the last thirty years As early as 1990, researchers were using corpus statistics to induce grammars, not unlike how our span constraints based on PMI are derived (Brill et al., 1990); at that point the Penn Treebank was still in the process of being annotated. Other techniques focused on optimizing sentence likelihood with probabilistic context-free grammars, although with limited success (Lari and Young, 1990; Carroll and Charniak, 1992; Pereira and Schabes, 1992). Later work exploited the statistics between phrases and their context (Clark, 2001; Klein and Manning, 2001), but the most promising practical progress in this line of work was not seen until more than 15 years later.
In the mid 2010s, many papers were published about neural models for language that claimed to induce tree-like structure, although none made strong claims about unsupervised parsing. Williams et al. (2018) analyzed these models and discovered a negative result: despite their tree-structured inductive bias, when measured against ground truth parse trees from the Penn Treebank, these models did only slightly better than random and were not competitive with earlier work in grammar induction. Shortly after, Shen et al. (2018) developed a neural language model with a tree-structured attention pattern, and Htut et al. (2018) demonstrated its effectiveness at unsupervised parsing, the first positive result for a neural model. In quick succession, more papers were published with improved results and new neural architectures (Shen et al., 2019; Drozdov et al., 2019; Kim et al., 2019a,b; Cao et al., 2020, inter alia), some of which we include as baselines in Table 2. Perhaps one of the more interesting results was the improved performance of unsupervised parsing with a PCFG when parameterized as a neural model (Neural PCFG; Kim et al., 2019a). These results suggest that modern NLP machinery has made unsupervised parsing more viable, yet it is still not clear which of the newly ubiquitous tools (word vectors, contextual language models, adaptive optimizers, etc.) makes the biggest impact.

Variety of approaches to unsupervised parsing
The majority of the models in the work reported above optimize statistics with respect to the training data (with Cao et al., 2020 as an exception), but many techniques have by now been explored toward the same end. Unsupervised constituency parsing can be done in a variety of ways, including: exploiting patterns between images and text (Shi et al., 2019), exploiting patterns in parallel text (Snyder et al., 2009), joint induction of dependency and constituency (Klein and Manning, 2004), iterative chunking (Ponvert et al., 2011), contrastive learning (Smith and Eisner, 2005), and more.
Other constraint types In our work we focus on span constraints, especially those based on entities or automatically derived from a lexicon, and encourage those spans to be included in the model's prediction. Prior knowledge of language can be useful in defining other types of structural constraints.