Inducing Semantic Roles Without Syntax

Semantic roles are a key component of linguistic predicate-argument structure, but developing ontologies of these roles requires significant expertise and manual effort. Methods exist for automatically inducing semantic roles using syntactic representations, but syntax can also be difficult to define, annotate, and predict. We show it is possible to automatically induce semantic roles from QA-SRL, a scalable and ontology-free semantic annotation scheme that uses question-answer pairs to represent predicate-argument structure. By associating arguments with distributions over QA-SRL questions and clustering them in a mixture model, our method outperforms all previous models as well as a new state-of-the-art baseline over gold syntax. We show that our method works because QA-SRL acts as surrogate syntax, capturing non-overt arguments and syntactic alternations, which are central motivators for the use of semantic role labeling systems.


Introduction
Semantic role labeling (SRL) requires extracting propositional predicate-argument structure from language, i.e., who is doing what to whom. Applications of SRL include information extraction (Christensen et al., 2011), machine reading (Wang et al., 2015), and model analysis (Tenney et al., 2019; Kuznetsov and Gurevych, 2020), and semantic roles form the backbone of many more general meaning representations (Banarescu et al., 2013; Abend and Rappoport, 2013).
The primary challenge, and promise, for SRL systems is to distill syntactically variable surface structures into semantic predicate-argument structures from an ontology (Palmer et al., 2005; Baker et al., 1998). However, ontologies and their associated training data require time and expertise to annotate and do not readily generalize to new domains, limiting their broad-coverage applicability. Prior work towards mitigating this problem includes unsupervised induction of semantic roles from syntactic representations (Lang and Lapata, 2010). However, the need for formal syntactic supervision retains some of the annotation and generalization difficulties of supervised SRL, and it has proven difficult to do much better than a simple syntactic baseline (Lang and Lapata, 2011). An alternative is to use an ontology-free annotation scheme like QA-SRL (He et al., 2015), which represents roles with natural language questions. While QA-SRL can be annotated at large scale (FitzGerald et al., 2018), many different QA-SRL questions may correspond to the same role, making it more difficult to use in downstream tasks.

Table 1: Roles for give produced by our final model. Core arguments are captured almost perfectly, exhibiting both passive and dative alternations.

The plane was diverting around weather formations over the Java Sea when contact with air traffic control (ATC) in Jakarta was lost.

Question (wh aux subj verb obj prep obj2 ?) | Answer
What was being diverted around? | weather formations
What was diverting? | The plane
What was being diverted? | The plane
What was lost? | contact with air traffic control
Where was something lost? | over the Java Sea

Table 2: Example QA-SRL question-answer pairs from the development set of the QA-SRL Bank 2.0 (FitzGerald et al., 2018). Questions may be represented in a verb-agnostic way by recording the form of the verb in the verb slot (e.g., stem, past participle). Note that the syntax used in questions may differ from the syntax in the source sentence, for example in the above questions using diverted in its passive form.

We show how to overcome this difficulty by automatically inducing an ontology of semantic roles corresponding to clusters of QA-SRL questions (see Table 1 for an example clustering). We use a model to predict a distribution over QA-SRL questions associated with each argument in a corpus, and cluster them to maximize likelihood under a simple model we call a Hard Unigram Mixture. Our model can be effectively optimized both by EM and greedy methods, which affords the benefits of tunable hierarchical clustering without sacrificing scalability (Section 3).
Experiments in semantic role induction (Section 4) show that our method outperforms all previous methods in the literature, as well as a new state-of-the-art baseline over gold syntax. This is despite requiring no formal syntactic supervision or theory, even though the formalism used by previous work is highly informative of gold-standard semantic roles (Section 5). We also present a detailed analysis (Section 6) showing why our method works: QA-SRL acts as surrogate syntax, removing (role-irrelevant) syntactic variation in the source text, such as that from non-overt arguments (e.g., phrases extracted from relative clauses), while itself exhibiting (role-relevant) syntactic alternations which capture the behavior of verbal predicates (Table 1). Taken together, these results paint a path towards on-the-fly, data-driven construction of useful, interpretable ontologies of semantic structure.

Task Setting
The input to our task is a set of natural language sentences, where a subset of the tokens are marked as predicates. Each predicate has a set of arguments, and each argument x corresponds to a set of spans x = {s_1, ..., s_m} in the predicate's sentence. 2 An ontology of semantic roles is a set of frames (corresponding to semantic predicates), and each frame has a set of associated roles (corresponding to participants in the event or state denoted by its frame). There may also be a set of modifier roles (e.g., location or time) which can appear with any frame. In supervised semantic role labeling, each predicate in the input data must be assigned to one of the frames in a given ontology, and each of a predicate's arguments must be assigned roles from its frame (or modifier roles). In semantic role induction, our task is to produce both the ontology and these assignments.
We follow prior work (Lang and Lapata, 2010) in treating semantic role induction as a clustering problem and assuming a single frame per predicate lemma. 3 Given input data marked with predicates and their arguments, we cluster the arguments for each predicate into sets corresponding to semantic roles. We may then compare these clusters to gold labels using clustering metrics (Section 4.3).
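Concretely, the setting can be pictured as follows; the container and driver names in this sketch are illustrative rather than part of any released code.

from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class Argument:
    spans: List[Tuple[int, int]]   # token spans marking this argument in its sentence
    sentence_id: int               # which sentence it comes from
    predicate_index: int           # which predicate instance it attaches to

def induce_roles(
    arguments_by_lemma: Dict[str, List[Argument]],
    cluster: Callable[[List[Argument]], List[int]],
) -> Dict[str, List[int]]:
    # One frame per predicate lemma: cluster each lemma's arguments
    # independently; each cluster id stands for one induced role.
    return {lemma: cluster(args) for lemma, args in arguments_by_lemma.items()}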
2 Previous work (Lang and Lapata, 2010) assumes a syntactic dependency tree and marks each argument by its syntactic head, which allows for features based on argument lemmas and dependency paths. We instead assume sets of argument spans, but no syntax tree; this allows for features based on spans (such as QA-SRL questions). Both approaches are ways of featurizing the same gold arguments.
3 Some ontologies, like FrameNet (Baker et al., 1998), define frames that span multiple lemmas (e.g., buy and sell share a Commercial Transaction frame), whereas others like PropBank (Palmer et al., 2005) use frames which are specific to each lemma, denoting something closer to word sense. In our case, assuming a single frame per lemma simplifies modeling and allows us to compare to previous work. However, modeling predicate sense is an important problem for future work, as we will suggest in Section 6.3.

Modeling
Our model treats each argument x as a set of counts of QA-SRL questions, 4 denoted φ(x). We produce these counts from a trained QA-SRL question generator (Section 3.1) and cluster them by maximizing their likelihood under a mixture model (Section 3.2) using a hybrid of flat and hierarchical clustering (Section 3.3).

Generating QA-SRL Features
For each argument x of a predicate, we leverage a trained QA-SRL parser to generate pseudocounts φ(x) of simplified QA-SRL questions, which will form the input features for the clustering step.
Simplified QA-SRL Example QA-SRL questions are shown in Table 2. These questions contain information which is not directly relevant to semantic roles, such as tense, aspect, modality, and negation. Since this creates sparsity for our model, we remove it as a preprocessing step. In particular, we replace the aux and verb slot values with either is and past participle (for passive voice), _ and present (for active voice when subj is blank), or does and stem (for active voice when subj is present). We also replace all occurrences of who and someone with what or something.
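A minimal sketch of this simplification, assuming a question arrives as a dictionary over the seven QA-SRL slots together with a voice flag (the slot encoding and the flag are illustrative; the parser's internal representation differs):

def simplify_question(slots, is_passive):
    # Strip tense, aspect, modality, and negation from a QA-SRL question.
    slots = dict(slots)
    if is_passive:
        slots["aux"], slots["verb"] = "is", "past participle"
    elif slots["subj"] == "_":        # active voice, subject slot is blank
        slots["aux"], slots["verb"] = "_", "present"
    else:                             # active voice with an overt subject slot
        slots["aux"], slots["verb"] = "does", "stem"
    # Collapse animacy distinctions in the wh-word and placeholders.
    return {name: value.replace("who", "what").replace("someone", "something")
            for name, value in slots.items()}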
Generating Question Counts Let p denote a predicate, s denote a span, and q denote a simplified QA-SRL question. To generate our question count vectors φ, we reproduce the QA-SRL question generator of FitzGerald et al. (2018), which generates a distribution P(q | p, s) over QA-SRL questions conditioned on a predicate p and answer span s in a sentence. This model uses a BiLSTM encoder, concatenating the output representations of span endpoints and feeding them into a custom LSTM decoder which models the QA-SRL slot values in sequence. We modify the model to use BERT (Devlin et al., 2019) features as input embeddings for the BiLSTM (details in Appendix A).
Recall from Section 2 that an argument x consists of a set of spans from its sentence. We generate question counts φ(x) ∈ R^{|q|}_{≥0} by taking the mean of the generator's question distributions over the argument's spans,

φ(x)_q = (1 / |x|) Σ_{s ∈ x} P(q | p, s),

where R_{≥0} denotes the nonnegative real numbers and |q| is the number of possible simplified QA-SRL questions. Since |q| is large, to make this tractable we approximate P(q | p, s) with beam search, using a sparse representation and assigning counts of 0 to questions outside the beam.
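For illustration, the pseudocount construction can be sketched as follows, assuming each answer span contributes a beam of (simplified question, probability) pairs from the generator (names are ours):

from collections import defaultdict

def question_counts(question_beams):
    # question_beams: one beam per span of the argument, each a list of
    # (simplified_question, probability) pairs. Questions outside a beam
    # implicitly receive count 0; counts are averaged over the spans.
    phi = defaultdict(float)
    for beam in question_beams:
        for question, prob in beam:
            phi[question] += prob / len(question_beams)
    return dict(phi)   # sparse representation of phi(x)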

Objective
Let X = {x_1, ..., x_n} be the set of input arguments for clustering. Our goal is a clustering C = {C_1, ..., C_k} which is a partition of X. We model each argument's questions φ(x) as being drawn from a mixture model over latent roles, each corresponding to a cluster C ∈ C. We maximize likelihood under this model, which we call a Hard Unigram Mixture, with the addition of a connectivity penalty which encourages roles not to appear twice for the same predicate instance.
The Hard Unigram Mixture (HUM) Recall that φ : X → R^{d}_{≥0} assigns question pseudocounts to each x ∈ X. Let π denote a probability distribution over {1, ..., k} and θ a distribution over {1, ..., d}. We propose the Hard Unigram Mixture loss

L^HUM_λ(C) = L_data(C) + λ L_clust(C),

where

L_data(C) = −Σ_{C ∈ C} Σ_{x ∈ C} Σ_i φ(x)_i log θ̂_{C,i}

is the data likelihood and

L_clust(C) = −Σ_{C ∈ C} ||C|| log (||C|| / ||X||)

is the clustering likelihood, writing ||C|| for the sum of the φ counts in a cluster C, and θ̂_C ∝ Σ_{x ∈ C} φ(x) and π̂_C = ||C|| / ||X|| for the maximum likelihood estimates of θ and π given the clustering. The data likelihood prefers more, smaller clusters, the clustering likelihood prefers fewer clusters, and λ is a hyperparameter that trades off between them.5

Connectivity Penalty Let p(x) denote the predicate instance corresponding to an argument x. We propose a connectivity penalty

L_cp(C) = Σ_{C ∈ C} Σ_{x_1, x_2 ∈ C, x_1 ≠ x_2} δ[p(x_1) = p(x_2)],

where δ is the indicator function, which discourages clusterings where multiple arguments of the same predicate instance are assigned the same role. This assumption has also been leveraged by prior models (Lang and Lapata, 2011; Titov and Klementiev, 2012).

5 L^HUM_1 is equivalent to the negative log likelihood under the maximum likelihood estimate of a mixture of unigrams model (Nigam et al., 2000) constrained to hard assignments C; hence the name Hard Unigram Mixture. Further theoretical and empirical comparison to prior work is provided in Appendix G.

Loss Function Our full loss is then

L_λ(C) = L^HUM_λ(C) + L_cp(C),

with the single hyperparameter λ.
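A minimal sketch of this loss, assuming dense NumPy count vectors; the exact form of the connectivity penalty used here (a unit penalty for each same-predicate pair sharing a cluster) is one plausible reading of the description above rather than a definitive implementation.

import numpy as np

def hum_loss(phi, clusters, lam, eps=1e-12):
    # phi: (n, d) question pseudocounts; clusters: list of index lists
    # partitioning range(n); lam: data-vs-clustering likelihood tradeoff.
    total_mass = phi.sum()
    data_nll = clust_nll = 0.0
    for members in clusters:
        counts = phi[members].sum(axis=0)      # pooled counts for the cluster
        mass = counts.sum()                    # ||C||
        if mass == 0:
            continue
        theta = counts / mass                  # per-cluster unigram MLE
        data_nll -= (counts * np.log(theta + eps)).sum()
        clust_nll -= mass * np.log(mass / total_mass)
    return data_nll + lam * clust_nll

def connectivity_penalty(clusters, predicate_of):
    # Count same-predicate argument pairs that share a cluster.
    penalty = 0
    for members in clusters:
        preds = [predicate_of[i] for i in members]
        for a in range(len(preds)):
            for b in range(a + 1, len(preds)):
                penalty += int(preds[a] == preds[b])
    return penalty

def full_loss(phi, clusters, predicate_of, lam):
    return hum_loss(phi, clusters, lam) + connectivity_penalty(clusters, predicate_of)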

Hybrid Clustering
We optimize L λ in three steps: flat pre-clustering, greedy merging, and tuned splitting. This approach provides us with both the efficiency benefits of flat clustering and the relative determinism, interpretability and tunability of hierarchical clustering.
Flat Pre-Clustering For pre-clustering, we minimize L_0 via hard EM. To avoid likelihoods of 0 in L^HUM_0, we smooth our estimates of θ using a Dirichlet prior. To optimize L_cp via EM, we draw x_1 from the previous iteration's clustering in order to compute the contribution of each x_2 to the loss. With sufficiently large k, this can produce a high-precision clustering in O(nk) time to serve as input to the merging step.
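The pre-clustering step can be sketched as hard EM over the unigram mixture; this sketch omits the connectivity term and random restarts, and the function and argument names are ours.

import numpy as np

def hard_em(phi, k, alpha=0.01, iters=100, seed=0, eps=1e-12):
    # phi: (n, d) question pseudocounts. Returns hard assignments z in [0, k).
    rng = np.random.default_rng(seed)
    n, d = phi.shape
    z = rng.integers(k, size=n)                    # random initial assignment
    for _ in range(iters):
        # M-step: per-cluster unigram parameters, Dirichlet-smoothed.
        theta = np.full((k, d), alpha / d)
        for c in range(k):
            theta[c] += phi[z == c].sum(axis=0)
        theta /= theta.sum(axis=1, keepdims=True)
        # E-step: reassign each argument to its most likely cluster.
        new_z = (phi @ np.log(theta + eps).T).argmax(axis=1)
        if np.array_equal(new_z, z):
            break
        z = new_z
    return z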
Greedy Merging After pre-clustering, we produce a binary cluster tree by iteratively merging pairs of clusters which greedily minimize L_0. Since λ = 0, the loss grows monotonically when merging clusters. The loss at each merge can be efficiently updated by maintaining maximum likelihood estimates θ for each cluster.
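A naive sketch of the merging stage follows; pooled counts per cluster make each candidate merge's increase in data NLL (the total-size-scaled Jensen-Shannon divergence between the two clusters) cheap to compute, though a real implementation would cache and reuse these deltas.

import numpy as np
from itertools import combinations

def data_nll(counts, eps=1e-12):
    mass = counts.sum()
    if mass == 0:
        return 0.0
    return -(counts * np.log(counts / mass + eps)).sum()

def greedy_merge(cluster_counts):
    # cluster_counts: list of pooled phi count vectors, one per pre-cluster.
    # Returns the root of a binary merge tree and the sequence of merges.
    counts = [c.astype(float) for c in cluster_counts]
    nodes = list(range(len(counts)))      # leaf ids, then nested (i, j) tuples
    merges = []
    while len(counts) > 1:
        best = None
        for i, j in combinations(range(len(counts)), 2):
            delta = (data_nll(counts[i] + counts[j])
                     - data_nll(counts[i]) - data_nll(counts[j]))
            if best is None or delta < best[0]:
                best = (delta, i, j)
        delta, i, j = best
        merged_counts, merged_node = counts[i] + counts[j], (nodes[i], nodes[j])
        for idx in (j, i):                # remove the higher index first
            counts.pop(idx); nodes.pop(idx)
        counts.append(merged_counts); nodes.append(merged_node)
        merges.append((merged_node, delta))
    return nodes[0], merges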
Tuned Splitting Finally, we iteratively split the cluster tree produced by the merging stage. At each step, we split the cluster C_i with the lowest log data likelihood per item, log P(C_i | C) / |C_i|. We then choose the clustering which minimizes L_λ, with λ > 0 tuned during model development.6

Experimental Setup

Data We evaluate on the CoNLL 2008 Shared Task data (Surdeanu et al., 2008), removing continuation (C-) roles, keeping only verbal predicates, 7 and using the development set for model development and the training set for testing.
Our one preprocessing difference from previous work is that instead of using the dependency-based SRL annotations provided in the CoNLL 2008 dataset, we use full answer spans, which we reconstruct by aligning the CoNLL 2008 data back to the original annotations in the Penn Treebank (Marcus et al., 1993) and PropBank. 8

Models
HUM of QA-SRL Questions (HUM-QQ) We train a QA-SRL parser on the expanded set of the QA-SRL Bank 2.0 (FitzGerald et al., 2018) using the architecture described in Section 3.1. In the preclustering step, we estimate k = 100 clusters. For tuned splitting, we choose λ to maximize performance on the development set. Hyperparameters are detailed in Appendix B.
SYNTF This model assigns each argument to a cluster corresponding to the label of its syntactic dependency to its parent, using the syntactic formalism provided in CoNLL 2008 Shared Task data. Past work has found SYNTF to be a strong baseline (Lang and Lapata, 2011).
Prior Work We compare to Bayesian generative modeling (Titov and Klementiev, 2012, BAYES), which is state-of-the-art on gold syntax, and an embedding-based method (Luan et al., 2016, SYMDEP/ASYMDEP) which is state-of-the-art using automatic syntax. These as well as all other prior approaches (e.g., Lang and Lapata, 2011;Titov and Khoddam, 2015;Woodsend and Lapata, 2015) crucially rely on syntactic features.

Auxiliary Clustering Rules
For SYNTF and HUM-QQ, we experiment with several auxiliary clustering rules.
Lexical Rules We employ three lexical rules, each producing a separate cluster for all arguments whose spans exactly match a phrase contained in the rule's lexicon. Our rules are for negation (5 phrases), modals (23 phrases), and discourse modifiers (55 phrases). These lexica were written to correspond to the AM-NEG, AM-MOD, and AM-DIS roles on the basis of the PropBank annotation guidelines (Babko-Malaya, 2005) and development set. 9

9 Full lexica for these rules are provided in Appendix C.

Passive to Active Conversion We also propose a syntactic rule that applies only to SYNTF, where we transform the dependencies as follows (a sketch of the procedure appears after the list):

• The LGS label, meaning "logical subject," is a dependency label given to by-phrases modifying a passive verb whose object denotes what is normally the subject of the verb's active form (Surdeanu et al., 2008). We change this to SBJ.

• Passive voice can be detected when the predicate verb is in past participle form (part-of-speech tag VBN) and its syntactic parent is a be-verb (i.e., the predicate is attached by a VC dependency to a token with lemma "be"). In these cases, we change the syntactic label of any SBJ dependents into OBJ.
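For concreteness, here is a sketch of the two relabeling rules over CoNLL-style dependency tokens; the field names ('deprel', 'head', etc.) are illustrative.

def passive_to_active(tokens):
    # tokens: list of dicts with 'pos', 'lemma', 'head' (parent index, -1 for
    # root), and 'deprel' fields for one sentence.
    for tok in tokens:
        if tok["deprel"] == "LGS":          # rule 1: logical subject -> subject
            tok["deprel"] = "SBJ"
    for i, tok in enumerate(tokens):
        head = tokens[tok["head"]] if tok["head"] >= 0 else None
        # Rule 2: a VBN predicate governed by a be-verb is passive, so its
        # surface subjects fill the object slot of the active form.
        if tok["pos"] == "VBN" and head is not None and head["lemma"] == "be":
            for dep in tokens:
                if dep["head"] == i and dep["deprel"] == "SBJ":
                    dep["deprel"] = "OBJ"
    return tokens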

Metrics
Purity/Collocation To compare with previous work, we follow Lang and Lapata (2010) in using purity and collocation based F1 score for our main evaluation. Purity measures cluster homogeneity: it assigns to each cluster the gold label for which it has the most points, and then measures the proportion of points which have their cluster's assigned label. Collocation measures cluster concentration: it assigns each gold label to the cluster which contains the most of its points, and then measures the proportion of points which are in their gold label's assigned cluster. These are calculated independently for each verb and averaged, weighing each verb by its number of argument instances. The harmonic mean of the final results is reported as an F1 score.
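A sketch of these metrics as described, assuming each verb's arguments arrive as parallel lists of predicted cluster ids and gold role labels:

from collections import Counter

def purity_collocation(pred_clusters, gold_labels):
    # Purity and collocation for a single verb.
    n = len(gold_labels)
    by_cluster, by_label = {}, {}
    for c, g in zip(pred_clusters, gold_labels):
        by_cluster.setdefault(c, Counter())[g] += 1
        by_label.setdefault(g, Counter())[c] += 1
    purity = sum(cnt.most_common(1)[0][1] for cnt in by_cluster.values()) / n
    collocation = sum(cnt.most_common(1)[0][1] for cnt in by_label.values()) / n
    return purity, collocation

def purity_collocation_f1(per_verb):
    # per_verb: list of (pred_clusters, gold_labels) pairs, one per verb.
    # Verbs are weighted by their number of argument instances.
    total = sum(len(labels) for _, labels in per_verb)
    pu = sum(len(l) * purity_collocation(c, l)[0] for c, l in per_verb) / total
    co = sum(len(l) * purity_collocation(c, l)[1] for c, l in per_verb) / total
    return 2 * pu * co / (pu + co)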
B3 For deeper analysis, we use the B3 (B-cubed) family of clustering metrics (Bagga and Baldwin, 1998). B3 precision and recall are the precision and recall of each point's predicted cluster against its gold cluster, averaging over points. In comparison to purity and collocation, these metrics are tougher and more discriminative between clusterings, respecting important constraints like the cluster completeness constraint of Rosenberg and Hirschberg (2007), among others (Amigó et al., 2009). B3 also allows us to reliably report scores along slices of the data for analysis purposes, as well as account for each slice's contribution to the total error. We report full B3 results for our models in Appendix F and encourage future work to use these as the primary metrics.
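The per-point B3 scores described above can be sketched in the same style:

from collections import Counter

def b_cubed(pred_clusters, gold_labels):
    # Average per-point precision and recall of each point's predicted
    # cluster against its gold cluster.
    pred_sizes = Counter(pred_clusters)
    gold_sizes = Counter(gold_labels)
    joint = Counter(zip(pred_clusters, gold_labels))
    precision = recall = 0.0
    for c, g in zip(pred_clusters, gold_labels):
        overlap = joint[(c, g)]
        precision += overlap / pred_sizes[c]
        recall += overlap / gold_sizes[g]
    n = len(gold_labels)
    precision, recall = precision / n, recall / n
    return precision, recall, 2 * precision * recall / (precision + recall)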

Results
Table 3: Main results. The addition of a few simple rules to the SYNTF baseline puts it significantly above existing approaches, and incorporating these rules into our QA-SRL-based model pushes performance even further, despite not using gold syntax at all. Evaluation numbers for baselines besides SYNTF are drawn directly from prior work.

Main results are shown in Table 3. Our auxiliary rules put SYNTF significantly above the state of the art for gold syntax (with 85.2 F1 versus 83.0). HUM-QQ surpasses it with 87.1 F1 in the best case, despite not using gold syntax at all.

A Stronger Syntactic Baseline
For SYNTF, the addition of either the lexical (negation, modal, and discourse) rules or the passive-to-active conversion produces a competitive model, covering over 75% of the gap from baseline to BAYES. Used together, our rules bring the score to 85.2 F1, surpassing BAYES by 2.2 points. The negation and modal rules account for large gains on their corresponding roles, with the discourse rule providing significant improvements as well. In contrast, previous models have struggled with these roles, as reported by Lang and Lapata (2011, Table 4, NEG and DIS roles). However, this is better seen as a shortcoming of the evaluation than of the models: these roles are relatively uninteresting from the perspective of semantic role induction, as they are closed-class, not specific to particular predicates, and do not correspond to a semantic argument or modifier of the event denoted by the predicate. It might have been reasonable to exclude these arguments from the task at the outset; instead, using our rules can mostly account for them while maintaining some comparability to prior work.
The passive-to-active conversion also produces a sizable gain, particularly on the core roles A0 and A1 (Table 5). Titov and Klementiev (2012) informally note that the BAYES model learns some syntactic alternations; of these, the passive alternation is perhaps the most impactful as it can apply to any transitive verb. What we've found is that a simple rule accounting for the passive construction in the syntax provided to the BAYES model can account for a large majority of its gains.
These results provide extra context in which to interpret the existing literature on semantic role induction. The fact that our simple auxiliary rules bring the syntactic baseline beyond the existing state of the art raises questions about whether the performance differences between previously published models are due to their relative abilities in capturing their intended phenomena, such as selectional restrictions and distributions over argument heads (Lang and Lapata, 2014), or in capturing these rules. It is not clear how much of the 5.2 F1 gain over SYNTF from our auxiliary rules is redundant with previous models. It seems likely that applying our rules to them would produce a result competitive with HUM-QQ, but it would still rely on gold syntax. Our focus is the utility of QA-SRL as features; indeed, it is also conceivable that applying a hierarchical model like BAYES to QA-SRL features would bring further improvements as well.

Superiority Without Syntax
HUM-QQ benefits disproportionately from the lexical rules, with a 5 F1 gain as opposed to the 2.8 F1 gain for SYNTF. This is because PropBank's NEG, MOD, and DIS arguments almost never occur in QA-SRL, so they get nonsense questions from the model (see Appendix J, Table 12). 10 However, even the baseline model with no lexical rules or connectivity penalty surpasses the performance of the baselines using automatic syntax, all of which fall short of SYNTF on gold. 11 With these additions, HUM-QQ sets a new state of the art beyond our enhanced SYNTF baseline, with 87.1 F1. Table 4 compares our model to SYNTF + lex on the most common roles using B3. HUM-QQ greatly improves over SYNTF on core arguments (73→85 F1), but performs worse on modifiers (74→61). Since core arguments make up 74% of arguments in the corpus, HUM-QQ brings a large improvement overall (74→82), and core arguments still account for a majority of its error (at 61%).
SYNTF's high performance on modifiers can be traced back to representational choices in the CoNLL 2008 Shared Task syntax (Surdeanu et al., 2008), which uses several dependency types that are semantic in nature, such as TMP, LOC, MNR, and DIR, among others. These often correlate well with gold modifier role labels, especially TMP (87 F1) and LOC (81 F1). 12 This fact has led some prior work, e.g., Titov and Klementiev (2012), to use these dependency labels as clusters directly, so as to avoid the need to model modifier roles and instead focus on core arguments. Since we eschew syntactic features, we are forced to recover PropBank modifier roles from the ground up, making the task more difficult (explored more in Section 6.2).
6 What does QA-SRL Encode About Semantic Roles?
Semantic roles are traditionally characterized as abstractions over syntactic arguments and modifiers (Gruber, 1965;Fillmore, 1968). Despite their deep entanglement with syntax, we have found that significant improvements in semantic role induction are possible without explicit syntactic analysis of the sentence, instead leveraging distributions of QA-SRL questions for each argument. In this section, we show that this is because QA-SRL questions provide surrogate syntax, recapitulating the aspects of syntax that are important for semantic roles (Section 6.1). Where QA-SRL questions fail to capture aspects of PropBank semantic roles, this arises in part from ontological differences with PropBank on modifiers (Section 6.2) and limitations of our experimental setup ignoring predicate sense (Section 6.3).

Surrogate Syntax
HUM-QQ brings the largest improvement over SYNTF on core arguments A0 and A1. To investigate this, we identify the verbs which saw the greatest increase in B3 F1 score on each role individually. What we find is that QA-SRL works by acting as surrogate syntax: it removes much of the (role-irrelevant) syntactic variation in the source text, while still exhibiting (role-relevant) syntactic alternations which capture the syntactic behavior of the predicate verb.
Reducing Syntactic Variation For A0, the three verbs with the greatest improvement from SYNTF to HUM-QQ are compete, conduct, and connect, all with gaps of over 40 F1. 13 For each of these, their A0 arguments have a wide range of syntactic functions assigned by SYNTF, with SBJ less than 50% of the time, despite the fact that where the A0 role is present, it is designed to correspond to the grammatical subject (Babko-Malaya, 2005). We found that this is because these verbs frequently have non-overt subjects, which are not direct syntactic dependents of the predicate in CoNLL 2008 syntax (74% of a random sample of 30 sentences with A0 arguments of these three verbs, 10 from each; see Appendix H.1). They appear in phrases like 'two competing objectives' (with adjectival clauses), 'urging directors to conduct a fair auction' (with control verbs), or 'a maze of halls that connects film rooms' (with relative clauses). In these cases, the SYNTF baseline does poorly, as the correspondence between the SBJ dependency and A0 role only holds consistently for overt subjects.
In contrast, HUM-QQ assigns the vast majority of A0 arguments in these cases with questions that put the wh-word in subject position, e.g., What competes with something? or What conducts? Here, QA-SRL removes much of the syntactic variation from the source text and recovers something close to the underlying grammatical relation between the argument and the verb, while also providing information about the verb's subcategorization frames (e.g., the presence of an object in What connects something?), aiding in recovery of the semantic role.
Capturing Syntactic Alternations For A1, the verbs with the greatest improvement are propose, prefer, price, and relate, with a gap of >50 F1 between models. Of the top 50 such verbs, 48 are transitive with A1 as the transitive object (see Appendix H.2). In these cases, the passive alternation allows the argument to be asked about in either the subject (What is proposed?) or object (What does something propose?) position. We find that QA-SRL does this, frequently combining questions about passive subject and active object into one role: for 62% of the top 50 verbs, the cluster corresponding to A1 gives greater than 20% probability each to passive subject and active object questions. This happens because the Hard Unigram Mixture objective clusters together distributions whose uncertainty is spread over the same set of elements, which here correspond to syntactic alternations. As an example, Table 1 shows the induced clusters for give, which exhibit both passive and dative alternations; give gained 31 F1 on A1 in HUM-QQ.

Mismatched Modifiers
HUM-QQ struggles to identify PropBank modifier roles, and it has room for improvement on trailing arguments like A2 and A3. In QA-SRL, the semantics of these roles are primarily expressed by the initial wh-word, such as when, where, why, how, etc. Figure 1a shows the distribution of wh-words appearing for each role in the training set. To a large extent, each role is concentrated on a corresponding wh-word, but there are exceptions. A2, A3, and AM-ADV are widely spread between wh-words, and how and why account for a significant portion of questions for several roles each. See Appendix J, Table 11 for full questions.
To visualize how this affects clustering results, Figure 1b shows the normalized pointwise mutual information (NPMI; Bouma, 2009) between gold labels in HUM-QQ's predicted clusters (see Appendix I for how this is calculated). While A0 and A1 are distinguished well from all other roles, the trailing arguments A2 and A3 are not well distinguished from modifiers, reflecting the difficulty of the argument-adjunct distinction for these arguments, which often have similar meanings to modifiers and form a significant error case for supervised labelers (He et al., 2017). AM-ADV tends to be confused with other modifier roles, which reflects its definition in the PropBank guidelines as a sort of "catch-all" role for meanings not captured in the other modifiers (Babko-Malaya, 2005). Finally, AM-CAU (cause) and AM-PNC (purpose, not cause) tend to be confused with each other, since they both elicit why questions.

Argument-Adjunct Distinction
Scores are significantly lower for trailing core arguments A2-4 than for A0 and A1. Since part of the problem seems to be confusion with modifier roles (Figure 1b), we conduct an oracle experiment to enforce the argument-adjunct distinction by doubling the size of the feature space to φ(x) ∈ R^{2|q|}_{≥0} and projecting gold core arguments and modifiers into orthogonal subspaces.
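This projection is simple to sketch (names are illustrative): each argument keeps its question counts, but gold core arguments and modifiers occupy disjoint halves of the doubled feature space, so no induced cluster can mix the two.

import numpy as np

def arg_adj_oracle_features(phi, is_core):
    # phi: (n, d) question counts; is_core: boolean array of gold
    # core-argument (True) vs. modifier (False) flags.
    n, d = phi.shape
    out = np.zeros((n, 2 * d))
    out[is_core, :d] = phi[is_core]        # core arguments use the first half
    out[~is_core, d:] = phi[~is_core]      # modifiers use the second half
    return out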
Results are shown in Table 6 (+ gold arg/adj). The oracle boosts performance by 3 points, with particular focus on trailing arguments A2 (69→78) and A4 (65→78), as well as modifiers AM-ADV (39→47), AM-MNR (50→57), and AM-LOC (55→61). However, overall performance on modifiers is still far below the syntactic baseline. Given the coarse semantics of English wh-words in comparison to PropBank modifier roles (Figure 1a), it may be that finer-grained features are necessary to significantly increase performance on modifiers.

Scrambled Senses
Despite core arguments significantly improving under HUM-QQ, they remain the largest source of error. To investigate this, we examine the verbs with the worst F1 on core arguments. The top verbs are go, settle, confuse, turn, and follow, with <60 F1. Half of the top 20 have 4 or more predicate senses annotated in PropBank, where different senses often manifest their roles differently: for example, the subject is A0 when settling with the IRS (sense 2), but A1 when settling into a new job (sense 3). To quantify this, we run an oracle experiment where we induce roles for each verb sense separately instead of each verb lemma. Results are shown in Table 6 (+ gold sense). Performance improves particularly on trailing arguments A2, A3 and A4, which tend to differ greatly in meaning and realization for different predicate senses. A combined oracle (+ both) shows that the gains are mostly complementary with those from the argument/adjunct distinction oracle. These results suggest that future work on semantic role induction should prioritize modeling predicate senses.

Conclusion
We have shown that QA-SRL provides a way to do state-of-the-art semantic role induction without the need for formal syntax. It works by providing surrogate syntax: it captures long-distance dependencies to non-overt arguments and exhibits syntactic alternations which allow us to detect varied ways of expressing the same role. These results suggest that QA-SRL can provide some of the practical benefits of sophisticated syntactic formalisms that have separate layers of functional structure, like Combinatory Categorial Grammar (Steedman, 1996, 2000) and Head-Driven Phrase Structure Grammar (Pollard and Sag, 1994). While formal ontologies of semantic roles and syntax are difficult to formulate and scale, our results show how it may be comparatively feasible to formulate, scale, and build robust models for the phenomena that such ontologies are meant to explain. QA-SRL exhibits enough of these phenomena that a relatively simple model over it (the Hard Unigram Mixture in Section 3) yields state-of-the-art induced semantic roles which are interpretable and linguistically meaningful. This suggests that identifying and gathering supervision for more phenomena (e.g., those related to word sense or modifier semantics) in a relatively theory-agnostic way, then building models grounded in linguistic theory, may be a promising avenue for future work. This general approach has recently been applied to syntax as well, for example leveraging constituency tests (Cao et al., 2020) and naturally occurring bracketings (Shi et al., 2021).
The fact that discrete structures can be reliably derived from ontology-free annotation schemes like QA-SRL can potentially inform future efforts to construct large-scale ontologies of semantic structure. QA-SRL has the further benefit over traditional SRL of including a broader scope of implicit arguments than those addressed by supervised systems, as shown by Roit et al. (2020). Taken together, our results suggest that with the right kind of annotation scheme, it should be possible to construct rich semantic ontologies in new domains, without expert curation and in a data-driven, linguistically motivated way.

A QA-SRL Question Generator
We reproduce FitzGerald et al. (2018)'s architecture, encoding sentences with a stacked alternating LSTM (Zhou and Xu, 2015) with highway connections (Srivastava et al., 2015) and recurrent dropout (Gal and Ghahramani, 2016), and representing spans by concatenating the output embeddings of their endpoints (Lee et al., 2016). The question generator is a specialized LSTM decoder which only outputs the tokens allowed in each QA-SRL slot. The current predicate is indicated by an embedded binary feature input to the BiLSTM encoder, and answer span representations are input at each step of the LSTM decoder. We make two changes from FitzGerald et al. (2018): 1) instead of GloVe (Pennington et al., 2014) or ELMo (Peters et al., 2018), we embed the inputs with BERT-base (Devlin et al., 2019) in the 'feature' style with a learned scalar mix over layers, and 2) we additionally concatenate the output embedding of the predicate to the input of the LSTM decoder.

B Hyperparameters
QA-SRL Question Generator The BiLSTM encoder uses a hidden size of 300, 4 layers, 0.1 recurrent dropout probability, and a 100-dimensional predicate indicator embedding. The LSTM decoder has a 100-dimensional hidden state and predicts QA-SRL slots with 200-dimensional embeddings via an MLP with a 100-dimensional hidden layer. We train on all QA pairs in the QA-SRL Bank 2.0 expanded training set using BERT's variant of Adam (Kingma and Ba, 2015) with a learning rate of 5e−5 and batch size of 32, selecting the model with minimal perplexity on the expanded development set. To produce our feature vectors φ, we decode questions with a beam size of 20 and a minimum probability cutoff of 0.01.
Flat Pre-Clustering We perform flat clustering with 100 clusters, skipping this step for verbs with 100 arguments or fewer. We use a concentration parameter of α = 0.01 (i.e., a uniform base measure with a sum of 0.01) and do 5 random restarts, each running until the loss decreases by less than 1e−5 per iteration, and choose the run that yields the lowest loss.

Tuned Splitting Our final model (HUM-QQ + lex) uses λ = 0.35.

C Rule Lexica
Here we list the full lexica for the auxiliary clustering rules described in Section 4.2.
Discourse 55 items: after all, ah, also, and, and so, as a result, as we've seen before, as well, but, certainly, damn, either, for example, for instance, for one, for one thing, frankly, furthermore, gosh, hence, however, in addition, in any case, in any event, in contrast, in fact, in other words, in particular, in that case, in this case, in turn, indeed, instead, ironically, moreover, nonetheless, of course, oh gosh, oh my god, oh my gosh, on the other hand, or, particularly, rather, regardless, similarly, so, specifically, thereby, therefore, though, thus, too, uh, um. Note the inclusion of some interjections, (ah, oh my gosh, etc.), which are included in AM-DIS according to the PropBank guidelines (Babko-Malaya, 2005, p. 31).

D Auxiliary Rule Performance Breakdown
In Table 7, we provide a more detailed accounting of the improvements that arise from our auxiliary rules described in Section 4.2 and Table 5.

Setting | Objective
λ = 1 | Mixture of Unigrams Likelihood
λ = 0 | Jensen-Shannon Divergence
λ = −1 | Mutual Information

Table 9: Objectives reproduced by the HUM loss for different settings of λ, described in Appendix G.
The negation and discourse rules bring precision improvements, likely because they mostly have ADV dependencies outgoing. The modal rule improves both precision and recall because modals have many different kinds of outgoing dependencies, due to their status as heads of clauses (which can serve in many syntactic capacities). Finally, the passive alternation rule aids precision by splitting SBJ between active and passive uses, and aids recall by grouping LGS with the active SBJ and passive SBJ with active OBJ. This mainly affects the core argument labels A0 and A1, as shown in Table 5 -especially A1, as we also find for QA-SRL questions in Section 6.1.

E Tuned Splitting Evaluation
Our model has a single parameter λ which determines the number of clusters for each verb via the tradeoff between the data likelihood and clustering likelihood. We compare this to a constant baseline (the same number of clusters for all verbs) and an oracle upper bound which chooses the split that maximizes the purity/collocation F1 score for each verb independently. As shown in Table 8, we improve on the constant baseline by 1.8 points (85.3→87.1), but fall short of the oracle by 1.5 points (87.1→88.6). There is room for improvement, but errors in the tuning step may not be the most significant factor to concern future work.

F B3 Results

Results using B3 metrics on the models we tested are shown in Table 10.

Table 10: B3 results on models we tested. The gap between HUM-QQ and SYNTF is larger than for purity and collocation, as B3 is a tougher metric which is more discriminative between clusterings. The last model variant (+MI) is described in Appendix G.

G Related Clustering Algorithms
Recall the Hard Unigram Mixture loss L^HUM_λ(C) = L_data(C) + λ L_clust(C) from Section 3.2. Different settings of λ reproduce several objectives present in the literature, summarized in Table 9. As written in Section 3, when λ = 1, minimizing L^HUM_1 maximizes likelihood of the data X under a mixture of unigrams model (Nigam et al., 2000).
When the number of clusters k is fixed, setting λ = 0 as in our greedy merging step (Section 3.3) is equivalent to enforcing a uniform prior π over mixture components. In this case, the gain in loss on each merge is the Jensen-Shannon Divergence (JSD) between the merged clusters, scaled by their total size and using each cluster's size to determine its mixing weights in the divergence, as in the mixture-based definition of JSD by Lin (1991). JSD is used in the same way by Chrupała (2012), without the scaling and weighting, as a similarity measure for agglomerative clustering.
Finally, setting λ = −1 reduces the HUM loss to the mutual information between the QA-SRL questions under φ and the cluster assignment C, which has been used in prior work to encourage informative clusterings. This is related to the distributional clustering paradigm of Pereira et al. (1993), which aims to identify common factors that explain distributional data, and which Slonim and Tishby (1999) frame in terms of an information bottleneck that maximizes mutual information between the data and a jointly distributed 'relevance' variable (though in our case, the reference variable is the cluster assignment itself). Setting λ = −1 in the greedy merging step, we find (in Table 10) that using a mutual information criterion in this way hurts performance. We guess this is because the objective incentivizes clusters of uniform size, which does not match the highly skewed distributions of gold semantic roles.
I NPMI Calculation

Pointwise mutual information (PMI) is a measure of how likely two items (such as tokens in a corpus) are to occur together relative to chance (Church and Hanks, 1989). One feature of PMI is that it tends to be larger for rare events: if two items x and y always occur together, then their PMI is −log P(x, y). This can make it difficult to assess association patterns among items with greatly varying probabilities (e.g., the AM-CAU role appears for 1% of arguments, while A1 appears for 27%). So we use normalized PMI (NPMI; Bouma, 2009), which factors out the effect of item frequency on PMI. Formally, the NPMI of x and y is

NPMI(x, y) = log [ P(x, y) / (P(x) P(y)) ] / (−log P(x, y)),   (1)

taking the limit value of −1 when they never occur together, 1 when they only occur together, and 0 when they occur independently.

We use NPMI to analyze the co-occurrence of gold labels in predicted clusters: a pair of gold labels with high NPMI are preferentially grouped together by the induced roleset, whereas two labels with low NPMI are preferentially distinguished. The joint distribution between gold labels is generated by drawing one point (x) uniformly at random from the data, drawing another (y) uniformly at random from x's predicted cluster, and reading the gold labels of both. NPMI has been used to analyze clusters in this way in prior work.

Calculating NPMI naïvely on our full clustering has a caveat. The denominator of the PMI term in Equation 1, P(x) P(y), uses marginal probabilities of x and y over the corpus to calculate chance cooccurrence. But our clusters are constrained not to overlap between verbs, so this does not correctly estimate chance cooccurrence in our setting. Instead, we use the expectation over verbs of within-verb chance cooccurrence,

Σ_v P(v) P(x | v) P(y | v),

in place of P(x) P(y), where P(v) is proportional to the number of arguments for the verb v.
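A sketch of this calculation, including the within-verb chance correction described above (the data layout and function name are illustrative):

import math
from collections import Counter

def gold_label_npmi(clusters_by_verb):
    # clusters_by_verb: {verb: list of predicted clusters, each a list of
    # gold labels}. Returns NPMI for every observed pair of gold labels.
    verb_size = {v: sum(len(c) for c in cs) for v, cs in clusters_by_verb.items()}
    total = sum(verb_size.values())
    joint, chance = Counter(), Counter()
    for verb, clusters in clusters_by_verb.items():
        labels = [g for c in clusters for g in c]
        label_freq = Counter(labels)
        # Chance cooccurrence: P(v) * P(x | v) * P(y | v).
        for x, cx in label_freq.items():
            for y, cy in label_freq.items():
                chance[(x, y)] += (verb_size[verb] / total) * \
                    (cx / len(labels)) * (cy / len(labels))
        # Joint: draw a point uniformly, then a second point uniformly from
        # its predicted cluster, and read off both gold labels.
        for cluster in clusters:
            for x in cluster:
                for y in cluster:
                    joint[(x, y)] += 1.0 / (len(cluster) * total)
    return {pair: math.log(p_xy / chance[pair]) / (-math.log(p_xy))
            for pair, p_xy in joint.items()}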

J Question Distributions by Role
We list the top questions and their probabilities for modifier roles in Table 11. Questions for core roles and the ones covered by our lexical rules are in Table 12. We use verb (or verbs, or verbed) as a placeholder for the verb, which in practice is replaced with the predicate for a given instance.

Table 12: Top questions in the QA-SRL features on the training set for core roles and the ones covered by our lexical rules. The questions for AM-NEG, AM-MOD, and AM-DIS often don't make sense, e.g., asking for the subject of the verb. No QA-SRL questions are appropriate or were annotated for many arguments of these types. On the other hand, the core roles behave essentially as expected: A0 is dominated by the subject, A1 has a mix of subjects and objects, with some complements, and A2 and on have a wider spread of different expressions. Since the core argument roles have predicate-specific meanings, the distributions here can only be interpreted as aggregates across many such meanings.