Learning Semantic Role Labeling from Compatible Label Sequences

Semantic role labeling (SRL) has multiple disjoint label sets, e.g., VerbNet and PropBank. Creating these datasets is challenging, so a natural question is how to use each one to help the other. Prior work has shown that cross-task interaction helps, but has only explored multitask learning so far. A common issue with the multitask setup is that argument sequences are still decoded separately, running the risk of generating structurally inconsistent label sequences (as per lexicons like Semlink). In this paper, we eliminate this issue with a framework that jointly models VerbNet and PropBank labels as one sequence. In this setup, we show that enforcing Semlink constraints during decoding consistently improves the overall F1. With special input constructions, our joint model infers VerbNet arguments from given PropBank arguments with over 99 F1. For learning, we propose a constrained marginal model that uses knowledge defined in Semlink to further benefit from the large amount of PropBank-only data. On the joint benchmark based on CoNLL05, our models achieve state-of-the-art F1 scores, outperforming the prior best in-domain model by 3.5 (VerbNet) and 0.8 (PropBank). For out-of-domain generalization, our models surpass the prior best by 3.4 (VerbNet) and 0.2 (PropBank).


Introduction
Semantic Role Labeling (SRL, Palmer et al., 2010) aims to understand the role of words or phrases in a sentence. It has facilitated other natural language processing tasks, including question answering (FitzGerald et al., 2018), sentiment analysis (Marasović and Frank, 2018), information extraction (Solawetz and Larson, 2021), and machine translation (Rapp, 2022).
Semantic role labeling can take various forms, each associated with different datasets. Predicates can be coarsely divided into PropBank (Palmer et al., 2005) senses, each with a core set of numbered semantic arguments (e.g., ARG0-ARG5). There are also modifier arguments (e.g., ARGM-LOC), typically representing information such as the location, purpose, manner, or time of an event. Alternatively, predicates can be hierarchically clustered into VerbNet (Schuler, 2005) classes according to similarities in their syntactic behavior. Each class admits a set of thematic roles (e.g., AGENT, THEME) whose interpretations are consistent across all predicates within the class.
* Work done at the University of Utah.
As a modeling problem, SRL requires associating argument types and phrases with respect to an identified predicate. The two labeling tasks (i.e., VerbNet SRL and PropBank SRL) are closely related, but they differ in their treatment of predicates and have disjoint label sets. Learning them jointly can improve data efficiency across the different SRL labeling tasks.
A common formulation used to instantiate this idea in prior work is multitask learning (e.g., Strzyz et al., 2019; Gung and Palmer, 2021): each label set is treated as a separate labeling task, sometimes also modeled with inter-task feature interaction or consistency losses. While multitask learning often works well in such cases, the loss formulation represents a conservative view of label compatibilities across tasks. At prediction time, subtask modules still run independently and are not constrained by each other. Consequently, decoded labels may violate structural constraints with respect to each other. In such settings, constrained inference (e.g., Fürstenau and Lapata, 2012; Greenberg et al., 2018) has been found helpful. However, this raises the question of how to involve such inference during learning for better data efficiency. Furthermore, given the wider availability of PropBank-only data (e.g., Pradhan et al., 2013, 2022), how to efficiently benefit from such data also remains an open question.
In this paper, we argue that the two disjoint but compatible labeling tasks can be more effectively modeled as one task using their compatibility structures, which are already explicitly defined in the form of SEMLINK (Stowe et al., 2021). SEMLINK offers mappings between various semantic ontologies, including PropBank and VerbNet. Gung and Palmer (2021) devised a deterministic conversion from PropBank label sequences to VerbNet ones using only the unambiguous mappings in SEMLINK. This conversion gives a test bed covering half of the predicates in the CoNLL05 SRL dataset (Carreras and Màrquez, 2005) with both VerbNet and PropBank jointly labeled.
Given this setting, we propose a simple and effective joint CRF model for the VerbNet SRL and PropBank SRL tasks. In addition to the joint CRF, we propose an inference constraint that uses the compatible label structures defined in SEMLINK, and show that our constrained inference achieves a higher overall SRL F1 (the average of VerbNet and PropBank F1 scores) than the current state-of-the-art. Indeed, when PropBank labels are observed, it achieves over 99 F1 on VerbNet SRL, suggesting the possibility of an automated annotation helper. We show that our formulation naturally extends to a constrained marginal model that learns from the more abundant PropBank-only data in a semi-supervised setting. When learning and predicting with constraints, it achieves even better SRL F1 in out-of-domain generalization.

Joint Task of Semantic Role Labeling
We consider modeling VerbNet (VN) and PropBank (PB) SRL as a joint labeling task. Given a sentence x, we want to identify a set of predicates (e.g., verbs) and, for each predicate, generate two sequences of labels: one for VerbNet arguments y^V and one for PropBank arguments y^P. With respect to VN parsing, a predicate is associated with a VerbNet class that represents a group of verbs with shared semantic and syntactic behavior, thereby scoping a set of thematic roles for the class. Similarly, the predicate is associated with a PropBank sense tag that defines a set of PB core arguments along with their modifiers. An example is shown in Tab. 1.
We treat predicate classification and argument labeling as separate tasks and focus on the latter. Assuming predicates u and their associated VN classes η and PB senses σ are given along with x, we can write the prediction problem as:

(x, u, η, σ) → (y^V, y^P)    (1)

VerbNet Completion
There is a much larger amount of PropBank-only data (e.g., Pradhan et al., 2013, 2022) than jointly labeled data. Inferring VerbNet labels from observed PropBank labels, therefore, is a realistic use case. This corresponds to the modeling problem:

(x, u, η, σ, y^P) → y^V    (2)

We refer to this scenario as completion mode. In this paper, we focus on the joint task defined in Eq. 1 while also generalizing our approach to address the completion task in Eq. 2.

Multitask Learning and Its Limitations
When predicting multiple label sequences for SRL, a common approach is multitask learning using dedicated classifiers for each task that operate on a shared representation. The current state-of-the-art model (Gung and Palmer, 2021) used an LSTM stacked on top of BERT (Devlin et al., 2019) to model both PropBank and VerbNet. While each set of semantic roles is modeled jointly with VerbNet predicates, the argument labeling of the two subtasks is still kept separate. Separate modeling of VerbNet SRL and PropBank SRL has a clear disadvantage: subtask argument labels might disagree in three ways: 1) in terms of the BIO tagging scheme, e.g., a word having a B-* VN label and an I-* PropBank label; 2) assigning semantically invalid label pairs, e.g., an ARGM-LOC being called a THEME; or 3) violating SEMLINK constraints. In Sec. 6, we show that a model with separate task classifiers, while having close to state-of-the-art F1, can have a fair number of argument assignment errors with respect to SEMLINK, especially on out-of-domain inputs.

A Joint CRF Model
To eliminate the errors discussed in Sec. 2.2, we propose to model the disjoint SRL tasks using a joint set of labels. This involves converting multitask modeling into a single sequence labeling task whose labels are pairs of PB and VN labels. Doing so not only eliminates the BIO inconsistency, but also exposes an interface for injecting SEMLINK constraints.
Our model uses ROBERTA (Liu et al., 2019) as the backbone to handle textual encoding, similarly to the SRL model of Li et al. (2020). At a high level, we use a stack of linear layers with GELU activations (Hendrycks and Gimpel, 2016) to encode tokens to be classified for a predicate. For the problem of predicting arguments of a predicate u, we have an encoding vector ϕ_{u,i} for the i-th word in the input text x.
e " mappROBERTApxqq ϕ u " tf ua prf u pe u q, f a pe i qsq , @ i P xu (4) Here, map sums up word-piece embeddings to form a sequence of word-level embeddings, the functions f u and f a are both linear layers, and f ua denotes a two-layer network with GELU activations in the hidden layer.We use a dedicated module of the form in Eq. 4 for the VN and PB subtasks.This gives us a sequence of vectors ϕ v u for VN and a sequence of vectors ϕ p u for PB.Next, we project the VN and PB feature sequences into a |Y V | ˆ|Y P | label space: Here, g is another two-layer GELU network followed by a linear projection that outputs |Y V | |Y P | scores, corresponding to VN-PB label pairs.The final result z u denotes a sequence of VN-PB label scores for a specific predicate u.In addition, we use a CRF as a standard first-order sequence model over z u (treating it as emission scores), and use Viterbi decoding for inference.The training objective is to maximize: where sp¨q denotes the scoring function for a label sequence that adds up the emission and the transition scores, and the term Zpxq denotes the partition that sums exponentiated scores over all label sequences.The term y V P denotes the label sequence that has VN and PB jointly labeled.We will refer to this model as the joint CRF, and the label sequence as the joint labels.
Reduced Joint Label Space. We use the cross-product of the two label sets, prefixed with BIO prefixes. A brute-force cross product leads to a |Y_V| × |Y_P| label space. In practice, it is important to keep the joint label space small for efficient computation, especially for the CRF module. Therefore, we condense it by first disallowing pairs of the form (B-*, I-*) and predicate-to-argument pairs. The former enforces that VerbNet arguments do not start within PropBank arguments, while the latter ensures that the predicate is not part of any argument. Next, we observe the co-occurrence patterns of VN and PB arguments, disabling semantically invalid pairs such as (THEME, ARGM-LOC). This reduces the label space by an order of magnitude (from 144 × 105 = 15,120 to 685).
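The pruning rules can be sketched as a filter over the cross-product. The label sets and the co-occurrence table below are illustrative miniatures (the real sets have 144 VN and 105 PB tags, and the valid pairs come from corpus statistics), and only the two filtering rules stated above are shown.

```python
from itertools import product

# Toy label sets; the real sets have 144 VN and 105 PB tags.
vn_labels = ["B-AGENT", "I-AGENT", "B-THEME", "I-THEME", "O"]
pb_labels = ["B-ARG0", "I-ARG0", "B-ARGM-LOC", "I-ARGM-LOC", "O"]

# Assumed co-occurrence table of semantically valid (VN, PB) role pairs.
valid_pairs = {("AGENT", "ARG0"), ("THEME", "ARG0")}

def keep(vn, pb):
    # Rule 1: disallow (B-*, I-*), so a VN argument never starts
    # inside a PB argument.
    if vn.startswith("B-") and pb.startswith("I-"):
        return False
    # Rule 2: disallow semantically invalid role pairs.
    if vn != "O" and pb != "O" and (vn[2:], pb[2:]) not in valid_pairs:
        return False
    return True

joint_labels = [(v, p) for v, p in product(vn_labels, pb_labels) if keep(v, p)]
```

Even on this toy example the filter removes a large fraction of the raw cross-product, mirroring the order-of-magnitude reduction reported above.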
Input Construction using Predicates. We take inspiration from prior work (e.g., Zhou and Xu, 2015; He et al., 2017; Zhang et al., 2022) to explicitly include predicate features as part of the input to augment textual information. At the same time, we also seek to maintain a simple construction that can be easily adapted to a semi-supervised setting (i.e., compatible with PropBank-only data). To this end, we propose a simple solution that appends the PropBank senses of potential predicates to the original sentence x:

x_WP = [CLS w_1:T SEP σ_1:N SEP]

where w_1:T denotes the input words and σ_1:N denotes the senses of the N predicates. In practice, we use the PropBank roleset IDs, each consisting of a (lemma, sense) pair, e.g., run.01. Our models only take the encodings for w_1:T after the ROBERTA encoder and ignore the rest. We consider this design more efficient than prior work (e.g., Gung and Palmer, 2021; Zhang et al., 2022) that dedicated text features to each predicate. In our setup, the argument labeling for different predicates shares the same input, so there is no need to run the encoder multiple times for multiple predicates.
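The x_WP construction amounts to simple token-list concatenation; a hedged sketch follows, where the special-token names and the helper name are placeholders rather than the paper's actual code.

```python
def build_wp_input(words, rolesets, cls="[CLS]", sep="[SEP]"):
    """x_WP: append the PropBank roleset IDs of all N predicates to
    the sentence, so every predicate shares a single encoder pass.
    Special-token names are placeholder assumptions."""
    return [cls] + list(words) + [sep] + list(rolesets) + [sep]

tokens = build_wp_input(["She", "ran", "home"], ["run.01"])
```

Because the second segment lists all predicates at once, one encoding of `tokens` serves the argument labeling of every predicate in the sentence.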
Semi-supervised Learning with PropBank-only Data

Compared to data with both VerbNet and PropBank fully annotated, there is more data with only PropBank labeled. The SEMLINK corpus helps in unambiguously mapping ≈56% of the CoNLL05 data (Gung and Palmer, 2021). Therefore, a natural question is: can we use PropBank-only data to improve the joint task?
Here, we explore model variants based on the joint CRF architecture described in Sec. 3, focusing on how to model PropBank-only sentences.

Separate Classifiers for VN and PB
As a first baseline, we treat VN and PB as two separate label sequences during training. This is essentially a multitask setup where the VN and PB targets use separate classifiers. We let these two classifiers share the same ROBERTA encoder, while each has its own learnable weights for Eqs. 4-6.

Dedicated PropBank Classifier
Another option is to retain the joint CRF for the jointly labeled dataset and use an additional dedicated CRF for PB-only sentences. Note that this setup is different from the model in Sec. 4.1. As before, we let the two share the same encoder, while each has dedicated trainable weights for Eqs. 4-6.
During inference, we rely on the Viterbi decoding associated with the joint CRF module to make predictions.In our preliminary experiments, the joint CRF and the dedicated PropBank CRF achieve similar F1 on PropBank arguments.

Marginal CRF
For partially labeled sequences, we take inspiration from Greenberg et al. (2018) and maximize the marginal distribution of the observed labels. In our joint CRF, the marginalization assumes a uniform distribution over VN arguments that are paired with observed PB arguments. The learning objective is to maximize the probability of such label sequences as a whole:

log P(y^P | x) = LSE_{y ∈ y^P_u} s(x, y) − log Z(x)    (7)

where LSE_y(·) = log Σ_y exp(·), and y ∈ y^P_u denotes a potential joint label sequence with only the PropBank arguments observed for a predicate u. Scores of such label sequences are aggregated by the LSE operator. Note that the marginal CRF and the joint CRF (Eq. 6) use the same model architecture, just with a different loss.
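The marginal objective can be illustrated by brute-force enumeration on a toy problem: aggregate (via logsumexp) the scores of all joint sequences whose PB half matches the observation, and subtract the log partition over all sequences. A real CRF computes both terms with dynamic programming; the score function and label set here are assumptions for illustration.

```python
import numpy as np
from itertools import product

def logsumexp(xs):
    """Numerically stable log-sum-exp (the LSE operator)."""
    xs = np.asarray(xs, dtype=float)
    m = xs.max()
    return m + np.log(np.exp(xs - m).sum())

def marginal_log_likelihood(score_fn, labels, length, compatible):
    """log P(observed PB | x): LSE over joint sequences whose PB half
    matches the observation, minus log Z over all joint sequences.
    Brute-force enumeration; only feasible for toy sizes."""
    seqs = list(product(labels, repeat=length))
    log_z = logsumexp([score_fn(s) for s in seqs])
    log_num = logsumexp([score_fn(s) for s in seqs if compatible(s)])
    return log_num - log_z
```

When every sequence is compatible the marginal log-likelihood is exactly zero, and it becomes negative as the observation restricts the set, which matches the probabilistic reading of Eq. 7.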

Marginal Model with SEMLINK (Marginal_SEML)
The log marginal probability in Eq. 7 assumes a uniform distribution over a large label space: it includes arbitrary VerbNet arguments paired with the observed PropBank labels. In practice, we can narrow it down to only the legitimate VN-PB argument pairs defined in SEMLINK. This legitimate space is uniquely determined by a VerbNet class η and a PropBank sense σ. We refer to label sequences that comply with this space as y^SEML_u, and apply it to Eq. 7:

log P(y^P | x) = LSE_{y ∈ y^P_u ∩ y^SEML_u} s(x, y) − log Z(x)    (8)

Note that this formulation essentially changes the global optimization into a local version, which implicitly requires using SEMLINK at inference time. We present the details of y^SEML_u in Sec. 5. Intuitively, it zeros out losses associated with joint label traces that violate SEMLINK constraints. During training, we found it important to apply this constraint to both B-* and I-* labels.
Where to apply y^SEML_u? Technically, the summation over the reduced label space can be applied in different places, such as the partition Z in Eq. 6. We report performance in this setting in Sec. 7.2. In short, plugging the label filter y^SEML_u into the joint CRF (and therefore the jointly labeled data) has little impact on F1 scores, so we reserve it for the PropBank-only data (as in Eq. 8).

Inference with SEMLINK
Here we discuss the implementation of y^SEML_u. Recall that each pair of VerbNet class η and PropBank sense σ uniquely determines a set of joint argument labels for the predicate u. For brevity, we denote this set as SEML(u) (e.g., Tab. 1). Ultimately, we want the Viterbi-decoded label sequence to comply with SEML(u). That is:

∀(l^V, l^P) ∈ SEML(u) → ∀i: [∀l ∉ SEML(u): (y^V_i = l^V, l) ∉ y^VP_u ∧ (l, y^P_i = l^P) ∉ y^VP_u]    (9)

where i denotes a location in the sentence and y^VP_u is the joint label sequence, consisting of (y^V_i, y^P_i) pairs. The constraint in Eq. 9 translates as: if a VerbNet argument is present in predicate u's SEMLINK entry, we prevent it from aligning to any PropBank argument not defined in SEMLINK; and the same applies to PropBank arguments.
This constraint can be easily implemented by a masking operation on the emission scores of the joint CRF, and thus can be used at both training and inference time. During inference, it effectively ignores label sequences with SEMLINK violations during Viterbi decoding:

ŷ^VP_u = argmax_{y ∈ y^SEML_u} s(x, y)    (10)

In Sec. 6, we will show that using Eq. 10 always improves the overall SRL F1 scores.
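The masking operation itself is small; a hedged numpy sketch is below. The joint label inventory and the SEMLINK entry are toy assumptions; setting a column to a large negative value effectively removes that joint label from any maximization over the scores.

```python
import numpy as np

NEG_INF = -1e9  # effectively removes a label from the Viterbi search

def mask_emissions(emissions, joint_labels, seml_pairs):
    """Mask emission columns whose (VN, PB) role pair is not licensed
    by the predicate's SEMLINK entry; Viterbi over the masked scores
    then never selects a violating joint label. Toy sketch."""
    masked = emissions.copy()
    for j, (vn, pb) in enumerate(joint_labels):
        if (vn, pb) != ("O", "O") and (vn, pb) not in seml_pairs:
            masked[:, j] = NEG_INF
    return masked
```

Because the mask is applied to emissions rather than to the decoder, the same operation serves both constrained training (as in the marginal objective) and constrained Viterbi decoding.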
VerbNet Label Completion. For models based on our joint CRF, we mask out joint labels that are not defined in y^P during inference, similar to Eq. 8. For models with separate VN and PB classifiers (Sec. 4.1), we force VN's Viterbi decoding to search only over arguments that are compatible with the gold y^P under SEMLINK. Furthermore, we always use the constraint (Eq. 9) in completion mode.

Experiments
In this section, we aim to verify whether the compatibility structure between VerbNet and PropBank (in the form of SEMLINK) has a positive impact on sequence labeling performance.

Data
We follow the prior state-of-the-art (Gung and Palmer, 2021) in extracting VerbNet labels from the CoNLL05 dataset using the SEMLINK corpus. We use the same version of SEMLINK to extract the data for training and evaluation; therefore, our F1 scores are directly comparable with theirs (denoted as IWCS2021 in Table 2). The resulting dataset accounts for about 56% of the CoNLL05 predicates across the training, development, and test sets (including WSJ and Brown). We refer to this data as the joint data column in Table 2. For semi-supervised learning, we incorporate the rest of the PropBank-only predicates in the CoNLL05 training split. For development and testing, we use the splits in the joint dataset for fair comparison with prior work.

Training and Evaluation
We adopt the same fine-tuning strategy as in Li et al. (2020): we fine-tune twice, since this generally outperforms fine-tuning only once, even with the same number of total epochs. In the first round, we fine-tune our model for 20 epochs. In the second round, we restart the optimizer and learning rate scheduler and fine-tune for 5 epochs. In both rounds, the checkpoints with the highest average VN/PB development F1 are saved. For each model variant, we report the average F1 from models trained with 3 random seeds.
For the SEMLINK constraint, we use the official mapping between VN/PB arguments. When performing constrained inference, we use the gold VerbNet classes η and PropBank senses σ.
For evaluation, in addition to the standard VN and PB F1 scores, we also report the percentage of predicates whose predictions are inconsistent with SEMLINK, denoted by ρ.

Performance on SEMLINK Extracted CoNLL05
We compare models trained on the joint dataset with the variants (Sec. 4) in the semi-supervised setup. Table 2 presents their performance along with SEMLINK violation rates in model predictions. Note that the ground-truth joint data has no violations at all (i.e., ρ = 0).

Multitask involves SEMLINK violations.
Firstly, we show the limitations of multitask learning. While the architecture is simple, the test scores mostly outperform the previous state-of-the-art, except for the Brown PropBank F1. This also comes with a degraded SEMLINK error rate (3.43 → 4.08 on WSJ and 8.71 → 10.48 on Brown). While SEMLINK inconsistency is not reported in Gung and Palmer (2021), we believe that, due to the nature of multitask learning, SEMLINK errors are inevitable.
Joint CRF outperforms multitask learning. A direct comparison is between Multitask and Joint. Our joint CRF obtains a higher overall SRL F1 across the WSJ and Brown sets. A similar observation applies to the semi-supervised setting, where Multitask compares to Joint+CRF_PB. Most of these improvements come from the use of inference-time SEMLINK constraints.
Inference with SEMLINK improves SRL. In Table 3, we make a side-by-side comparison of using versus not using the SEMLINK structure during inference, for each modeling variant. With constrained inference, models no longer have SEMLINK structural violations (ρ = 0). This results in a clear trend where using SEMLINK systematically improves the F1 scores. We hypothesize this is due to the reduced search space, which makes decoding easier. Likely due to the higher granularity of VerbNet argument types compared to PropBank, a majority of the improvements are on the VN F1s.
Does semi-supervised learning make a difference?
The answer is that it depends. For Multitask, using PropBank-only data traded off a bit of overall WSJ F1 but improved the out-of-domain performance. Accompanying this trade-off is a slightly higher inconsistency rate ρ. The Joint+CRF_PB model tells the opposite story: the partially labeled data helps on the in-domain test but not on the out-of-domain test. This observation is also consistent with both the Marginal CRF and the constrained marginal model (Marginal_SEML). Furthermore, when performance improves, the margins on VN and PB are fairly distributed. Finally, we note that neither Joint+CRF_PB nor Marginal achieves a better Brown F1 than Joint, meaning that they did not use the PB-only data efficiently.
Impact of marginal CRF. We compare Joint+CRF_PB to Marginal to see how a single CRF handles partially labeled data. The latter outperforms the former consistently by 0.2 on the in-domain test set but performs slightly worse on the out-of-domain Brown set. Compared to the Joint model, it seems that naively applying the marginal CRF leads to even worse generalization.
Constrained marginal model improves generalization. We want to see if our constrained model can help learning. Modeling PropBank-only data with a separate classifier (i.e., Multitask and Joint+CRF_PB) failed to do so (although both indeed work better on the in-domain WSJ). In contrast, our constrained Marginal_SEML apparently learns from the PropBank-only data more efficiently, achieving strong in-domain performance and substantially better out-of-domain generalization. This suggests that learning with constraints works better with partially labeled data. Interestingly, it seems that the constraint is optional for fully annotated data, since Marginal_SEML only enables the constraint on PB-only data. We verify this phenomenon in Sec. 7.2 with an ablation study.

Statistical Significance of Constrained Inference.
We measure statistical significance using a t-test implemented by Dror et al. (2018) on predictions from the models in Table 3. For each model, we compare inference with and without SEMLINK constraints (Sec. 5). For a fair comparison, each test is limited to predictions from models trained with the same random seed, and we apply the test for all random seeds (3 in total). We observe that the improvements on VerbNet F1 are universally significant.
The p-values are far below 0.01 across different test sets, models, and random seeds. This aligns with the observation in Table 3 that VN F1 receives a substantial boost while PB F1 improvements tend to be marginal.
To look closer, we examined the predictions of a Joint model on the dev set (1,794 predicates) and found that, after using SEMLINK during inference, 51 wrongly predicted predicates in VN SRL were corrected (i.e., improved predicate-wise F1) and no predicates received a degraded F1. For PropBank SRL, however, 12 predicates were corrected by the constraint while 6 became errors.

VerbNet Label Completion from PropBank
As discussed in Sec. 1, we also aim to address the realistic use case of VerbNet completion. Table 4 summarizes the performance of VerbNet argument prediction when gold PropBank arguments are given. In completion mode, the Joint model generally performs better than all the semi-supervised models. This is likely because the Joint model is optimized for the probability P(y^V, y^P | x), while the semi-supervised models, in one way or another, have a term for Σ_{y^V} P(y^V, y^P | x) on PB-only data. The latter term does not explicitly boost the model's discriminative capability on the unique ground truth.
In addition to the x_WP input construction in Sec. 3, we propose a special construction x_COMP for the VN completion mode that uses the PB arguments as text input:

x_COMP = [CLS w_1:T SEP y^P_1 ... σ_v y^P_{v+1} ... SEP]

where y^P_i denotes the PropBank argument label for the i-th word. For the predicate word, we use the predicate feature (i.e., lemma and sense). Compared to x_WP, this formulation makes computation less efficient, since the input x_COMP is no longer shared across different predicates. However, it offers a more tailored input signal and delivers above 99 F1 on both WSJ and Brown.
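The x_COMP construction can be sketched in the same style as x_WP: the second segment carries one token per word, namely the observed PropBank label, with the predicate position replaced by its roleset ID. Token names and the helper name are placeholders, not the paper's actual code.

```python
def build_comp_input(words, pb_labels, pred_idx, roleset,
                     cls="[CLS]", sep="[SEP]"):
    """x_COMP: second segment holds the observed PB label of each
    word, with the predicate slot replaced by its roleset ID
    (lemma.sense). Special-token names are placeholder assumptions."""
    second = list(pb_labels)
    second[pred_idx] = roleset
    return [cls] + list(words) + [sep] + second + [sep]

tokens = build_comp_input(["She", "ran", "home"],
                          ["B-ARG0", "O", "B-ARGM-DIR"], 1, "run.01")
```

Unlike x_WP, the second segment depends on the predicate being completed, which is why this input cannot be shared across predicates.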

Analysis
We report statistical metrics in Sec. 7.1. In Sec. 7.2, we analyze the use of constrained learning on the jointly labeled data.

Variance of SRL Models
A majority of the F1 scores in Tab. 2 vary within a small range. Models trained on joint-only data show higher variance on the out-of-domain Brown test set. Among the semi-supervised models, the Marginal models exhibit high F1 variance on the Brown set, while the Marginal_SEML models behave more stably.

Impact of Learning with SEMLINK Constraint on Joint Data
In Table 5, we use the form of constrained learning in Eq. 8 but apply it to the joint CRF loss over the jointly labeled training data. Note that the constraint term only affects the denominator in Eq. 6. Overall, the effect of SEMLINK at training time seems small.

Related Work

A rich body of prior work investigated inference with constraints (e.g., Punyakanok et al., 2004; Surdeanu et al., 2007; Punyakanok et al., 2008). Other work developed constrained models for learning (e.g., Chang et al., 2012) or incorporated constraints with emerging neural models (e.g., Riedel and Meza-Ruiz, 2008; Fürstenau and Lapata, 2012; Täckström et al., 2015; FitzGerald et al., 2015; Li et al., 2020).
VerbNet SRL, on the other hand, is often studied as a comparison or a helper for PropBank SRL (Kuznetsov and Gurevych, 2020). Yi et al. (2007) showed that the mapping between VerbNet and PropBank can be used to disambiguate PropBank labels. Model performance on VerbNet SRL has been shown to be affected more by predicate features than PropBank SRL (Zapirain et al., 2008). In this sense, our observation that VerbNet F1 gains larger improvements from SEMLINK is also consistent with prior work. Beyond comparison work, Kazeminejad et al. (2021) explored the downstream impact of VerbNet SRL and showed promising uses in entity state tracking.

Multitask Learning
The closest work to this paper is Gung and Palmer (2021). Instead of modeling argument labels and predicate classes via multitasking, we adopt a simpler design and focus on joint SRL argument labeling. This comes with two benefits: 1) a focused design that models joint labels; and 2) an easy extension to the marginal CRF for partially labeled data. Other technical differences include a generally better transformer encoder (ROBERTA; Liu et al., 2019) instead of BERT (Devlin et al., 2019), a simpler input construction, and our proposal of the completion mode.
Marginal CRF. Greenberg et al. (2018) explored the use of a marginal CRF on disjoint label sequences in the biomedical domain. Disjoint label sequences are concatenated into one, thus requiring dedicated decoding to reduce inconsistency with respect to various structural patterns (e.g., aligned BIO patterns). In this paper, we take a step further by pairing label sequences to form a joint SRL task, allowing an easy interface for injecting decoding constraints.
Broader Impact. Recent advances in large language models (LLMs) have shown promising performance on semantic parsing tasks (e.g., Drozdov et al., 2022; Mekala et al., 2022; Yang et al., 2022). A well-established approach is iterative prompting (e.g., Chain-of-Thought (Wei et al., 2022)), potentially using in-domain examples for prompt construction. While such work bears many technical differences from ours, there are advantages that can potentially be shared. For instance, our direct use of a knowledge base (SEMLINK in our case) allows for a guarantee of zero violations, while LLM-based work is less reliant on training data. Another scenario arises when treating semantic structures as explicit intermediate products, such as for language generation. Our joint modeling allows for ≥ 99% accuracy in converting PropBank arguments to VN arguments; when such labels are used for prompted inference, fewer errors are made.
Conclusions

In this work, we presented a model that learns from compatible label sequences for the SRL task. The proposal includes a joint CRF design, an extension for learning from partially labeled data, and reasoning and learning with SEMLINK constraints. On the VerbNet and PropBank benchmark based on CoNLL05, our models achieved state-of-the-art performance with especially strong out-of-domain generalization. For the newly proposed task of completing VerbNet arguments given PropBank labels, our models are near perfect, achieving over 99 F1.

Limitations
Towards a fully end-to-end parser. Our model architecture is on the path toward an end-to-end SRL parser, but it still assumes that gold predicate positions and predicate attributes are given. A fully end-to-end parser could take raw-text sentences and output the disjoint label sequences. While doing so can make computation less efficient (e.g., requiring substantially more memory for training), it would be more convenient for users.
Involving document context. Gung and Palmer (2021) showed that using neighboring sentence prediction with a transformer positively impacts parsing F1. In contrast, we assumed sentences in the corpus are independent.
Why does PropBank seem more difficult? We hypothesize the reasons to be less granularity in argument labels and more ambiguous label assignments. As mentioned in Sec. 3, prior work benefited from using dedicated label text/definitions as auxiliary input. We only used such features at the predicate level, implicitly trading off potential gains on PB F1 for more efficient computation.

The marginal model's capacity for handling constraints. In this paper, we focused on the SEMLINK constraint for compatible label sequences. There is a broad spectrum of SRL constraints in prior work (Punyakanok et al., 2008), some of which do not easily fit the marginalization formulation, such as the unique core role constraint.

Table 3 :
Impact of SEMLINK at inference time. Each data point represents the average of 3 random runs. Improvements on VN F1 are both substantial and significant (p-value ≪ 0.01).

Table 4 :
VN completion with gold PB labels.Results are averaged over models trained with 3 random seeds.

Table 5 :
Ablation of the SEMLINK constraint during training using the joint CRF. ✓ indicates the SEMLINK constraint is applied; ✗ indicates it is not. On the WSJ test set, both VN and PB F1s are fairly close. The Brown test F1s drop, especially on VN, suggesting that constrained learning on the joint data is not needed.