Semantic Role Labeling Meets Definition Modeling: Using Natural Language to Describe Predicate-Argument Structures



Introduction
Commonly regarded as one of the key ingredients for Natural Language Understanding (Navigli, 2018), Semantic Role Labeling (Gildea and Jurafsky, 2002, SRL) aims at identifying "Who did What to Whom, Where, When, and How?" within a given sentence (Màrquez et al., 2008). More precisely, for each predicate in the sentence, the task requires: i) selecting its most appropriate sense from a predetermined linguistic inventory; ii) identifying its arguments, i.e., those parts of the sentence that are semantically related to the predicate; and, iii) assigning a semantic role to each predicate-argument pair, as shown in Figure 1. Due to the potential uses of these semantically rich structures, the research community has seen steady progress in the task, and SRL has been shown to be beneficial for an increasingly wide range of applications in Natural Language Processing (NLP), such as Question Answering (Shen and Lapata, 2007), Information Extraction (Christensen et al., 2011), Machine Translation (Marcheggiani et al., 2018), and Summarization (Mohamed and Oussalah, 2019), as well as in Computer Vision for Situation Recognition (Yatskar et al., 2016) and Video Understanding (Sadhu et al., 2021), inter alia.

Figure 1. A: SRL annotations using predicate sense and semantic role labels (top) compared with their natural language definitions (bottom). B: the semantics of sense and role labels is undefined for out-of-inventory predicates (e.g., the inventories used for CoNLL-2009 and CoNLL-2012 do not include an entry for "google"), but we can still use valid natural language definitions.
An important yet often overlooked aspect of SRL is that, since its conception, the formulation of the task has generally relied upon predetermined linguistic resources, such as FrameNet (Baker et al., 1998), PropBank (Palmer et al., 2005), VerbNet (Kipper Schuler, 2005) and, more recently, VerbAtlas (Di Fabio et al., 2019), which provide the labels to be used for tagging predicates and their arguments with senses and semantic roles, respectively. Therefore, to this day, SRL has been framed predominantly as a classification task in which systems assign discrete labels to portions of a sentence (Figure 1A, top). Although recent systems have achieved impressive results on standard benchmarks (Hajič et al., 2009; Pradhan et al., 2012) in English (Shi and Lin, 2019; Marcheggiani and Titov, 2020) as well as in multilingual SRL (He et al., 2019; Conia et al., 2021), we observe and emphasize that relying upon discrete labels raises the following critical questions:
• The assumption that both predicate senses and semantic roles can be unequivocally categorized into distinct classes has long been, and still is, at the center of numerous discussions because the boundaries between meanings are not always clear-cut (Tuggy, 1993; Hanks, 2000); unsurprisingly, disambiguation approaches that are not tied to specific inventories have been gaining momentum (Bevilacqua et al., 2020; Barba et al., 2021a,b).
• FrameNet, PropBank, and VerbNet are heterogeneous, non-overlapping resources that have led, consequently, to specialized techniques that are more effective on PropBank's rather than FrameNet's labels, or vice versa.
• Relying on any predetermined inventory hinders the ability to generalize to out-of-inventory instances. For example, some rare senses or neologisms may not be covered by the inventory of choice, which, therefore, does not define either their possible senses, or their corresponding semantic roles (Figure 1B, top).
Furthermore, recent progress in NLP at large has primarily pursued state-of-the-art results without giving much importance as to why a system may have a predilection for one particular option over the alternatives, thus making it difficult for a human to interpret their output. And SRL is no exception to this. In this paper, instead, we put forward a generalized formulation of Definition Modeling, the task of defining the meaning of a word or multiword expression in context, to reframe SRL as the task of describing sentence-level semantic relations between a predicate and its arguments using natural language definitions only. More specifically, our contributions can be summarized as follows:
1. We move away from discrete labels and introduce a novel formulation of SRL that reframes the problem as the task of using natural language to describe predicate-argument structures (Figure 1A, bottom).
2. We propose DSRL (Descriptive Semantic Role Labeling), a simple yet effective conditional generation model to produce such natural language descriptions, dropping discrete labels while also demonstrating how to use these descriptions to retrieve standard SRL labels and achieve competitive or even state-of-the-art results on gold benchmarks.
3. In contrast to previous work, our approach provides an interpretable output in natural language, can seamlessly produce descriptions according to different linguistic theories and annotation formalisms, and naturally admits descriptions for out-of-inventory instances (Figure 1B, bottom).
4. We provide an in-depth analysis of the strengths and pitfalls of our approach, showing where there is still room for improvement.
We hope that our semantically-driven descriptions in natural language, free of resource-specific labels that require expert knowledge of SRL, will not only enable easier integration of sentence-level semantics into downstream applications but also provide valuable insights to NLP researchers.

Related Work
Linguistic resources for SRL. As mentioned above, SRL is generally associated with a linguistic theory and a corresponding linguistic resource, which defines an inventory of predicate senses and semantic roles (Baker et al., 1998; Palmer et al., 2005; Kipper Schuler, 2005). These inventories are a rich and diverse source of expert-curated knowledge; however, aligning sense and semantic role labels across such resources using manual or automatic techniques (Giuglea and Moschitti, 2006; Palmer, 2009; Lopez de Lacalle et al., 2014; Stowe et al., 2021; Conia et al., 2021) is far from trivial due to their heterogeneous nature, variable degree of coverage, and different granularity. Perhaps it is this complexity that has led researchers towards the development of approaches that are effective mainly in just one of the task "styles", usually PropBank-style SRL (Marcheggiani et al., 2017; Cai et al., 2018; Strubell et al., 2018; Shi and Lin, 2019; Blloshmi et al., 2021; Conia and Navigli, 2022, inter alia) or FrameNet-style SRL (Swayamdipta et al., 2017; Peng et al., 2018; Lin et al., 2021; Pancholy et al., 2021, inter alia). To sidestep this situation, recent studies have analyzed the feasibility of moving away from rigorous linguistic resources and have looked into capturing predicate-argument relations as question-answer pairs, with promising results in the production of questions through slot-filling templates and generative models (He et al., 2015; FitzGerald et al., 2018; Pyatkin et al., 2021). In this paper, instead, we reframe SRL as a generalization of Definition Modeling and directly generate human-readable descriptions of the semantic relations between a predicate and its arguments, replacing discrete labels with natural language definitions to overcome the heterogeneities of linguistic inventories.
Recent approaches in SRL. Independently of the linguistic inventory of choice, given the complexity of the task, early work often employed separate systems for each step of the SRL pipeline (Roth and Lapata, 2016; Marcheggiani et al., 2017). However, in recent years, researchers have successfully managed to develop end-to-end approaches (Cai et al., 2018; He et al., 2018), especially due to the increasing expressiveness of recent neural architectures. Since then, the attention of the community has mainly focused on when syntactic features are useful (Strubell et al., 2018) or can be dispensed with (Conia and Navigli, 2020). Further to this, several studies have also investigated the effectiveness of their proposed approaches on different annotation formalisms, namely, dependency- and span-based SRL (Li et al., 2019; Marcheggiani and Titov, 2020). (Hereafter, for simplicity, we follow PropBank and call these labels senses and semantic roles, respectively, independently of the resource.) Most recently, sequence-to-sequence models have found renewed traction by learning to directly generate predicate-argument structures as linearized sequences (Blloshmi et al., 2021; Paolini et al., 2021). Although the focus of our approach is to generate natural language descriptions, we stress that it can be flexibly employed to perform SRL in its traditional formulation, jointly tackling predicate sense disambiguation, argument identification and labeling in a syntax-agnostic fashion for both span- and dependency-based formalisms, the key difference being that our method also produces human-readable and, therefore, interpretable descriptions of the semantics of a sentence.
Definition Modeling. The task of Definition Modeling was originally concerned with producing a natural language definition for a given word and its corresponding embedding (Noraset et al., 2017). The formulation of the task was later generalized to take polysemy into account, as the same word may convey different meanings depending on the context it appears in. Although introduced only a few years ago, Definition Modeling has attracted significant interest (Ni and Wang, 2017; Ishiwatari et al., 2019) and has found success in semantic tasks (Huang et al., 2019; Bevilacqua et al., 2020) such as Word Sense Disambiguation (Bevilacqua et al., 2021, WSD) and Word-in-Context (Pilehvar and Camacho-Collados, 2019, WiC). Motivated by the success of Definition Modeling, we propose a novel generalization of its formulation, in which the objective is to use natural language not only to define a target word in context but also to describe its semantically-relevant sentential constituents.
Describing Predicate-Argument Structures using Natural Language

In this section, we introduce our novel reformulation of the SRL task (Section 3.1), describe DSRL, a simple yet effective autoregressive approach for it (Section 3.2), and show how to use DSRL to perform standard SRL (Section 3.3).

Task Formulation
Taking inspiration from Definition Modeling, we propose addressing predicate sense disambiguation, argument identification, and argument classification in an end-to-end fashion as the task of describing the argument structure of a predicate p in a sentence s by generating a natural language description t_p that defines not only p but also the semantic relations that connect p to its arguments a_1, a_2, ..., a_|A|, where A is the set of arguments of p. For example, if we consider the predicate p = "gave" in the sentence s = "Mary gave the book to John", then a valid natural language description of p and its argument structure could be represented as t_p = "give: transfer. [Mary]{giver} gave [the book]{thing given} [to John]{entity given to}". Indeed, such a sequence contains i) the predicate definition for predicate sense disambiguation, ii) all the arguments of p in s within square brackets for argument identification, along with iii) a definition of the semantic role of each argument within curly brackets.

Description Generation
To tackle our SRL formulation, we introduce a simple end-to-end autoregressive approach that, given an input sentence s and a predicate p in s, generates the natural language description t_p of its argument structure. In particular, we devise a sequence-to-sequence model whose input sequence s_p is defined as follows:

s_p = ⟨ w_1, ..., w_{i-1}, <p>, w_i, ..., w_{i+k-1}, </p>, w_{i+k}, ..., w_n ⟩

where w_i is the i-th word in the original sentence s, while <p> and </p> are two special markers that indicate the beginning and the end, respectively, of the predicate p, with k > 1 if p is a multiword expression. Correspondingly, we instruct the model to generate a semantically-augmented sentence t_p in which: i) the sense definition of p is prepended to the original sentence, ii) the arguments of p are enclosed within square brackets, and, iii) each argument is followed by its semantic role definition within curly brackets. More formally:

t_p = ⟨ p_1, ..., p_k, :, d^p_1, ..., d^p_{k'}, ..., [, w^{a_j}_1, ..., w^{a_j}_{m_j}, ], {, d^{a_j}_1, ..., d^{a_j}_{m'_j}, }, ... ⟩

where p_i is the i-th word of the predicate p, d^p_i is the i-th word of the definition of p, w^{a_j}_i is the i-th word for the j-th argument of p, and d^{a_j}_i is the i-th word of the definition of the semantic role for the j-th argument of p, while k', m_j and m'_j are the length of the definition of p, the length of the argument a_j, and the length of the definition of the semantic role for a_j, respectively. With this encoding, we then train our sequence-to-sequence model to learn the factorized probability p(t_p | s_p) defined as follows:

p(t_p | s_p) = ∏_{i=1}^{|t_p|} p(t^i_p | t^{<i}_p, s_p)

where t^i_p is the i-th token of t_p, by minimizing the cross-entropy loss with respect to the generated natural language description.
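As a minimal sketch (not the authors' code), the input and target encodings described above can be illustrated with plain string operations on the running example from Section 3.1. The function names and word-level handling are assumptions for illustration only; the actual DSRL model operates on BART subword tokens:

```python
# Illustrative sketch of the s_p / t_p encoding described above.

def encode_input(words, pred_start, pred_end):
    """Wrap the predicate span [pred_start, pred_end) in <p> ... </p> markers."""
    out = (words[:pred_start] + ["<p>"] + words[pred_start:pred_end]
           + ["</p>"] + words[pred_end:])
    return " ".join(out)

def encode_target(words, sense_def, args):
    """Prepend the sense definition, then wrap each argument (start, end, role_def)
    in [ ... ]{ ... }. Arguments are assumed non-overlapping and sorted."""
    pieces, i = [sense_def + "."], 0
    arg_map = {start: (end, role_def) for start, end, role_def in args}
    while i < len(words):
        if i in arg_map:
            end, role = arg_map[i]
            pieces.append("[" + " ".join(words[i:end]) + "]{" + role + "}")
            i = end
        else:
            pieces.append(words[i])
            i += 1
    return " ".join(pieces)

words = "Mary gave the book to John".split()
s_p = encode_input(words, 1, 2)
t_p = encode_target(words, "give: transfer",
                    [(0, 1, "giver"), (2, 4, "thing given"), (4, 6, "entity given to")])
print(s_p)  # Mary <p> gave </p> the book to John
print(t_p)  # give: transfer. [Mary]{giver} gave [the book]{thing given} [to John]{entity given to}
```

The sequence-to-sequence model is then trained to map s_p to t_p token by token.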

From SRL to Natural Language and Back
Given a dataset annotated with predicate sense and role labels from an inventory that defines such labels in natural language, we note that it is always possible to convert such a dataset to our formulation. Moreover, although the main objective of our approach is to generate an output sequence that describes sentence-level semantics, in several scenarios, it is still useful to work with discrete labels for predicate senses and semantic roles, e.g., to assess the quality of the generated structures on gold benchmarks with their standard metrics. We stress that our formulation generalizes standard SRL; casting the descriptions generated by our model to standard SRL labels is only possible if the label inventory of choice defines a suitable sense for the target predicate, which is not the case in Figure 1B (top) as the verb "to google" is not covered by PropBank. If the predicate is covered by the inventory, we can easily select the sense or the role label ȳ whose natural language description d_ȳ is most similar to the definition d_• generated for the predicate p or for one of its arguments a_j. We select ȳ as follows:

ȳ = argmax_{y ∈ Y} σ(f(d_•), f(d_y))

where σ(•) is a similarity function (e.g., cosine similarity), f(•) provides a vector representation of a definition, Y is the set of labels, and d_y is the definition of y as provided by the inventory of choice. We note that, for simplicity, we do not apply any post-processing to enforce the validity of the generated output, leaving more complex strategies (e.g., constrained decoding) as future work.
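The retrieval step above can be sketched as follows. This is an illustrative toy, not the authors' implementation: a bag-of-words counter stands in for the embedding function f(•) (DSRL uses SimCSE sentence embeddings), and the mini-inventory glosses are invented for the example:

```python
import math
from collections import Counter

def bow(text):
    """Toy stand-in for the embedding function f(.); DSRL uses SimCSE instead."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors, i.e., sigma(., .)."""
    dot = sum(u[w] * v[w] for w in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve_label(generated_def, inventory):
    """Pick the label y whose definition d_y is most similar to the generated one."""
    return max(inventory, key=lambda y: cosine(bow(generated_def), bow(inventory[y])))

# Hypothetical mini-inventory for the predicate "give" (labels and glosses
# invented for illustration; real glosses come from PropBank/FrameNet).
inventory = {
    "give.01": "transfer possession of something to someone",
    "give.02": "yield or collapse under pressure",
}
print(retrieve_label("transfer something to another person", inventory))  # give.01
```

With a real sentence encoder in place of `bow`, the same argmax recovers discrete labels from generated definitions whenever the predicate is covered by the inventory.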
Experiments and Results

Data
We train and evaluate DSRL on three widely adopted benchmarks for English SRL, namely: i) CoNLL-2009 (Hajič et al., 2009) for dependency-based PropBank-style SRL, ii) CoNLL-2012 (Pradhan et al., 2012) for span-based PropBank-style SRL, and iii) FrameNet 1.7 (Baker et al., 1998) for span-based FrameNet-style SRL. While CoNLL-2009 is a collection of finance-related news from the Wall Street Journal, CoNLL-2012 is a more heterogeneous corpus comprising news, conversations, and magazine articles. FrameNet 1.7, instead, provides a relatively small dataset of annotated documents; following the literature (Swayamdipta et al., 2017; Peng et al., 2018), we include in the training set "exemplar" sentences extracted from partially annotated usage examples from the lexicon itself. We provide a broader look at the characteristics of each dataset in Appendix B and further details about semantic role definitions in Appendix D.

Implementation Details
We implement DSRL using Sunglasses.ai's Classy. As our underlying sequence-to-sequence model, we use BART-large (Lewis et al., 2020), a Transformer-based neural network (400M parameters) pretrained with denoising objectives on massive amounts of unlabeled text. We do not modify its architecture except for the embedding layer, where we add the special tokens used to indicate predicates and their arguments, as described in Section 3.2. We train our model using RAdam (Liu et al., 2019) as the optimizer for a maximum of 500,000 steps with a batch size of 2048 tokens and a standard learning rate of 10^-5.
We measure the F1 score on the validation set at the end of each training epoch, adopting an early stopping strategy to interrupt the training process if the F1 score does not improve for 10 consecutive epochs. We do not modify any of the hyperparameters of BART compared to its pretraining phase, and, more generally, we do not run any hyperparameter search due to the cost of fine-tuning the language model. The training process is carried out on a single GPU (a GeForce RTX 3090) and requires about 10 hours for FrameNet, 15 for CoNLL-2009, and 20 for CoNLL-2012. We recall that, in order to evaluate our system with standard scoring scripts, we have to cast our descriptions to the discrete labels of the target inventory (see Section 3.3). For this step, we compute the cosine similarity between the representation of a generated description and those of the possible senses or roles, using the sentence-level embeddings of SimCSE (Gao et al., 2021).
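The patience-based early stopping criterion described above can be sketched in a few lines. This is a generic illustration, not the authors' training loop; the class name and the shortened patience are assumptions for the example:

```python
class EarlyStopping:
    """Minimal sketch of the early-stopping criterion described above:
    stop when the validation F1 has not improved for `patience` epochs."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, f1):
        """Report the F1 of the latest epoch; returns True if training should stop."""
        if f1 > self.best:
            self.best = f1
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=3)  # patience shortened for illustration
scores = [0.80, 0.85, 0.85, 0.84, 0.86, 0.86, 0.85, 0.84]
for epoch, f1 in enumerate(scores):
    if stopper.step(f1):
        print(f"stopping at epoch {epoch}, best F1 = {stopper.best}")
        break
```

In DSRL, the same check runs once per epoch with patience 10, and the checkpoint with the best validation F1 is kept.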

Comparison Systems
We compare our results with the current state of the art in PropBank-style and FrameNet-style SRL. Following standard practice in PropBank-based SRL, we report the results achieved by our system using gold pre-identified (but not disambiguated) predicates, i.e., the position of a predicate (but not its sense label) is given as input to the system.
PropBank-style SRL. We consider Li et al. (2019), who first quantified the benefits of contextualized word representations in both dependency- and span-based PropBank-style SRL, later surpassed by Shi and Lin (2019), who used BERT instead of ELMo, and Conia and Navigli (2020), who designed and took advantage of complex language-agnostic components. We also take into account some studies for PropBank-style SRL that found success by leveraging syntactic features, such as He et al. (2019), who devised a strategy to cleverly prune a sentence based on its syntactic dependency tree, and Marcheggiani and Titov (2020), who exploited graph convolutional networks to encode syntactic relations. Most recently, Blloshmi et al. (2021) proposed a simple and general approach to tackle SRL as a sequence-to-sequence task, in which, however, a system is still required to generate a linearized sequence of discrete labels.
FrameNet-style SRL. Although the research community has generally focused on PropBank-style SRL, especially due to the widespread adoption of PropBank in several CoNLL tasks (Carreras and Màrquez, 2005; Surdeanu et al., 2008; Hajič et al., 2009; Pradhan et al., 2012) and in other resources such as Abstract Meaning Representation (Banarescu et al., 2013, AMR), FrameNet-style SRL has also been at the center of notable studies, such as Swayamdipta et al. (2017), who investigated the effect of joint learning of syntactic and semantic features, and Peng et al. (2018), who instead showed the advantages of learning from disjoint data sources. Finally, we also consider recent work by Pancholy et al. (2021), who developed a data augmentation strategy using frame relations, and the above-mentioned Marcheggiani and Titov (2020), who introduced a graph-based neural architecture to tackle FrameNet-style SRL.

Main Results
Here, we first evaluate the robustness of DSRL in achieving strong or even state-of-the-art results on standard benchmarks, and then its flexibility in performing dependency- and span-based, PropBank- and FrameNet-style SRL. Remarkably, our model achieves even better results when jointly trained on dissimilar annotation formalisms and linguistic resources, despite their heterogeneous characteristics.
PropBank-style SRL. We first discuss the results obtained by DSRL on the gold standard benchmarks provided as part of the CoNLL-2009 and CoNLL-2012 Shared Tasks, annotated with PropBank sense and role labels. As can be seen in Table 1, we observe strong results in dependency-based SRL, reaching an F1 score of 92.5% on the English test set of CoNLL-2009. Therefore, despite having to cast our natural language descriptions to discrete labels, our approach performs in the same ballpark as the most recent state-of-the-art systems proposed by Conia and Navigli (2020) and Blloshmi et al. (2021); the fact that our approach is able to slightly outperform the latter (+0.1% in F1 score) is particularly meaningful, as they adopt the same pretrained language model (BART-large). We can observe the same behavior in span-based SRL, where our model, without any task-specific modifications, marginally surpasses (+0.1% in F1 score) that of Blloshmi et al. (2021) on the English test set of CoNLL-2012, as shown in Table 2. Thus, the key observation here is that a natural language output does not necessarily hurt performance.
FrameNet-style SRL. As shown in Appendix E, PropBank definitions for predicate senses and semantic roles are quite short, and therefore one may wonder whether our task reformulation is feasible in practice when using longer definitions from richer sources, such as FrameNet, in which the label definitions are up to three times longer. From our experiments, this is, indeed, the case: our approach achieves state-of-the-art results in full-structure extraction (Baker et al., 2007) on the test set of FrameNet 1.7, obtaining 79.3 in F1 score (Table 3). We note that the results are not directly comparable with previous work, as DSRL employs a language model (BART) that is different from that of other approaches, e.g., Marcheggiani and Titov (2020) used RoBERTa. However, the results achieved by DSRL still indicate the performance that a generative approach can obtain in frame-semantic parsing (Das et al., 2014), which might be considered more complex than PropBank-based SRL. Indeed, predicates in FrameNet usually have a higher degree of polysemy, and the semantic roles are sparser, e.g., there are more than 2000 different semantic role labels.

Quantitative Analysis

Rare and Unseen Senses
The probability with which a word assumes one of its possible senses follows Zipf's law (Kilgarriff, 2004), and thus it is very skewed towards the most frequent senses. Here, we analyze the bias that our system shows in predicting the most frequent predicate senses on the following partitions of the CoNLL-2009 and CoNLL-2012 test sets: i) MFS, all the instances containing predicates that are annotated with their most frequent sense; ii) LFS, all the instances containing predicates that are not annotated with their most frequent sense; iii) UNSEEN, all the instances containing predicates that are annotated with a sense that is not present in the training set.
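The MFS/LFS/UNSEEN partitioning above can be sketched as follows. This is an illustrative reconstruction, not the authors' evaluation code, and the toy data at the bottom is invented for the example:

```python
from collections import Counter

def partition(test_instances, train_senses):
    """Split (lemma, sense) test instances into MFS / LFS / UNSEEN partitions
    based on the sense frequencies observed in the training set."""
    counts = Counter(train_senses)        # (lemma, sense) -> training frequency
    mfs = {}                              # lemma -> its most frequent sense
    for (lemma, sense), _ in counts.most_common():
        mfs.setdefault(lemma, sense)      # first hit per lemma is the most frequent
    parts = {"MFS": [], "LFS": [], "UNSEEN": []}
    for lemma, sense in test_instances:
        if (lemma, sense) not in counts:
            parts["UNSEEN"].append((lemma, sense))
        elif mfs[lemma] == sense:
            parts["MFS"].append((lemma, sense))
        else:
            parts["LFS"].append((lemma, sense))
    return parts

# Hypothetical toy data for illustration.
train = [("run", "run.01")] * 5 + [("run", "run.02")] * 2 + [("give", "give.01")] * 3
test = [("run", "run.01"), ("run", "run.02"), ("google", "google.01")]
parts = partition(test, train)
print({k: len(v) for k, v in parts.items()})  # {'MFS': 1, 'LFS': 1, 'UNSEEN': 1}
```

Per-partition F1 scores are then computed over each of the three subsets separately.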
As we can see from Table 5, the performance of our system on predicate sense disambiguation is strong in the MFS partition, more than 98.5% in both CoNLL-2009 and CoNLL-2012, since the vast majority of predicates are annotated with their most frequent sense. This bias justifies the difference in F1 score between the MFS and LFS partitions, i.e., -11.9% and -9.3% on CoNLL-2009 and CoNLL-2012, respectively. As far as the UNSEEN partition is concerned, on the other hand, we observe that our approach seems to be capable of generating and retrieving senses that it has never seen at training time with a relatively low decrease in performance (-6.6% and -13.9% compared to the results on the LFS partition). Interestingly, the results on argument labeling are comparable between MFS and LFS predicates. However, there is still large room for improvement in the argument labeling of UNSEEN predicates, whose argument structure represents a more challenging zero-shot setting.

Data Efficiency
Considering the large expense entailed in manually annotating text with sense and role labels, we deem it indispensable to also evaluate the flexibility of a system in terms of its scalability on fewer training instances. Therefore, we analyze the results of our model by gradually reducing the training set to 75%, 50%, 25%, and 10% of its original size, and compare this learning curve with that of GSRL (Blloshmi et al., 2021). Notwithstanding the significant differences between the two approaches, both show similar learning curves on CoNLL-2009 and CoNLL-2012 (Figure 2), confirming that manually annotating more sentences eventually ceases to provide large improvements: in fact, the enormous effort of doubling the training instances of CoNLL-2012 by annotating another 100,000 predicates (from 50% to 100% of its original size) results in less than a 1.0% gain in F1 score. Interestingly, our system shows higher data efficiency in the lowest data regime, especially for span-based SRL with a 2.6% gain in F1 score over GSRL when they are both trained on 10% of the original dataset. We argue that our novel formulation better leverages the pretraining of the underlying language model in lower-data scenarios. However, when more training data is available, task-specific approaches are eventually able to close the gap. Finally, we investigate whether our approach is still capable of handling multiple inventories at the same time in low-data regimes. To this end, we trained the model with several combinations of inventories on 10% of their training data; the results are reported in Table 6.

Qualitative Analysis

Generation Examples
In Table 7, we provide some examples of the descriptions generated by our system. Given an input sentence, we compare its gold standard sequence (ĝ) with the one generated automatically (g). We find that, in some cases, the automatic descriptions are more contextual than the gold ones, occasionally overcoming the limitations of the linguistic inventories. In Example 1, for instance, the gold definition of the predicate brandish.01 is only applicable to weapons; instead, the model-generated sequence is preferable as the entity brandished is a flag. In other cases, such as in Example 2, our approach generates more descriptive definitions, e.g., depictor instead of agent, and thing described rather than theme. Furthermore, we show some examples in which the model generates semantically-appropriate natural language descriptions for out-of-inventory, and thus unseen, predicates. This is the case with Example 3, in which the model describes the semantics of nibble.01 (unseen at training time) by taking advantage of a similar predicate, namely, peck.01 (seen at training time). This is also true for noun predicates, as shown in Example 4.

Classes of Error
We identify three main classes of error: the first is directly connected to our system (Disambiguation Errors) and the other two (Out-of-Inventory Descriptions and Retrieval Errors) concern the noisy process we use to cast natural language descriptions to discrete class labels.
Disambiguation errors occur when the model generates a definition that does not describe the correct sense of a predicate in a given context. For example, the system provides the wrong definition for the predicate "bumble" in the following sentence s, misclassifying it as "speak quietly":
s: Shane survived the week only to have an executive bumbling his way into a criminal investigation.
• Gold: speak or move in a confused way
• Pred: speak quietly
We note that, given the autoregressive nature of the model, producing a wrong sense definition often compromises the entire argument structure.
Out-of-inventory descriptions may be produced by our approach, since it is not strictly tied to the vocabulary of a predefined linguistic resource. Although such predicate-argument structures are not present in the inventory, they can still provide correct semantic explanations. For instance, in the following sentence, the reference and the generated definitions convey the same semantics:
• Gold: dupe: trick. He meets [a French girl]{tricker} who dupes [him]{tricked} [into providing a home for her pet and then steals his car]{induced action}.
• Pred: dupe: deceive. He meets [a French girl]{deceiver} who dupes [him]{victim} [into providing a home for her pet and then steals his car]{tricked into}.
Associating "victim" with "tricked" is far from trivial, and such cases often result in retrieval errors, i.e., errors that are caused by the inability of the sentence embedding model, SimCSE in our case, to correctly capture the semantic similarity between the gold and generated definitions.

Conclusion
Recent progress in SRL has mainly revolved around the development of state-of-the-art systems which, however, are bound to specific predicate-argument inventories. In this paper, instead, we proposed a novel task formulation that takes a step towards putting interpretability and flexibility in the foreground: we reframed SRL as the task of describing the predicate-argument structure of a sentence using natural language only, which is human-interpretable by definition. Our experiments, supported by in-depth analyses, demonstrated that prioritizing interpretability does not come at the expense of performance. Furthermore, our approach is flexible enough to achieve competitive or even state-of-the-art results on popular gold standard benchmarks for SRL, showing that natural language can act as a bridge between heterogeneous linguistic resources, e.g., PropBank and FrameNet, and also annotation formalisms, e.g., dependency- or span-based SRL. We hope that our model will foster research in high-performance yet interpretable systems in NLP, and provide a means towards achieving easier integration of sentence-level semantics into downstream applications.

Limitations
Generation. Although our model achieves results on gold standard benchmarks that are on par with or even better than the current state of the art, its generative nature certainly makes it slower than previous work based on discriminative approaches (He et al., 2019; Shi and Lin, 2019; Conia et al., 2021). Indeed, our model generates the entire semantically-augmented sentence, i.e., the input sentence with its predicate-argument structures in natural language, autoregressively. While this issue also affects our most direct competitor (Blloshmi et al., 2021), which generates discrete labels, this is still a limitation, or, more precisely, a weakness, that we would like to remark upon. Indeed, before deploying our system in production environments, one should carefully weigh the advantages of our method against its slower inference times. The degree of slowdown will inevitably depend on the hardware, but we estimate that a generative approach could be several times slower than a discriminative one. However, this could also be a matter for further research on the topic; for example, non-autoregressive generative models are steadily narrowing the performance gap (Gu and Tan, 2022) while mitigating the weaknesses of current autoregressive approaches.
Evaluation. Section 6 and Table 7 provide a qualitative analysis of the behavior of our proposed approach on out-of-inventory instances, which may also include rare predicates or neologisms. We acknowledge that a quantitative analysis of how our model really performs on out-of-inventory instances would provide sounder evidence of the benefits of our approach. However, we do not possess the economic and human resources required to create a benchmark large enough for this purpose. We believe that such a benchmark could be a great contribution to the area of SRL, but the endeavor of annotating a significant number of out-of-inventory instances will require further study.
Multilinguality. Extending our work to multiple languages is still a challenge and may require more effort than current approaches, such as that proposed by Conia et al. (2021), which uses language-specific decoders on top of a shared cross-lingual encoder. One could consider pursuing a similar strategy, i.e., using a shared cross-lingual encoder and multiple language-specific autoregressive decoders. However, the main limitation here is the availability and the structure of current linguistic inventories in other languages and, therefore, of definitions in languages other than English. For instance, the Chinese PropBank inventory provided as part of the CoNLL-2009 Shared Task lacks definitions for the majority of the predicate senses, whereas the latest version is not freely distributed. Fortunately, the attention to multilingual SRL is increasing; for example, it would certainly be interesting to analyze the applicability of our approach to the recently released Global FrameNet project.

Ethics Statement
Pretrained language models have been shown to manifest undesirable biases inherited from the corpora on which they were trained using self-supervision strategies. We train our model starting from the weights of BART (Lewis et al., 2020); therefore, there is a high probability that these biases are also inherited, or even exaggerated, by our final models. However, we did not investigate such biases in this work; hence, we advise against using our model in a production environment without a careful analysis beforehand. Finally, we remark that the test sets of CoNLL-2009, CoNLL-2012, and FrameNet 1.7 contain relatively old documents about economics, politics, and past events that do not reflect the current situation. Therefore, the results on such benchmarks are intended only as a basis for comparison with previous approaches and not as a measure of the performance of our model in real-world applications.

A Data License
Both the CoNLL-2009 and CoNLL-2012 datasets are distributed by the Linguistic Data Consortium (LDC) and can be used under the LDC license. FrameNet 1.7, both the linguistic resource and its annotated dataset, is freely available upon request. We note that the original CoNLL-2012 Shared Task was concerned with the task of Coreference Resolution; however, given its SRL annotations, it soon also became a popular benchmark for span-based SRL.

B Data Statistics
In Tables 8, 9, and 10, we provide an overview of the statistics of the training, validation, and test sets, respectively, for the datasets we use in our experiments, namely, the English splits of CoNLL-2009, CoNLL-2012, and FrameNet 1.7. In particular, for each dataset, we report the number of sentences and their average length in tokens, with FrameNet having the longest sentences on average (+20% over CoNLL-2009 and +40% over CoNLL-2012). We also report the number of annotated predicates for each dataset; interestingly, FrameNet features around 6 arguments per predicate, a value that is much larger than those of CoNLL-2009 and CoNLL-2012, which feature around 2.5 arguments per predicate. These are probably among the reasons why the FrameNet dataset is particularly challenging, even for modern neural models.
Finally, we can also appreciate the differences between the characteristics of PropBank-style and FrameNet-style SRL. Indeed, FrameNet clusters predicate senses into frames, resulting in a smaller number of predicate classes (around 1,000) compared to PropBank (5,000 to 8,000). At the same time, the frame-specific semantic roles of FrameNet result in a much larger number of role classes compared to the coarse-grained semantic roles of PropBank.

C Training Sequence Statistics
In Table 11, we report the average length in characters of the sequences used to train our model. As we can see, FrameNet 1.7 features the longest sequences among the three datasets we take into account, in line with what we report in Appendix B.

D Argument Modifiers Definitions
The English PropBank features two categories of semantic roles: core and adjunct. If we define a semantic role as the relationship between an action or event (predicate) and one of its participants (argument), then the former category includes all those semantic roles that mark an important participant in the event, one that is expected to take part in it. In PropBank, these core roles are identified using the labels ARG0, ARG1, ..., ARG5, and their definitions change from predicate sense to predicate sense. Instead, the second category, namely the adjunct roles or argument modifiers, comprises general roles whose semantics is not specific to a particular predicate and which, therefore, can be used to tag general arguments, e.g., the time of the action (ARGM-TMP) or the place of the event (ARGM-LOC). We use the PropBank guidelines to translate such labels into natural language. In Tables 12 and 13, we list the argument modifier definitions that we use to train our model on CoNLL-2009 and CoNLL-2012, respectively. While we aimed at creating argument modifier definitions that are homogeneous with the core role definitions, we remark that we did not perform a search for better definitions. As one can see, some of the definitions reported in Tables 12 and 13 are the natural language equivalent of the labels (e.g., ARGM-ADV and its definition "adverbial modifier", ARGM-LVB and its definition "light verb", or ARGM-PRD and its definition "secondary predication", among others). We believe that a possible avenue for future research is looking into how to create better definitions for such semantic roles.
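Since adjunct roles are predicate-independent, translating them into natural language amounts to a fixed lookup table. The sketch below illustrates this; only the definitions quoted in this appendix are verbatim, and Tables 12 and 13 remain the authoritative source, so treat the mapping and the function name as illustrative assumptions:

```python
# Hypothetical mapping from PropBank argument-modifier labels to the
# natural language definitions used in place of the discrete labels.
ARGM_DEFINITIONS = {
    "ARGM-TMP": "time of the action",
    "ARGM-LOC": "place of the event",
    "ARGM-ADV": "adverbial modifier",
    "ARGM-LVB": "light verb",
    "ARGM-PRD": "secondary predication",
}

def definition_for(label: str) -> str:
    """Return the natural language definition for an adjunct role label,
    falling back to the label itself when no definition is available."""
    return ARGM_DEFINITIONS.get(label, label)

print(definition_for("ARGM-TMP"))  # time of the action
```

Note that core roles (ARG0-ARG5) cannot be handled this way, because their definitions are specific to each predicate sense and must be retrieved from the corresponding roleset instead.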

E Definitions Statistics
The length of the sequence that our model generates is certainly dependent on the length of the definitions we use to describe the sense of a predicate and its arguments. In this Appendix, we provide a broad look at the number of unique sense and role definitions that appear in the train, validation, and test sets of CoNLL-2009, CoNLL-2012, and FrameNet 1.7.
Interestingly, the difference between CoNLL-2009 and CoNLL-2012 in the average length of the semantic role definitions is even narrower, whereas the difference in length between PropBank-style and FrameNet-style role definitions widens even further, with FrameNet using role definitions that are almost four times longer than PropBank's. The difference in the length of the predicate sense and semantic role definitions between FrameNet and PropBank can be explained by the fact that, in the former resource, the definitions are richer and more detailed. For example, the agent of the predicate provide is defined just as "giver" in PropBank, whereas in FrameNet it is defined as "person that begins in possession of the theme and causes it to be in the possession of the recipient".

F Special Tokens
As mentioned in Section 3.2, we use some special tokens to instruct the model on task-specific functions. For example, we pre-identify a predicate in an input sentence by surrounding its tokens with the special tokens <p> and </p>, indicating the start and the end of a predicate, respectively. Table 16 lists all the special tokens we use in our model in addition to the standard ones (e.g., <s> and </s> to indicate the start and end of the generated sequence). We note that some of these special tokens can be used in combination. For example, combining <propbank> and <span-srl> informs the model that we want it to generate a sentence annotated with PropBank-style definitions according to the span-based formalism; instead, combining <framenet> and <span-srl> will result in a sentence annotated with FrameNet-style definitions using a span-based formalism.
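The assembly of such an input sequence can be sketched as follows. This is a minimal illustration assuming whitespace tokenization; the function names are hypothetical, and the exact formatting used by the model may differ from this sketch (the paper's Table 17 shows the actual sequences):

```python
def mark_predicate(tokens, start, end):
    """Surround the predicate span [start, end) with <p> ... </p>,
    the special tokens that pre-identify the predicate."""
    return tokens[:start] + ["<p>"] + tokens[start:end] + ["</p>"] + tokens[end:]

def build_input(sentence, pred_start, pred_end,
                inventory="<propbank>", formalism="<span-srl>"):
    """Prefix the control tokens (inventory and formalism) and mark the
    predicate, producing the model's input string."""
    tokens = mark_predicate(sentence.split(), pred_start, pred_end)
    return " ".join([inventory, formalism] + tokens)

print(build_input("The company googled its rivals", 2, 3))
# <propbank> <span-srl> The company <p> googled </p> its rivals
```

Swapping the control tokens (e.g., <framenet> in place of <propbank>) is all that is needed to request a different annotation style from the same model.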
For reference, we also provide a few examples of how these special tokens are inserted in an input or output sequence in Table 17, using sentences from the training set of CoNLL-2012.
For the implementation, we simply add these special tokens to the input and output vocabulary of the underlying language model (i.e., BART). The embeddings corresponding to the special tokens are randomly initialized and updated during training.
Figure 1: A: SRL annotations using predicate sense and semantic role labels (top) compared with their natural language definitions (bottom). B: the semantics of sense and role labels is undefined for out-of-inventory predicates (e.g., the inventories used for CoNLL-2009 and CoNLL-2012 do not include an entry for "google"), but we can still use valid natural language definitions.

Table 3 :
Results (%) on precision (P), recall (R) and F1 score on the English test set of FrameNet.

Table 5 :
Predicate and argument labeling scores on the test sets of CoNLL-2009 and CoNLL-2012. We report the performance (F1) on the most frequent senses (MFS), least frequent senses (LFS), and unseen senses (UNSEEN). Support indicates the number of instances (percentage) of the corresponding class.

Table 7 :
Generation examples. Given an input sentence, we compare the gold and the system-generated sequences. Predicates are underlined.

Table 8 :
Overview of the CoNLL-2009, CoNLL-2012, and FrameNet training datasets. For each dataset, we report the number of sentences (Total_s), the number of sentences with at least one annotated predicate (Annotated), the average number of tokens per sentence (Avg. Len.), the number of predicates (Total_p) and predicate senses (Senses), and the number of arguments (Total_a) and argument roles (Roles).

Table 9 :
Overview of the CoNLL-2009, CoNLL-2012, and FrameNet validation datasets. For each dataset, we report the number of sentences (Total_s), the number of sentences with at least one annotated predicate (Annotated), the average number of tokens per sentence (Avg. Len.), the number of predicates (Total_p) and predicate senses (Senses), and the number of arguments (Total_a) and argument roles (Roles).

Table 10 :
Overview of the CoNLL-2009, CoNLL-2012, and FrameNet test datasets. For each dataset, we report the number of sentences (Total_s), the number of sentences with at least one annotated predicate (Annotated), the average number of tokens per sentence (Avg. Len.), the number of predicates (Total_p) and predicate senses (Senses), and the number of arguments (Total_a) and argument roles (Roles).

Table 11 :
CoNLL-2009, CoNLL-2012, and FrameNet training sequence statistics. For each dataset, we report the average length in characters of the sequences used for training the model.

This can be explained by the narrower domain of CoNLL-2009, which features a significant portion of sentences about finance from the Wall Street Journal, whereas CoNLL-2012 covers a more varied set of domains. Although the number of unique sense definitions differs, the average length of these definitions is close between CoNLL-2009 and CoNLL-2012, suggesting homogeneous definitions despite the use of two different versions of the English PropBank. This is not the case when comparing the average length of the PropBank definitions used for CoNLL-2009 and CoNLL-2012 with those of FrameNet. Indeed, predicate sense definitions in FrameNet are two to three times longer on average than PropBank's. However, the experimental results reported in Tables 3 and 6 show that our proposed generative model is still able to produce these longer sense definitions. We can observe a similar picture in Table 15 for the definitions of the semantic roles.

Table 12 :
CoNLL-2009 argument modifier definitions. We provide descriptions for argument modifiers when they are not specified in the given predicate roleset.

Table 13 :
CoNLL-2012 argument modifier definitions. We provide descriptions for argument modifiers when they are not specified in the given predicate roleset.