On the Benefit of Syntactic Supervision for Cross-lingual Transfer in Semantic Role Labeling

Although recent developments in neural architectures and pre-trained representations have greatly increased state-of-the-art model performance on fully-supervised semantic role labeling (SRL), the task remains challenging for languages where supervised SRL training data are not abundant. Cross-lingual learning can improve performance in this setting by transferring knowledge from high-resource languages to low-resource ones. Moreover, we hypothesize that annotations of syntactic dependencies can be leveraged to further facilitate cross-lingual transfer. In this work, we perform an empirical exploration of the helpfulness of syntactic supervision for cross-lingual SRL within a simple multitask learning scheme. With comprehensive evaluations across ten languages (in addition to English) and three SRL benchmark datasets, including both dependency- and span-based SRL, we show the effectiveness of syntactic supervision in low-resource scenarios.


Introduction
The task of semantic role labeling (SRL) annotates predicate-argument structures in text and is thus a desirable output of natural language processing (NLP) pipelines designed to extract information from text (Gildea and Jurafsky, 2002; Palmer et al., 2010). Recent developments in neural architectures (Vaswani et al., 2017) and pre-trained contextualized representations (Devlin et al., 2019; Liu et al., 2019) have greatly improved the performance of SRL systems (Zhou and Xu, 2015; He et al., 2017; Tan et al., 2018; Shi and Lin, 2019). However, most previous work focuses on high-resource English SRL scenarios, and it remains a challenge to extend these approaches, which require plentiful supervised examples, to other languages where training resources may be limited.
A popular approach addressing this challenge is cross-lingual learning: leveraging the shared structures across human languages to transfer knowledge from high-resource languages to low-resource ones. Model transfer, where an SRL model is directly transferred across languages using shared representations (Kozhevnikov and Titov, 2013, 2014; Fei et al., 2020b), is a particularly promising approach thanks to recent developments in multilingual contextualized representations (Lample and Conneau, 2019; Conneau et al., 2020), which have proven effective for cross-lingual transfer (Wu and Dredze, 2019; Pires et al., 2019).
Another common strategy for improving SRL model performance in both high- and low-resource scenarios is incorporating syntactic information. Syntactic analysis was until recently considered a prerequisite for most SRL systems (Gildea and Palmer, 2002; Punyakanok et al., 2008) and has been shown to benefit recent neural models as well (Marcheggiani and Titov, 2017; He et al., 2018; Swayamdipta et al., 2018; Strubell et al., 2018). Despite much work exploring cross-lingual learning and incorporating syntactic information into SRL systems, most previous work explores these two avenues separately, though there are numerous reasons that carefully incorporating syntax into a cross-lingual SRL system could provide further benefits. First, whereas SRL annotations are limited to only about a dozen languages, much richer resources are available for syntax, thanks to the development of the Universal Dependencies (UD) framework and accompanying corpora (Nivre et al., 2016b, 2020), which defines syntactic annotations that are consistent across languages, with treebanks in over 100 languages to date. Second, UD treebanks in particular have the potential to increase beneficial sharing of information across languages by providing a unified syntactic structure to ground cross-lingual representations. Most previous work utilizing syntax for cross-lingual SRL has incorporated syntactic information only as an input to the model, either as sparse features (Kozhevnikov and Titov, 2013; Pražák and Konopík, 2017) or as structures for tree encoders (Fei et al., 2020b). These strategies require syntactic pre-processing by an additional model and can suffer from error propagation. In this work, we explore an alternative approach that has yet to be examined in the cross-lingual setting: adopting syntactic annotations as auxiliary supervision and performing multitask learning (Caruana, 1997) together with SRL (Swayamdipta et al., 2018; Strubell et al., 2018; Cai and Lapata, 2019).
To evaluate the extent to which syntactic supervision can help facilitate cross-lingual transfer in SRL, we perform a comprehensive empirical analysis on three SRL benchmark datasets, covering ten languages (in addition to English). We evaluate our models in both zero-shot and semi-supervised scenarios, and on both dependency- and span-based SRL. Highlights of our findings include:
• Training SRL models with syntactic supervision is consistently helpful in low-resource SRL scenarios (§3.2, §3.3, §3.4, §3.5).
• When direct syntactic annotations for the target language are lacking, available treebanks from related languages can be used instead to improve SRL performance (§3.4).
• For span-based SRL, a syntax-aware SRL decoder outperforms BIO tagging when combined with syntactic training (§3.5).
Our implementation is available at https://github.com/zzsfornlp/zmsp/.

Model
We adopt a typical encoder-decoder paradigm for multitask learning, performing syntactic dependency parsing and SRL together in one model. A shared encoder produces hidden representations for the input words, and each task has its own decoder that takes those shared representations as input and predicts task-specific labels. We hypothesize that syntactic training can provide helpful signals for SRL through the shared encoder.
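As a concrete illustration, the following PyTorch sketch shows the overall wiring of this scheme: one shared pre-trained encoder and one decoder per task. Class and variable names here are illustrative (not those of our released implementation), and the task heads are simplified to linear layers; the actual biaffine decoders are described in the following sections.

```python
import torch.nn as nn
from transformers import AutoModel

class MultitaskModel(nn.Module):
    def __init__(self, model_name="bert-base-multilingual-cased", num_labels=None):
        super().__init__()
        num_labels = num_labels or {"parse": 40, "srl": 30}
        self.encoder = AutoModel.from_pretrained(model_name)  # shared across tasks
        hidden = self.encoder.config.hidden_size
        # One task-specific head per task, all reading the same shared
        # representations. (Simplified to linear layers here; the real
        # decoders are the pairwise biaffine modules sketched below.)
        self.decoders = nn.ModuleDict({
            task: nn.Linear(hidden, n) for task, n in num_labels.items()
        })

    def forward(self, input_ids, attention_mask, task):
        h = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        return self.decoders[task](h)  # task-specific scores
```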

Encoder
We adopt multilingual pre-trained contextualized models as our encoder, following previous work reporting strong performance for SRL (Shi and Lin, 2019; He et al., 2019; Conia and Navigli, 2020) and cross-lingual learning (Wu and Dredze, 2019; Pires et al., 2019). For an input sequence of words w_1, ..., w_n, the encoder produces their contextualized representations h_1, ..., h_n. These pre-trained models take sub-word tokens as input, but our SRL and syntactic data have word-level annotations, so we take the first sub-token of a word as its representation. These representations are then provided to task-specific decoders.
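A small sketch of this first-sub-token pooling with HuggingFace tokenizers (the tokenizer name and helper function are illustrative assumptions, not our exact implementation):

```python
# Map word-level annotations onto sub-token encoder outputs by keeping
# only each word's first sub-token representation.
import torch
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
words = ["The", "flights", "were", "rescheduled"]
enc = tok(words, is_split_into_words=True, return_tensors="pt")

# word_ids() maps each sub-token position to its word index (None = special token).
word_ids = enc.word_ids()
first_sub = [i for i, w in enumerate(word_ids)
             if w is not None and (i == 0 or word_ids[i - 1] != w)]

def pool_first_subtoken(hidden, first_sub):
    # hidden: (seq_len, dim) encoder outputs; returns (num_words, dim).
    return hidden[torch.tensor(first_sub, dtype=torch.long)]
```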

Syntax Decoder
For the syntactic (dependency) parsing task, we ignore the single-head constraint in training and view it as a pairwise labeling task into the space of dependency labels $\mathcal{R}_d$:

$$p(r_d \mid w_H, w_M) = \frac{e^{\mathrm{score}_{r_d}(h_H, h_M)}}{\sum_{r \in \mathcal{R}_d \cup \{\epsilon\}} e^{\mathrm{score}_r(h_H, h_M)}}$$

Here, $p(r_d \mid w_H, w_M)$ denotes the probability that the head $w_H$ has a dependency relation $r_d$ to the modifier $w_M$ (where $\epsilon$ means no syntactic relation). Following Dozat and Manning (2017), we use biaffine modules for the scoring function ($\mathrm{score}_{r_d}$), which take the encoder representations and produce relation scores. For training, we use cross-entropy as the objective. Notice that although this type of pairwise formulation is not widely used for syntactic dependencies, it has been shown effective for semantic dependency parsing (Dozat and Manning, 2018). Our main motivation for adopting it here is to make the syntactic task more similar to SRL. (A potential additional benefit is that certain parameters of the output layers may be shareable between the syntactic and SRL decoders. In preliminary experiments, we did not find obvious improvements with a simple method of stacking another task-specific classification layer and sharing the middle biaffine layers, but this could be an interesting direction to explore with better parameter-sharing schemes.) In our syntactic parsing evaluation, we find that this method obtains results similar to the head-selection method.
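A minimal sketch of such a biaffine pairwise labeler (dimensions and names are illustrative; the real implementation adds masking and other details):

```python
import torch
import torch.nn as nn

class PairwiseBiaffine(nn.Module):
    """Score every (head, modifier) pair against each label plus epsilon."""
    def __init__(self, hidden_dim, num_labels, proj_dim=256):
        super().__init__()
        # Role-specific projections before the biaffine product.
        self.head_mlp = nn.Sequential(nn.Linear(hidden_dim, proj_dim), nn.ReLU())
        self.mod_mlp = nn.Sequential(nn.Linear(hidden_dim, proj_dim), nn.ReLU())
        # Label index 0 is reserved for epsilon ("no relation").
        self.biaffine = nn.Bilinear(proj_dim, proj_dim, num_labels + 1)

    def forward(self, h):                 # h: (batch, n, hidden_dim)
        n = h.size(1)
        head = self.head_mlp(h).unsqueeze(2).expand(-1, -1, n, -1).contiguous()
        mod = self.mod_mlp(h).unsqueeze(1).expand(-1, n, -1, -1).contiguous()
        # scores[b, i, j, r]: word i as head, word j as modifier, label r.
        return self.biaffine(head, mod)

# Training reduces to cross-entropy over all pairs, e.g.
#   loss = nn.CrossEntropyLoss()(scores.view(-1, scores.size(-1)), gold.view(-1))
```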

SRL Decoder
We focus on the end-to-end SRL task, which extracts both the predicates and their arguments (i.e., we do not assume gold predicates unless otherwise noted). For argument extraction, we explore two SRL formalisms: dependency-based SRL, which only requires labeling the syntactic head word of an argument, and span-based SRL, which requires labeling full argument spans.

Predicate Identification
Predicate identification is cast as a binary classification task. We use a linear scorer over each word's encoded representations to judge whether it triggers a semantic frame.

Dependency-based SRL
For dependency-based SRL, the problem can again be formalized as a pairwise labeling task, and we treat it in a similar way as in the syntax decoder:

$$p(r_s \mid w_P, w_A) = \frac{e^{\mathrm{score}_{r_s}(h_P, h_A)}}{\sum_{r \in \mathcal{R}_s \cup \{\epsilon\}} e^{\mathrm{score}_r(h_P, h_A)}}$$

Here, $p(r_s \mid w_P, w_A)$ denotes the probability that a predicate $w_P$ takes $w_A$ as an argument with the semantic role $r_s$ (where $\epsilon$ denotes no semantic relation). Again we use biaffine modules for scoring and cross-entropy as the objective function.

Span-based SRL
Predicting argument spans is usually cast as a sequence labeling problem, with most recent neural SRL models adopting a simple BIO-tagging decoder (Zhou and Xu, 2015; He et al., 2017; Tan et al., 2018; Shi and Lin, 2019). In this work, we further consider a two-step syntax-aware approach (Zhang et al., 2021), where the first step identifies the argument head and the second step decides span boundaries given that head. Here, the first step is exactly the task of dependency-based SRL, and we use the same decoder. For the second step, we adopt the span selection method from extractive question answering (Wang and Jiang, 2016; Devlin et al., 2019) and use two classifiers to decide the start and end of the span given the head word.
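A sketch of this second step: given the head position from step one, two scorers pick the span's start and end. The constraint that the span must contain the head is our illustrative assumption, and names are hypothetical:

```python
import torch
import torch.nn as nn

class HeadToSpan(nn.Module):
    """Step two: expand an argument head into a full span."""
    def __init__(self, dim):
        super().__init__()
        # Score every position against the head representation.
        self.start_scorer = nn.Bilinear(dim, dim, 1)
        self.end_scorer = nn.Bilinear(dim, dim, 1)

    def forward(self, h, head_idx):
        # h: (n, dim) word representations; head_idx: argument head position.
        n = h.size(0)
        head = h[head_idx].unsqueeze(0).expand(n, -1).contiguous()
        start_logits = self.start_scorer(head, h.contiguous()).squeeze(-1)
        end_logits = self.end_scorer(head, h.contiguous()).squeeze(-1)
        # Assume a valid span must contain the head word.
        pos = torch.arange(n, device=h.device)
        start_logits = start_logits.masked_fill(pos > head_idx, float("-inf"))
        end_logits = end_logits.masked_fill(pos < head_idx, float("-inf"))
        return start_logits.argmax().item(), end_logits.argmax().item()
```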

Training Scheme
To deal with the multi-task and multilingual scenarios, we adopt a simple training scheme. For each training step, we first sample a task (parsing or SRL), and then a language (source or target). Based on these, we sample a batch of instances from the corresponding dataset and train the model on the selected task. In our experiments, we apply fixed sampling rates for the selection of tasks and languages (1:2 for parsing vs. SRL and 1:1 for source vs. target). In preliminary experiments, we also tried varying sampling rates, but did not find obvious improvements. Exploration of more sophisticated training schemes is left to future work.
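The scheme is simple to implement; a sketch follows, with `datasets` as illustrative iterators keyed by (task, language) and a model assumed to return the task loss:

```python
import random

TASK_WEIGHTS = {"parse": 1.0, "srl": 2.0}       # 1:2 parsing vs. SRL
LANG_WEIGHTS = {"source": 1.0, "target": 1.0}   # 1:1 source vs. target

def sample_key(weights):
    keys = list(weights)
    return random.choices(keys, weights=[weights[k] for k in keys], k=1)[0]

def train(model, datasets, optimizer, num_steps=100_000):
    # datasets[(task, lang)] is an endless iterator over batches of roughly
    # 1024 tokens; model(**batch, task=task) returns the task loss.
    for _ in range(num_steps):
        task, lang = sample_key(TASK_WEIGHTS), sample_key(LANG_WEIGHTS)
        batch = next(datasets[(task, lang)])
        loss = model(**batch, task=task)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```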
We take English as the source language and transfer to other target languages. For experiments on UPB and FiPB, we assemble the English SRL dataset from EWT and its SRL annotations in PropBank v3. For CoNLL-2009 and OntoNotes, we utilize the corresponding English sets. For evaluation, we calculate labeled F1 scores for arguments. Conventionally, predicate senses are also evaluated for dependency-based SRL. However, cross-lingual transfer of sense disambiguation poses a nontrivial challenge (Akbik et al., 2016a), since it is lexicon-based and language-dependent. Moreover, argument labeling is more closely related to dependency syntax, while sense disambiguation is more semantic in nature, and semantically oriented signals (like bilingual dictionaries or parallel corpora) may be more directly effective for enhancing its cross-lingual transfer. Therefore, in this work we focus on arguments and do not perform or evaluate sense disambiguation, following the conventions of span-based SRL.
For syntactic resources, we use either UD treebanks or convert constituency trees to dependencies using Stanford CoreNLP (Manning et al., 2014). In most of our settings, we assume access to multilingual syntax annotations for both source and target languages. We regard this as a practical setting since UD treebanks are available for a wide range of languages and syntactic annotations may be easier to obtain than semantic ones.
We adopt pre-trained multilingual language models (multilingual BERT (Devlin et al., 2019) or XLM-R (Conneau et al., 2020)) to initialize our encoders and fine-tune the full models. We use the Adam optimizer (Kingma and Ba, 2014) with an initial learning rate of 2e-5. We train the models for 100K steps with a batch size of around 1024 tokens per step. All models are trained and evaluated on one GTX 1080 Ti GPU, and training one model usually takes around half a day.
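For reference, the optimization setup corresponds to a sketch like the following (the linear decay target of 2e-6 follows the hyper-parameters in Appendix A.2):

```python
import torch

def build_optimizer(model, num_steps=100_000, lr=2e-5, final_lr=2e-6):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    # Multiplicative factor decaying linearly from 1.0 (lr) to final_lr / lr.
    def decay(step):
        frac = min(step, num_steps) / num_steps
        return 1.0 - (1.0 - final_lr / lr) * frac
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=decay)
    return optimizer, scheduler  # call scheduler.step() once per training step
```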

UPB
UPB annotates target languages with English PropBank frames, which allows us to explore zero-shot experiments without any target SRL training resources. We follow the setting of Fei et al. (2020a): training the models with English SRL annotations (EWT) and directly applying them to the target languages. In this experiment only, we assume predicates are given, since UPB is limited to verbal predicates, which leads to discrepancies between source and target predicate annotations. For the syntactic resources, we take the corresponding treebanks (upon which UPB is annotated) from UD v1.4 (Nivre et al., 2016a) and simply include them as additional training data for syntactic supervision.

Comparisons
We first compare several strategies for using syntax; results on the development set are shown in Table 2. Here we utilize multilingual BERT (mBERT) as the basic encoder. The table is split into three groups:
• Syn varies which syntactic resources are used. The four rows denote no syntax (NoSyn), only source syntax (English; EnSyn), only target syntax (the other six languages; TargetSyn), and full syntactic resources (English plus the other six; FullSyn). Adding only source syntax is not helpful, but target syntax information is generally beneficial; furthermore, combining both source and target syntax leads to the best results.
• SEQ explores a sequential two-stage fine-tuning scheme (Phang et al., 2018; Wang et al., 2019): first training the model on an auxiliary task (syntax or others) and then on the target task (SRL). Using syntactic parsing as the intermediate task brings clear improvements, but it is slightly worse than the MTL scheme. We also explore a masked language model (MLM) intermediate objective (Devlin et al., 2019) as a baseline, using the raw texts of the UD treebanks. Though it can slightly improve the results, the gains are much smaller than those due to syntax.
• GCN uses syntax as an input. We stack a graph convolutional network (GCN) (Kipf and Welling, 2017) between the encoder and the decoders to encode input dependency trees, adopting the architecture of Marcheggiani and Titov (2017); a simplified sketch of such a layer follows this list. Using gold trees in this setting outperforms the MTL strategy. However, when using predicted syntax, error propagation reduces the observed benefit. The MTL scheme is thus an attractive alternative strategy considering its competitive performance and model simplicity.
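A much-simplified sketch of such a syntactic GCN layer, in the spirit of Marcheggiani and Titov (2017): messages flow along dependency edges in both directions plus self-loops, with scalar edge gates. Per-label parameters and batching from the original architecture are omitted, and names are illustrative:

```python
import torch
import torch.nn as nn

class SyntacticGCNLayer(nn.Module):
    """One graph-convolution step over a dependency tree."""
    def __init__(self, dim):
        super().__init__()
        dirs = ("in", "out", "self")
        # Separate transforms for head->dependent ("in"), dependent->head
        # ("out"), and self-loop edges, each with a scalar edge gate.
        self.w = nn.ModuleDict({d: nn.Linear(dim, dim) for d in dirs})
        self.gate = nn.ModuleDict({d: nn.Linear(dim, 1) for d in dirs})

    def forward(self, h, heads):
        # h: (n, dim) word representations; heads[i] = index of word i's
        # syntactic head (-1 for the root).
        out = torch.zeros_like(h)
        for i in range(h.size(0)):
            msgs = [("self", h[i])]
            if heads[i] >= 0:
                msgs.append(("in", h[heads[i]]))       # message from i's head
            msgs += [("out", h[j]) for j, hj in enumerate(heads) if hj == i]
            for d, src in msgs:                        # gated aggregation
                g = torch.sigmoid(self.gate[d](src))
                out[i] = out[i] + g * self.w[d](src)
        return torch.relu(out)
```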

Main Results
The test results are listed in Table 4. Similar to the trends on the development sets, including syntactic signals brings clear improvements, especially for the more distant Finnish. Using XLM-R, which is pre-trained on more data than mBERT, is also helpful, upon which syntax can still bring further benefits. We also compare with the results of Fei et al. (2020a), who translate and project source SRL instances to target languages for training. The translation-based method performs strongly for German, French, and Spanish. Considering that German and French are commonly used languages in machine translation research, the availability of high-quality translation systems may be one of the contributing factors. Our syntax-enhanced models are generally competitive for the other languages. It would be interesting to further explore the combination of translation and syntax in future work.

Table 4: UPB test Arg-F1(%) scores in the English-to-others zero-shot setting (averaged over five runs).

Varying Treebank Sizes
We further vary the number of syntax trees available for the auxiliary parsing task, sampling subsets of trees from each treebank (both source and target) and again including them in training. The results indicate that we do not need the full treebanks to obtain good results. Especially with XLM-R, 1K trees from each language can already lead to gains comparable to the 10K case.

FiPB
Similar to the experiments on UPB, we take English SRL annotations from EWT as the source. FiPB adopts (almost) the same argument role set as the English one, and we use a shared SRL decoder for both languages. In preliminary experiments, we found that this sharing strategy performs better than using separate, language-specific decoders. For syntax, we again take the corresponding English and Finnish treebanks from UD v1.4.

Results
The main results on FiPB are listed in Table 3. In the lowest-resource scenario (0.1K Finnish SRL sentences), both English SRL and syntax are quite helpful, and combining them leads to further improvements. The trend is similar if given 1K target SRL annotations, but the gaps decrease. Finally, when given enough target training instances as in the 10K scenario, the gains due to extra resources (either English SRL or syntax) are negligible. In this case, the model may have already learned most of the patterns from rich target SRL annotations.

Varying Training Sizes
We further vary both syntax and target-SRL training sizes, and the influence on model performance is shown in Figure 2. Here, all models are trained using all English SRL data and varying amounts of Finnish SRL sentences. The numbers in parentheses on the y-axis show the F1 scores of baseline models without syntax. As expected, syntax is more helpful when we have less target SRL data and more syntactic resources (towards the right corner of the figure). When we have more target SRL annotations, syntactic resources become less helpful. Nevertheless, in low-resource scenarios, even small quantities of syntactic annotation can bring clear improvements.

Analysis
We further perform analysis on the development results in the 1K case, as shown in Table 5. In the first group of role-label breakdowns, adding syntax particularly helps core arguments, while adding English SRL helps more on non-core arguments; combining both leads to the best results overall. In the second group, we break down arguments by their syntactic distance to the predicates.
The results show that syntactic supervision is still beneficial when the predicate and the argument are two edges apart (d=2). However, when the syntactic distance is larger, direct syntactic supervision becomes less helpful.

Table 5: Analysis (F1% breakdown) on the FiPB development set (1K setting). The first block shows breakdowns by argument role, the second by syntactic distance between predicate and argument words, and the third by the syntactic path between them. The numbers in parentheses denote percentages. Bold and underlined numbers indicate the best and second-best results, respectively.
In the third group, we look at the labeled syntactic paths between the arguments and the predicates. For example, "nmod←" denotes that the argument is a syntactic modifier of the predicate with the dependency relation "nmod", while "acl→" denotes that the argument is the syntactic head of the predicate with the dependency relation "acl". We show results for the ten most frequent paths, which cover around 80% of all arguments. According to the breakdown, syntactic supervision helps more on edges of subject, direct object, and some functional relations (like copula), while English SRL is more beneficial on the more semantic links, such as adverbial words and clauses. This agrees with our analysis of the argument roles: syntax helps more on core arguments, which are usually directly connected as subjects or objects, while English SRL helps more on "ArgM"s, which tend to be adverbial.
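For concreteness, the labeled path between an argument and its predicate can be read off a dependency tree as in the following sketch (heads/labels arrays; function names are illustrative, and the ASCII arrows mirror the notation above):

```python
def path_to_root(i, heads):
    chain = [i]
    while heads[chain[-1]] >= 0:
        chain.append(heads[chain[-1]])
    return chain

def labeled_path(arg, pred, heads, labels):
    # heads[i]: index of word i's head (-1 = root); labels[i]: i's relation.
    up_a, up_p = path_to_root(arg, heads), path_to_root(pred, heads)
    lca = next(n for n in up_a if n in set(up_p))  # lowest common ancestor
    # Edges upward from the argument, then downward to the predicate; the
    # syntactic distance d is simply the total number of edges.
    up = [f"{labels[n]}<-" for n in up_a[:up_a.index(lca)]]
    down = [f"->{labels[n]}" for n in reversed(up_p[:up_p.index(lca)])]
    return "".join(up + down)

# E.g. heads=[1, -1], labels=["nsubj", "root"]: labeled_path(0, 1, ...) gives
# "nsubj<-", i.e. the argument is the subject modifier of the predicate.
```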

CoNLL-2009
The original SRL annotations of CoNLL-2009 are based on language-specific syntax, causing the argument head words to disagree with UD conventions. We thus follow Pražák and Konopík (2017) and convert both the syntactic trees and the SRL argument heads to UD-based annotations (see Appendix B for details). The semi-supervised results are shown in Figure 3. The patterns are consistent across all languages and similar to the previous experiments on FiPB: syntax is clearly helpful in low-resource scenarios, but as we gain access to more target SRL annotations, the gaps decrease and finally diminish in the high-resource scenarios.

Using Other Treebanks
We further explore scenarios where we do not directly have syntactic annotations for the target language. Considering that the parsing task can also benefit from cross-lingual transfer, we can utilize treebanks from nearby languages for syntactic supervision. We take Spanish and Catalan (the 0.1K target SRL case) for this analysis; the results are shown in Table 6. We further explore three other Romance languages: French, Italian, and Portuguese. As expected, directly using target-language syntax obtains the best results. Spanish and Catalan, which are closely related languages, benefit each other the most. Nevertheless, compared with the NoSyntax baseline, syntactic information from all of these languages is helpful. This result is of practical interest when transferring to a truly low-resource language where syntactic annotations may also be limited: finding a related language with rich syntactic resources for auxiliary training signals is a promising way to improve performance.

OntoNotes
Finally, we turn to span-based SRL, where extraction of full argument spans is required. Utilizing OntoNotes annotations, we again take English as the source and Chinese or Arabic as the target. Similar to FiPB, the argument roles are compatible with the PropBank-style English roles, and we use a shared SRL decoder for both the source and target languages. We adopt the data splits of the CoNLL-2012 shared task (Pradhan et al., 2012). As with CoNLL-2009, for English and Chinese we convert constituency trees to dependencies with Stanford CoreNLP. For Arabic, we assign dependency trees from the Arabic-NYUAD treebank (Taji et al., 2017) of UD v2.7.

Results
In this experiment, we specifically compare two SRL decoders. The first casts the task as a BIO-based sequence labeling problem; we further add a standard linear-chain conditional random field (CRF) (Lafferty et al., 2001), which we found consistently helpful in preliminary experiments. The second is the two-step decoder described in §2.3. As shown in Table 7, the trends are similar for both Chinese and Arabic. With regard to auxiliary syntactic supervision, we find trends similar to the previous experiments: in low-resource scenarios, syntactic supervision is beneficial for both decoders, but as the availability of target SRL resources increases, the gaps become smaller until they diminish. The more interesting comparisons are between the two decoders: when not using syntactic supervision, their performances are comparable; but when trained with auxiliary signals from syntax, the syntax-aware two-step decoder performs better than the BIO tagger, especially in low-resource cases. Please refer to Appendix C.4 for a more detailed error analysis.

Table 7: OntoNotes Arg-F1(%) scores in English-sourced semi-supervised settings (with different numbers of target SRL training sentences). "BIO" indicates a BIO-based sequence labeling decoder and "TwoStep" denotes the syntax-aware decoding method that first extracts head words and then decides span boundaries.

Syntax with Genre Mismatches
Since English and Chinese OntoNotes also annotate six different genres of text, we further explore scenarios where the syntax and SRL datasets have genre mismatches. We still take all English instances for multilingual training, but split the Chinese corpus by genre: broadcast conversation (bc), broadcast news (bn), magazine (mz), newswire (nw), telephone conversation (tc), and web (wb). We focus on the low-resource scenario where 0.1K Chinese SRL sentences in the target genre are available. The development results are shown in Figure 4. When the genre of the syntactic supervision matches the target SRL genre, the improvements are the largest. Nevertheless, even with genre mismatches, syntax can still be beneficial, especially within similar genres. We further find a positive correlation (Pearson correlation is 0.73; Spearman is 0.78) between these improvements and genre similarities calculated from the centroids of mBERT representations (Aharoni and Goldberg, 2020). This may provide a mechanism for selecting the most beneficial syntactically annotated instances.
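A sketch of how such genre similarities can be computed: mean-pooled mBERT sentence embeddings are averaged into genre centroids and compared by cosine similarity, following the domain-clustering idea of Aharoni and Goldberg (2020). The exact pooling here is our assumption, not a confirmed recipe:

```python
import torch
from scipy.stats import pearsonr, spearmanr
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased").eval()

def genre_centroid(sentences):
    # Mean-pool sub-token states per sentence, then average over sentences.
    embs = []
    for s in sentences:
        enc = tok(s, return_tensors="pt", truncation=True)
        with torch.no_grad():
            h = model(**enc).last_hidden_state   # (1, len, dim)
        embs.append(h.mean(dim=1).squeeze(0))
    return torch.stack(embs).mean(dim=0)

def genre_similarity(cent_a, cent_b):
    return torch.cosine_similarity(cent_a, cent_b, dim=0).item()

# sims: similarity of each source genre's centroid to the target genre;
# gains: observed F1 improvements per source genre. Correlate with:
#   pearsonr(sims, gains); spearmanr(sims, gains)
```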

Cross-lingual SRL
Recently there has been increasing interest in cross-lingual SRL, where SRL annotations from high-resource languages are utilized to help low-resource ones. One straightforward method is data transfer, using either annotation projection (Yarowsky and Ngai, 2001) or translation of source training data (Fei et al., 2020a). The alternative is model transfer, directly applying models trained on source languages to target ones through shared cross-lingual representations (Kozhevnikov and Titov, 2013; Fei et al., 2020b), which is the approach we build upon in this work.

Conclusion
In this work, we provide a comprehensive empirical exploration of the helpfulness of syntactic supervision for cross-lingual SRL. With extensive evaluations across a variety of datasets and settings, we show that auxiliary syntactic signals are generally beneficial, especially in low-resource SRL cases. We hope that this work can shed light on the relations between syntax and SRL in cross-lingual scenarios.

Appendices A Detailed Experiment Settings
A.1 Datasets
Table 8 presents the statistics of the datasets we utilize. The details of each dataset group are described below.
EWT/UPB/FiPB is the group where we assemble the English SRL dataset from the English Web Treebank (EWT) and PropBank v3, and utilize it as the source annotations. We take the Universal Proposition Banks (UPB v1.0) (Akbik et al., 2015, 2016b) and the Finnish PropBank (FiPB) (Haverinen et al., 2015) as the targets. UPB annotates target languages with English PropBank frames and role labels. This allows zero-shot cross-lingual learning, which is our main setting for experiments with UPB. In the UPB experiments only, we assume predicates are given, since there are discrepancies between source and target predicate annotations.
In experiments with FiPB (as well as CoNLL-2009 and OntoNotes), we focus on semi-supervised multilingual scenarios with end-to-end models that perform both predicate identification and argument labeling. FiPB is a collection of semantic frames built on top of the Turku Dependency Treebank (TDT). The frames are Finnish-specific, but the role labels are (almost) the same as the PropBank ones (Arg0, Arg1, ..., ArgM-*); FiPB defines only two additional ArgMs: CSQ (consequence) and PRT (phrasal marker).
CoNLL-2009 provides dependency-based SRL annotations whose argument heads are based on language-specific dependencies. To further encourage shared structures, we convert them to UD-based ones. Details of the conversion are described in Appendix B.
OntoNotes annotates a large corpus in three languages (English, Chinese, and Arabic) with various layers of structural information, from which we take the SRL annotations for our experiments. For English, we utilize the data from Pradhan et al. (2013), while for Chinese and Arabic, we directly use those provided by CoNLL-2012. For all languages, we follow the data splits of CoNLL-2012. Similar to FiPB, the SRL annotations in OntoNotes utilize language-specific frames but compatible argument role sets.

A.2 Hyper-parameters
Unless otherwise specified, we use pre-trained multilingual language models (mBERT or XLM-R) to initialize the encoders and fine-tune the full models in our experiments. The full models have 185M and 285M parameters with mBERT and XLM-R, respectively. For the hyper-parameter settings, we mainly follow common practice and only slightly tune them in preliminary experiments. We apply dropout rates of 0.1 to the encoder and 0.2 to the decoders. We use Adam as the optimizer with an initial learning rate of 2e-5, linearly decayed towards 2e-6 over the course of training. The models are trained for 100K steps, where each step contains a batch of around 1024 tokens. We evaluate the model on the development set every 1K steps, and the best model is selected by validation results. For zero-shot experiments, we simply validate on the source development set. For semi-supervised experiments, we use the target language, but down-sample the original development set to 10% of the target training size. All training and evaluation are performed on one GTX 1080 Ti GPU; training one model usually takes around half a day, while decoding is fast, processing several hundred sentences per second. All results reported in this work are averaged over three runs (for most ablation studies) or five runs (for most test results). The evaluation of arguments follows the standard srl-eval.pl script.

B UD-based Conversion for CoNLL-2009
The SRL annotations of argument heads in CoNLL-2009 are based on Language-Specific Dependency (LSD) trees rather than Universal Dependencies (UD). To convert argument heads between the two syntactic formalisms, we adopt a simple path-based method. Assuming that a predicate p has an argument whose head is a according to the original tree, the conversion aims to find the new head according to the new tree:
1. In the new tree, find the lowest common ancestor c of the predicate p and the original argument head a.
2. Go down from c in the new tree, comparing syntactic paths and descendants with those in the old tree, and take the node corresponding to the original head a as the new argument head.

Figure 5: An example of the conversion between language-specific dependencies (LSD) and universal dependencies (UD). For brevity, we only show the important dependency edges. Here, "ran" is the predicate (P) and the argument head is "in" with LSD and "park" with UD. The conversion between the two can be done by comparing the syntactic paths and descendants in the old and new trees.

Table 9: Agreements between Language-Specific Dependencies (LSD) and Universal Dependencies (UD). "UAS" denotes the unlabeled attachment scores when comparing LSD and UD trees, "Arg-Agree" denotes the agreement rates between original argument heads and those converted to UD, and "Roundtrip-Agree" denotes the agreement rates with round-trip conversions: first converting from LSD to UD and then converting back to LSD. (* Czech is a special case where the original argument heads seem to mostly agree with UD.)

Table 9 shows the agreements between the two syntactic formalisms. Although LSD disagrees considerably with UD on overall syntactic attachments (the highest UAS is 60%, for Chinese), the argument head agreement rates are much higher (the lowest is around 70%, for Spanish).

Table 10: CoNLL-2009 target development results with different syntactic formalisms. For syntax, the options are "NoSyn" (no auxiliary syntactic supervision), "LSD" (original language-specific dependencies), and "UD" (universal dependencies). For SRL, the options are "Orig." (original argument heads) and "UD" (argument heads converted according to UD trees). If training with UD-SRL, we adopt a post-processing step and convert the argument heads back to the original ones with LSD for fair comparison. Note that for Czech we do not have results for UD-SRL, since there is no easy way to convert the arguments back to the original ones (which disagree much with LSD and only slightly with UD).

The settings here are otherwise the same as in our main experiments on CoNLL-2009: we take the full English SRL data and 1K target SRL sentences. Table 10 lists the target development SRL results. Similar to our main experiments, syntactic supervision is beneficial for all languages, and this holds for both the original language-specific dependencies and the universal dependencies. Interestingly, using the original syntax trees and argument heads performs best, especially for Spanish and Catalan. Through error analysis, we find that for these two languages, the "LSD+Orig." model is much better than the "UD+UD" model mainly on arguments whose original head word is a preposition (5 F1 points better for Catalan and 3 points for Spanish). The reason might be that prepositional word types appear more frequently than content words like nouns and proper nouns, and may be easier to extract when adopting LSD and using prepositions as argument heads, especially in low-resource scenarios.
Though UD seems slightly less effective than the original LSD in this experiment, we still utilize UD-based annotations (for both syntax and SRL) in our main experiments, considering the potential to extend to more languages. It would also be interesting to explore combinations of different syntactic formalisms, which we leave to future work.
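A sketch of this path-based conversion; the descend-from-LCA step follows our reading of the procedure above and simplifies tie-breaking, so function names and details are illustrative:

```python
def ancestors(i, heads):
    chain = [i]
    while heads[chain[-1]] >= 0:
        chain.append(heads[chain[-1]])
    return chain

def convert_head(pred, old_head, new_heads):
    # Step 1: lowest common ancestor of predicate and old head in the new tree.
    up_a = ancestors(old_head, new_heads)
    up_p = set(ancestors(pred, new_heads))
    lca = next(n for n in up_a if n in up_p)
    # Step 2 (simplified): descend one step from the LCA toward the original
    # head, taking the child on that path as the new argument head.
    idx = up_a.index(lca)
    return up_a[idx - 1] if idx > 0 else lca
```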
C Extra Results

C.1 Semi-supervised Results on UPB
We also experiment with semi-supervised settings on the UPB datasets. We still take English as the source, randomly sample SRL training instances for the target languages, and train the models alongside all the source examples. The results are shown in Figure 6: adding target SRL annotations brings obvious improvements. Nevertheless, including syntactic supervision is still helpful, particularly in low-resource scenarios.

C.2 No Pre-trained Initialization
In the main experiments, we utilize pre-trained multilingual language models to initialize the encoders. Here, we explore the case where no such initialization is performed (taking FiPB as an example). All other settings are the same as before, except that the models are randomly initialized. The training scheme is slightly modified: we perform learning rate warmup for the first 10K steps and increase the maximum learning rate to 1e-4. The results on the development sets are shown in Table 11. Unsurprisingly, the scores are much lower than those with pre-trained models. Interestingly, though both English SRL and syntax can provide improvements in both low-resource and high-resource cases, syntax is much more helpful. A possible reason is that multilingual pre-training provides shared representations across languages, without which the extra supervision from other languages may be much less effective.

Table 11: FiPB development results without pre-trained initialization. "EnSRL" indicates whether English SRL is used, and "Syntax" denotes whether syntax is used.

C.3 Other Languages as Source
In our main experiments, we take English as the source language, since it is usually the language with the most abundant resources. Here, we instead take other languages as the source and English as the target. Specifically, we use FiPB/EWT and OntoNotes for these experiments, with other settings exactly following the main experiments. The development results are shown in Tables 12 and 13. The general trends are very similar to those in the English-as-source experiments: syntax supervision is generally helpful, especially in low-resource scenarios. There are many other interesting settings not covered in this work, such as multi-source transfer and direct transfer among non-English languages, which we leave to future work.

C.4 Error Analysis on OntoNotes
We further perform error analysis on the Chinese and Arabic development sets in the 1K setting. As shown in Figures 7 and 8, syntactic supervision and the syntax-aware TwoStep decoder make fewer errors related to phrasal attachments, span boundaries, and predicate identification. Notice that the first two categories are closely related to syntax, which may explain why syntax-informed models make fewer such errors. In particular, the two-step model trained with syntactic supervision makes the fewest syntax-related errors. Together with its generally better overall F1 scores, this demonstrates the benefit of utilizing syntactic information alongside a suitable syntax-aware model.