Project-then-Transfer: Effective Two-stage Cross-lingual Transfer for Semantic Dependency Parsing

This paper presents the first report on cross-lingual transfer for semantic dependency parsing. We present the insight that there are two different kinds of cross-linguality, namely surface level and semantic level, and try to capture both kinds by combining annotation projection and model transfer of pre-trained language models. Our experiments showed that the performance of our graph-based semantic dependency parser almost achieved the approximated upper bound.


Introduction
Cross-lingual dependency parsing has attracted much attention for its powerful representational capability of grammatical and semantic lexical relations (Zhang and Barzilay, 2015; Guo et al., 2015; Ammar et al., 2016; Zeman et al., 2017, 2018; de Lhoneux et al., 2018; Schuster et al., 2019). Several remarkable contributions have been made in syntactic dependency parsing, especially on Universal Dependencies (UD; Nivre et al. 2016). For example, Kondratyuk and Straka (2019) showed that a single multilingually fine-tuned neural model utilizing a pre-trained language model could successfully parse 75 languages in UD with performance comparable to state-of-the-art parsers.
However, cross-lingual semantic dependency parsing (Oepen et al., 2014, 2015, 2016), whose dependency structures are totally different from syntactic dependencies (shown in Figure 1), has not been explored as far as we know. A reason for this is the lack of parallel graphbanks that cover many languages with consistent annotation policies. One exception is Prague Semantic Dependencies (PSD; Mikulová 2009), which is a treebank of bi-lexical semantic graphs and contains over 30,000 pairs of parallel annotated sentences from the Wall Street Journal in English and Czech. Considering these circumstances, we propose to train semantic dependency parsers by capturing commonalities across languages as a remedy for the absence of massive multilingual graphbanks. Our work draws on the intuition that cross-linguality exists at both the surface level and the semantic level. Accordingly, we leverage a two-stage fashion involving treebank-based transfer and model-based transfer.
Treebank-based transfer, often called annotation projection, is a method of projecting source-language annotations to a target language by using a mapping function such as word alignment. Annotation projection has been reported as a promising approach under truly low-resource settings for UD parsing (Rosa and Mareček, 2018). However, annotation projection often suffers from noise in word alignment (Damonte and Cohen, 2018). For model-based transfer, several studies on transferring contextualized word vectors have reported that it improves parsing performance (Mulcaire et al., 2019; Kondratyuk and Straka, 2019).
Our experiments on the PSD graphbank indicate that the optimal performance can be achieved by incorporating the two-stage transfer. Surprisingly, we observed improvement even when the projected treebank was erroneous. Furthermore, the two-stage transfer method achieved almost upper-bound performance, which was approximated by evaluating the cross-linguality of PSD annotation through the projection. We also provide detailed analyses from both perspectives of cross-linguality.

Related Work
Semantic Dependency Parsing: The topic of semantic dependency parsing has spurred enduring interest (Peng et al., 2017, 2018; Dozat and Manning, 2018; Wang et al., 2019; Kurita and Søgaard, 2019). Much of the current interest lies in higher-level interactions between relations. In parallel with our study, Aminian et al. (2020) show improvement of PSD parsing trained on a cross-lingually projected graphbank with multi-task training of UD parsing as an auxiliary task. Unlike their work, we perform zero-shot training with the UDify pre-trained model to validate the hypothesis of the two different kinds of cross-linguality.
Utilizing Models across Different Graphbanks: Parsing semantic graphs at different semantic abstraction levels was introduced as the CoNLL 2019 shared task (Oepen et al., 2019). Participating teams tackled this problem with methods such as transition-based parsers (Hershcovich et al., 2018; Bai and Zhao, 2019; Lai et al., 2019) and graph-based parsers (Zhang et al., 2019; Koreeda et al., 2019). Small improvements were reported in both approaches, but improving semantic parsing on different semantic graphs remains a difficult problem.

Transfer Strategies
To enable cross-lingual semantic parsing, we focus on two different types of cross-linguality, namely cross-linguality on the surface and cross-linguality in semantics. Cross-linguality on the surface rests on our hypothesis of typological correspondences among most languages. For example, annotation projection, which is a treebank-based transfer, assumes cross-linguality on the surface and projects source-language annotations to a target language. On the other hand, cross-linguality in semantics is based on the assumption that lexical-, phrase-, or sentence-level meaning correspondences may exist among most languages. Recently, multilingual BERT (Devlin et al., 2019; Pires et al., 2019) has directly captured cross-linguality in semantics beyond lexicons through large-scale language model training. Though annotation projection and multilingual pre-trained models each handle either cross-linguality on the surface or in semantics, we argue that neither can utilize both kinds effectively on its own. Hence, we propose a two-stage transfer which incorporates both methods to capture the two kinds of cross-linguality as much as possible. We first introduce the two transfer methods as applied to the PSD graphbank, and then explain our two-stage transfer method, Project-then-Transfer.
Annotation Projection: As mentioned above, this approach focuses on cross-linguality on the surface. We trained a word alignment model on the PSD graphbank, and then projected all annotations in a monolingual graphbank to the other language.
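The projection step can be sketched as follows. This is a minimal sketch under our own simplifying assumptions, not the paper's implementation: the alignment is represented as a one-to-one dict from source to target token indices, and any edge with an unaligned endpoint is simply dropped.

```python
# Sketch of annotation projection for bi-lexical semantic graphs.
# `project_graph` and the toy data below are illustrative assumptions.

def project_graph(edges, alignment):
    """edges: list of (head_idx, dep_idx, label) on the source sentence.
    alignment: dict mapping source token index -> target token index.
    Returns the projected edge list on the target sentence; edges whose
    endpoints are unaligned are dropped."""
    projected = []
    for head, dep, label in edges:
        if head in alignment and dep in alignment:
            projected.append((alignment[head], alignment[dep], label))
    return projected

# Toy example: a 3-token source sentence aligned to a 3-token target one.
src_edges = [(1, 0, "ACT-arg"), (1, 2, "PAT-arg")]
alignment = {0: 0, 1: 1, 2: 2}
print(project_graph(src_edges, alignment))
# → [(1, 0, 'ACT-arg'), (1, 2, 'PAT-arg')]
```

Note how alignment errors propagate: if a token pair is misaligned, every edge touching it is projected to the wrong target token or lost, which is the noise issue discussed above.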
Zero-shot Model Transfer: In this study, we transferred only pre-trained language models, because we aimed to focus more on cross-linguality in semantics. We trained PSD parsers with a multilingual pre-trained model on a monolingual graphbank in PSD, and then applied the monolingually trained parsers to the other language.
Project-then-Transfer: We incorporate both transfer methods by applying them in a two-stage fashion, as shown in Figure 2. First, we prepared multilingually projected PSD graphbanks. We automatically generated PSD annotations on English sentences in a multilingual parallel corpus using the English PSD parser introduced above, which was created in the zero-shot approach. Using bilingual word alignment, we projected the PSD annotations on English to the other languages. We finally trained Project-then-Transfer models on a concatenated graphbank of both original and projected PSD.

Setup and Implementations
To perform word alignment, we used the IBM Model 2 aligner fast_align (Dyer et al., 2013).

Table 1: SDP scores for each model and approach. U and L stand for "unlabeled" and "labeled", respectively. P, R, and F stand for "precision", "recall", and "F1-score", respectively. LF/UF is a proxy metric of label prediction accuracy. Bold values represent the best scores. Li et al. (2019) is the best PSD parser at the CoNLL 2019 shared task.
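The metrics reported in Table 1 can be illustrated with a small sketch computing precision, recall, and F1 over edge sets; `prf` is our own simplified stand-in for what the mtool scorer computes, not its actual code. "Unlabeled" compares (head, dependent) pairs, while "labeled" also compares the relation label.

```python
# Sketch of the SDP evaluation metrics: P/R/F1 over predicted vs. gold
# edges, in both labeled and unlabeled variants.

def prf(gold, pred):
    """gold, pred: sets of edges. Returns (precision, recall, F1)."""
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

gold = {(1, 0, "ACT-arg"), (1, 2, "PAT-arg")}
pred = {(1, 0, "ACT-arg"), (1, 2, "RSTR")}    # one label is wrong

lp, lr, lf = prf(gold, pred)                                     # labeled
up, ur, uf = prf({e[:2] for e in gold}, {e[:2] for e in pred})   # unlabeled
print(lf, uf, lf / uf)  # → 0.5 1.0 0.5
```

Here both edges are structurally correct (UF = 1.0) but one label is wrong, so LF/UF = 0.5; this is why LF/UF serves as a proxy for label prediction accuracy.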
To perform model-based transfer, we mainly used graph-based parsers, but we also used a transition-based parser for comparison. Our graph-based PSD parser employed the UDify architecture (https://github.com/Hyperparticle/UDify; pre-trained models are also available from the link) (Kondratyuk and Straka, 2019). We replaced the activation function of the biaffine attention layers in UDify with a sigmoid activation (Dozat and Manning, 2018). We trained two variants of graph-based parsers and a transition-based parser:

Graph-BERT: We trained it with multilingual BERT as Kondratyuk and Straka (2019) did.
Graph-UDify: We trained it with UDify's pre-trained language model (we did not utilize the biaffine and MLP layers of UDify) instead of multilingual BERT. Since UDify is pre-trained on many languages in UD, we expect it to capture more cross-linguality on the surface than BERT.

Transition-BERT: We used the architecture introduced by Che et al. (2019) (https://github.com/DreamerDeo/HIT-SCIR-CoNLL2019), which was the best transition-based parser in the CoNLL 2019 shared task (Oepen et al., 2019). We trained it from scratch with the same hyperparameters given by the source code.
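The sigmoid biaffine scorer used in our graph-based parsers can be sketched in miniature. The tiny dimensions, hand-set weights, and pure-Python arithmetic below are our own illustrative assumptions; a real implementation learns the weight matrix. The key point is that, unlike a softmax over candidate heads (which forces a single head per token, i.e. a tree), the sigmoid scores each head-dependent pair independently, allowing the multiple heads a semantic dependency graph requires.

```python
import math

# Minimal sketch of a biaffine edge scorer with sigmoid activation
# (after Dozat and Manning, 2018): each (dependent, head) pair gets a
# score sigmoid(h_dep^T U h_head), and pairs above 0.5 become edges.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def biaffine_edges(deps, heads, U, threshold=0.5):
    """deps, heads: lists of token vectors; U: weight matrix (list of rows).
    Returns the set of (dep_idx, head_idx) pairs scored above threshold."""
    edges = set()
    for i, d in enumerate(deps):
        for j, h in enumerate(heads):
            score = sum(d[a] * sum(U[a][b] * h[b] for b in range(len(h)))
                        for a in range(len(d)))
            if sigmoid(score) > threshold:
                edges.add((i, j))
    return edges

vecs = [[1.0, 0.0], [0.0, 1.0]]
U = [[2.0, -2.0], [-2.0, 2.0]]   # hand-set: favors same-direction pairs
print(biaffine_edges(vecs, vecs, U))  # the two "diagonal" pairs survive
```

Because each pair is thresholded independently, a token may end up with zero heads or several, which is exactly the behavior needed for bi-lexical semantic graphs.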
The pre-trained multilingual BERT (https://github.com/google-research/bert/blob/master/multilingual.md) was downloaded via the above parser implementations. We used mtool (https://github.com/cfmrp/mtool) to evaluate SDP scores (Oepen et al., 2014) as metrics for parsing performance. A list of the best hyperparameters is available in the Appendix. We added a "tag loss" weight w, a constant multiplied by the loss of relation label predictions. We divided the PSD graphbank into three splits, namely a train set (30,000 pairs), a dev set (2,000 pairs), and a test set (3,653 pairs). We selected the best models by monitoring the labeled F1-score of SDP on the dev set of the target language and evaluated the scores on the test set of the target language. We chose Parallel Universal Dependencies (PUD; Zeman et al. 2017) as additional multilingual parallel corpora for Project-then-Transfer, because they contain 1,000 parallel sentences for 18 languages, with mostly consistent UD annotations. Further details are in the Appendix.

Table 1 shows the SDP scores for each model in each approach. First, we focus on the cross-linguality of PSD annotations through annotation projection. Unlabeled scores of projection models were within a range of 0.4-0.5. Since the alignment error rate (AER) of English-Czech is reported to be around 0.25 (Legrand et al., 2016), edge projection accuracy can be estimated as (1 − AER)^2 ≈ 0.56: since one edge connects two nodes A and B, edge projection accuracy is the probability that both projected nodes A' and B' are correct. The annotation agreement rate of relations, obtained by simply dividing the unlabeled scores by this estimated projection accuracy, fluctuates between 0.7 and 0.9 depending on how well alignment errors are mitigated. The annotation agreement rate of relation labels was estimated at about 0.75 by comparing unlabeled and labeled scores (LF/UF of the fast_align model). These rates can be regarded as upper bounds on performance.
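The back-of-envelope upper-bound estimates above can be checked directly:

```python
# With AER ≈ 0.25, an edge survives projection only if BOTH of its
# endpoints are aligned correctly, so edge projection accuracy is
# approximately (1 - AER)^2.

aer = 0.25
edge_proj_acc = (1 - aer) ** 2
print(round(edge_proj_acc, 4))  # → 0.5625, i.e. ~0.56

# Dividing the observed unlabeled scores (0.4-0.5) by this estimate
# gives the estimated annotation agreement rate of relations.
for unlabeled in (0.4, 0.5):
    print(round(unlabeled / edge_proj_acc, 2))  # → 0.71, then 0.89
```

The two printed ratios bracket the 0.7-0.9 agreement range quoted in the text.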

Results and Discussion
By comparison with monolingual training (Li et al., 2019), the unlabeled F1 of Project-then-Transfer (UDify) was about 85% of that of monolingual training. This ratio is within the range of the estimated annotation agreement rate of relations.
Models and Approaches Comparison: As we can see from Table 1, the graph-based models outperformed Transition-BERT; in particular, the graph-based UDify models demonstrated superiority over the other models. Graph-UDify with the Project-then-Transfer approach, which is the best model, achieved an unlabeled F1-score of 82.5, which is close to the upper bounds estimated above. In addition, there were few differences in LF/UF scores, which are also close to the upper bound of relation label prediction. Thus, our best model achieved high performance, close to the theoretical upper bounds. This indicates that bi-lexical relations captured by syntactic dependency are also helpful for parsing semantic dependency, yet some information remains that was not captured by UDify.
We claim that the missing information is related to cross-linguality on the surface, and we perform a deeper analysis of this in the following paragraphs.
What is NOT captured by Pre-trained Models?: Figure 3 shows examples of gold and Graph-UDify (cs2en) outputs. The annotation projection managed to project the unlabeled relations in the source language completely, but a swap occurred between two relations, namely "REG" and "PAT-arg", which was caused by alignment errors.
We observed that parsers based on model-based transfer often failed to parse relations that contain functional words. This phenomenon can be observed in Figure 3c. Relations containing functional words tended to be successfully converted by the annotation projection. Hence, we obtained better results with the Project-then-Transfer approach, as shown in Figure 3d. This implies that pre-trained models, including UDify, represent semantic bi-lexical relations rather than grammatical ones.
We performed a further analysis on the cross-linguality of the UDify model. Figure 4 shows relation accuracy for each of four UPOS types, namely noun, verb, num, and adp. We calculated conditional "unlabeled" relation accuracy, conditioned on whether the source or target word belongs to the specific UPOS type. Focusing on the accuracy of num (numeral) and adp (adposition), which are considered hard-to-contextualize examples, the annotation projection outperformed the zero-shot approach. The Project-then-Transfer approach improved the accuracy for almost all UPOS types, including num and adp.
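The conditional relation accuracy described above can be sketched as follows; the data structures (unlabeled edge sets and a token-index-to-UPOS dict) are our own simplification of the analysis, not the paper's code.

```python
from collections import defaultdict

# Sketch of conditional "unlabeled" relation accuracy per UPOS type:
# a gold edge counts toward a UPOS type if its head OR dependent word
# carries that tag, and it is correct if the same (head, dep) pair was
# predicted.

def upos_relation_accuracy(gold_edges, pred_edges, upos):
    """gold_edges, pred_edges: sets of (head, dep); upos: index -> tag."""
    total, correct = defaultdict(int), defaultdict(int)
    for head, dep in gold_edges:
        for tag in {upos[head], upos[dep]}:   # count each tag once per edge
            total[tag] += 1
            if (head, dep) in pred_edges:
                correct[tag] += 1
    return {t: correct[t] / total[t] for t in total}

upos = {0: "NOUN", 1: "VERB", 2: "NUM"}
gold = {(1, 0), (1, 2)}
pred = {(1, 0)}    # the edge touching the NUM token was missed
print(upos_relation_accuracy(gold, pred, upos))
```

In this toy case the missed edge touches the NUM token, so num accuracy drops to 0.0 while noun stays at 1.0, mirroring the kind of per-UPOS gap Figure 4 reports.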
Cross-linguality on the Surface: Table 5 shows SDP scores for models trained on English PSD with projected PUD Czech. The performances were comparable to the zero-shot approach, but lower than those of the Project-then-Transfer approach. Hence, the multilingually projected treebank is essential for improving performance. This implies that cross-linguality on the surface can be captured by training on a multilingually projected treebank.

Conclusion
This paper described transfer methods for cross-lingual semantic dependency parsing. We showed that both cross-linguality on the surface and in semantics are necessary to improve performance. Consequently, we achieved almost the upper-bound performance approximated by the annotation projection. The results encourage us to develop cross-lingual semantic dependency parsers for many languages. We will further explore these models and evaluate cross-linguality across a broader range of languages.

Acknowledgments
Computational resources of the AI Bridging Cloud Infrastructure (ABCI), provided by the National Institute of Advanced Industrial Science and Technology (AIST), were used. We would like to thank the anonymous reviewers for their helpful comments. We also thank Dr. Masaaki Shimizu for arranging the computational resources.

A Appendices
In this section, we provide details of the training setup, analyses of relation accuracy, and multi-task learning results.

A.1 Detail of Training Setups
In this sub-section, we provide the training setups for the reproducibility criteria. Table 3 shows all detailed settings. We did not perform a rigorous automatic hyperparameter search, but did a manual tuning; thus our best hyperparameters may differ from the true best hyperparameters. We tuned hyperparameters in the zero-shot approach, then reused the best hyperparameters in the Project-then-Transfer approach. We trained all models with NVIDIA V100 GPUs on Ubuntu 18.04. Our GPU environment is a mixture of both 32GB and 64GB memories. We obtained the PSD treebank from the Linguistic Data Consortium (LDC; https://catalog.ldc.upenn.edu/LDC2016T10). We converted the original SDP format data to MRP format before training. This SDP-to-MRP graph conversion is lossless.
A.2 Performances on Dev-set

Figure 5 shows expected validation performances. Table 4 and Table 5 show performances on the dev-set. Most scores were consistent with the performances on the test-set; the only exception is the transition-based model. Though over-fitting seemed to occur, its performances are still lower than those of the graph-based models.

A.3 Relation Accuracy for All UPOS Types

Figure 6 shows relation accuracy for all UPOS types. We can see that zero-shot performances of eleven types, namely noun, verb, propn (proper noun), conj (conjunction), pron (pronoun), adv (adverb), punct (punctuation), det (determiner), part (particle), cconj (coordinating conjunction), and x (other), outperformed those of annotation projection, and all content words are included in this group. Zero-shot performances of the other five types, namely num, sym (symbol), adp, sconj (subordinating conjunction), and intj (interjection), were comparable or inferior to those of annotation projection. Because words categorized as intj appeared only a few times in this analysis, we could not draw conclusions regarding intj.

A.4 Multi-Task Learning
We argue that it is natural to perform multi-task learning (MTL) of UD and PSD dependencies when both annotations are available, since UDify's pre-trained model, which is trained on UD annotations, improved the performances. First, we added UD annotations to the PSD treebank with the existing UD parser UDPipe. Our MTL setting shares only the BERT layers; higher layers, including the scalar-mix layers, are distinct. We used UDify "as is" for UD prediction. The loss function for MTL is a simple linear combination of the UDify loss and our PSD model's loss. We show the MTL results in Table 6. Performances were degraded compared to those of non-MTL models. This could be because the UD and PSD annotations are too contradictory for MTL to be effective.
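The combined loss described above can be sketched as follows; the weight names (`ud_weight`, `psd_weight`) are hypothetical hyperparameter names of ours, not from the paper, which only specifies a simple linear combination.

```python
# Sketch of the MTL objective: BERT layers are shared between the two
# tasks, while scalar-mix layers and task heads stay distinct; one
# optimizer step updates the shared encoder with gradients from both
# task losses combined linearly.

def mtl_loss(ud_loss, psd_loss, ud_weight=1.0, psd_weight=1.0):
    """Linear combination of the UD (UDify) loss and the PSD loss."""
    return ud_weight * ud_loss + psd_weight * psd_loss

print(mtl_loss(1.0, 1.0))  # → 2.0 with equal unit weights
```

If the two annotation schemes pull the shared encoder in conflicting directions, as the degraded Table 6 results suggest, no choice of these weights may recover the non-MTL performance.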