Frustratingly Simple but Surprisingly Strong: Using Language-Independent Features for Zero-shot Cross-lingual Semantic Parsing

The availability of corpora has led to significant advances in training semantic parsers for English. Unfortunately, for languages other than English, annotated data is limited, and so is the performance of the developed parsers. Recently, pretrained multilingual models have proven useful for zero-shot cross-lingual transfer in many NLP tasks. What else does it take to apply a parser trained on English to other languages for zero-shot cross-lingual semantic parsing? Will simple language-independent features help? To this end, we experiment with six Discourse Representation Structure (DRS) semantic parsers for English, and generalize them to Italian, German and Dutch, where only a small number of manually annotated parses are available. Extensive experiments show that despite its simplicity, adding Universal Dependency (UD) relations and Universal POS tags (UPOS) as model-agnostic features yields surprisingly strong improvements on all parsers.


Introduction
Semantic parsing is the task of transducing natural language to meaning representations, which in turn can be expressed through many different semantic formalisms, including Discourse Representation Theory (DRT; Kamp and Reyle, 2013), Abstract Meaning Representation (AMR; Banarescu et al., 2013), and so on. However, manually annotating meaning representations in a new language is a painstaking process, which explains why only a few datasets are available for different formalisms in languages other than English. For instance, in the case of DRT, even in the latest release of the Parallel Meaning Bank (PMB; v3.0, Abzianidze et al., 2017), there is no gold training data available for Italian and Dutch, and the few annotated sentences are used as development and test data. How can one train a parser when training data is unavailable? This work answers this question by asking what is required for zero-shot cross-lingual semantic parsing: learning a semantic parser on English data and testing it on other languages.
Prior research on cross-lingual semantic parsing leveraged machine translation techniques to map semantics from one language to another (Damonte and Cohen, 2018). However, these methods require parallel corpora to extract automatic alignments, which are often noisy or not available at all. Other work exploited parameter-shared models based on language-independent representations. In particular, cross-lingual word embeddings and pretrained multilingual models (Hu et al., 2020) have been used in various cross-lingual NLP tasks, including semantic parsing (Oepen et al., 2020; Sherborne et al., 2020). However, pretrained multilingual models are computationally expensive, while cross-lingual word embeddings have inferior performance. To this end, we propose adding simple language-independent features for zero-shot cross-lingual semantic parsing: lightweight extensions that apply to any model, rather than new architectures with high time and space complexity.
To test our hypothesis, we focus on the PMB, where sentences in English, German, Italian and Dutch are annotated with their meaning representations. The annotations in the PMB are based on Discourse Representation Theory (DRT; Kamp and Reyle, 2013). Figure 1 shows an example DRT annotation for the sentence "Ich bin schläfrig." ("I am sleepy."). A Discourse Representation Structure (DRS) is a nested structure with unary and binary predicates representing semantic roles, alongside logic operators (e.g. ¬) and discourse relations (e.g. CONTINUATION). In this example, one can see that although a parser can capture the coarse meaning of the whole sentence without UD, it struggles with some lexical-level meaning; with UD, however, it successfully identifies "schläfrig" as the event, based on its UD relation tag "root".
We carry out our experiments on six DRS parsers to test different architectures (LSTM vs. Transformer) as well as different decoding strategies (sequential vs. coarse-to-fine). Whereas the original parsers use a sequential neural encoder with monolingual representations, we experiment with cross-lingual representations (i.e. cross-lingual word embeddings and a pretrained multilingual encoder) and language-independent features (i.e. UPOS, UD relations and UD structures). We also replace the sequential encoder with tree-based encoders in order to assess whether modelling syntax is beneficial. Results show that adding UD relations and UPOS as features, despite its frustrating simplicity, leads to surprisingly strong zero-shot cross-lingual semantic parsers, even when UD features are the only encoder input. UD features also further boost the performance of strong pretrained multilingual models. Surprisingly, small, well-designed, non-pretrained coarse-to-fine decoding models with UD relations and UPOS even outperform large pretrained multilingual models in some languages.
A DRS is usually represented in a nested 'box' structure. However, the systems we reference in this paper all use a linearization of this box representation. To avoid any confusion, we only show examples of linearized DRSs throughout the paper.

Models
Our models are all encoder-decoder architectures that take as input a natural language sentence $S = s_1 \ldots s_{|S|}$ and output a linearized DRS $L$ as a sequence of tokens $y_1 \ldots y_{|L|}$. The models differ in the particular encoder or decoder used, as described in the remainder of this section.
Coarse-to-fine Decoding Models: Our first set of models is based on the coarse-to-fine encoder-decoder architecture of Liu et al. (2018).
Encoder. We first experiment with a BiLSTM encoder (C2F-BiLSTM). In cross-lingual settings, we concatenate cross-lingual embeddings with UD relation embeddings and, optionally, Universal POS embeddings as model input. To model the dependency structure directly, we also test a child-sum tree-LSTM (Tai et al., 2015) as an alternative to the BiLSTM (C2F-TreeLSTM), where each word in the input sentence corresponds to a node in the dependency tree. However, completely discarding word order and context in the tree-LSTM might hurt performance. We therefore combine the tree-LSTM and the BiLSTM to obtain C2F-Bi/TreeLSTM, where the tree-LSTM inputs are initialized using the last layer of a BiLSTM (Chen et al., 2017).
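As an illustration, the sketch below shows how such feature concatenation can be implemented in PyTorch; module names, embedding dimensions and vocabulary sizes are illustrative and do not correspond to our released implementation.

```python
import torch
import torch.nn as nn

class FeatureAugmentedEncoder(nn.Module):
    """BiLSTM encoder over cross-lingual word embeddings concatenated with
    UD relation (and optionally UPOS) embeddings. Dimensions are illustrative."""

    def __init__(self, pretrained_word_emb, n_deprels, n_upos,
                 deprel_dim=50, upos_dim=25, hidden_dim=300, use_upos=True):
        super().__init__()
        # Fixed cross-lingual word embeddings (e.g. MUSE), not updated during training.
        self.word_emb = nn.Embedding.from_pretrained(pretrained_word_emb, freeze=True)
        # Language-independent feature embeddings, randomly initialized and trained.
        self.deprel_emb = nn.Embedding(n_deprels, deprel_dim)
        self.upos_emb = nn.Embedding(n_upos, upos_dim) if use_upos else None

        input_dim = pretrained_word_emb.size(1) + deprel_dim + (upos_dim if use_upos else 0)
        self.bilstm = nn.LSTM(input_dim, hidden_dim, num_layers=2,
                              bidirectional=True, batch_first=True)

    def forward(self, word_ids, deprel_ids, upos_ids=None):
        feats = [self.word_emb(word_ids), self.deprel_emb(deprel_ids)]
        if self.upos_emb is not None and upos_ids is not None:
            feats.append(self.upos_emb(upos_ids))
        # Concatenate along the feature dimension and encode with the BiLSTM.
        x = torch.cat(feats, dim=-1)
        outputs, _ = self.bilstm(x)
        return outputs  # (batch, seq_len, 2 * hidden_dim)
```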
Decoder. At decoding time we follow Liu et al. (2018) in reconstructing the linearized DRS representation in three coarse-to-fine steps, each conditioned on the previous one: first, we predict the outer DRS tags, which correspond to the semantic environment (e.g. the 'boxes') that predicates will be placed in; the system then predicts unary and binary predicates, and in the last step their arguments. The decoder also makes use of a copying mechanism to predict those predicates that are also lemmas in the input sentence (e.g. "schläfrig"). We refer the reader to the original paper for more details.

Sequence Decoding Models: Our second set of models follows the lossless DRS linearization of van Noord et al. (2018). Instead of a bracketed representation, predicates, logic operators and discourse relations, along with their arguments, are represented as a sequence of tokens separated by a special symbol "|||". Each variable argument is represented as a relative index pointing to its referent, if already introduced. We refer the reader to the original paper for more details.
To generate such linearized meaning representations, we use either a word-level LSTM encoder-decoder model with a copying mechanism (LSTM-N) or, alternatively, following prior work, a Transformer encoder-decoder architecture with a copying mechanism (Transformer-N). Given the recent competitive performance of pretrained multilingual models on zero-shot NLP tasks (Hu et al., 2020), we also experiment with XLM-R, a state-of-the-art pretrained multilingual model. To adapt this NLU model to our encoder-decoder semantic parser, we use XLM-R to initialize the encoder and randomly initialize the Transformer decoder (XLM-R-Enc-N).
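The sketch below illustrates one possible way to combine a pretrained XLM-R encoder with a randomly initialized Transformer decoder using the HuggingFace transformers library; the hyper-parameters, output projection and class names are illustrative, not a description of our exact implementation.

```python
import torch
import torch.nn as nn
from transformers import XLMRobertaModel

class XLMREncoderSeq2Seq(nn.Module):
    """Pretrained XLM-R encoder with a randomly initialized Transformer decoder
    that predicts linearized DRS tokens (vocabulary size is illustrative)."""

    def __init__(self, drs_vocab_size, d_model=768, n_layers=12, n_heads=8):
        super().__init__()
        self.encoder = XLMRobertaModel.from_pretrained("xlm-roberta-base")
        self.tgt_emb = nn.Embedding(drs_vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.out_proj = nn.Linear(d_model, drs_vocab_size)

    def forward(self, input_ids, attention_mask, tgt_ids):
        # Contextual source representations from the pretrained encoder.
        memory = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        tgt = self.tgt_emb(tgt_ids)
        # Causal mask so each target position only attends to earlier positions.
        tgt_len = tgt_ids.size(1)
        causal = torch.triu(torch.full((tgt_len, tgt_len), float("-inf"),
                                       device=tgt_ids.device), diagonal=1)
        hidden = self.decoder(tgt, memory, tgt_mask=causal,
                              memory_key_padding_mask=~attention_mask.bool())
        return self.out_proj(hidden)  # (batch, tgt_len, drs_vocab_size)
```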

Language-Independent Features
In order to make the model directly transferable to the German, Italian and Dutch test data, we use (1) UD relations and structures (D), based on UD parses for English, German, Italian and Dutch, and (2) Universal POS tags (P) (Petrov et al., 2011). Both features are extracted using UDPipe (Straka and Straková, 2017).
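For illustration, the following sketch reads UPOS tags and UD relations (together with head indices) from a CoNLL-U file such as the output of UDPipe; the file path and the returned data structure are illustrative.

```python
def read_upos_deprel(conllu_path):
    """Read UPOS tags and UD relations (plus head indices) for each sentence
    from a CoNLL-U file, e.g. the output of UDPipe. Assumes one token per line,
    tab-separated, with columns ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, ...
    """
    sentences, current = [], []
    with open(conllu_path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:                            # blank line ends a sentence
                if current:
                    sentences.append(current)
                    current = []
                continue
            if line.startswith("#"):                # sentence-level comments
                continue
            cols = line.split("\t")
            if "-" in cols[0] or "." in cols[0]:    # skip multiword/empty tokens
                continue
            form, upos, head, deprel = cols[1], cols[3], int(cols[6]), cols[7]
            current.append({"form": form, "upos": upos, "head": head, "deprel": deprel})
    if current:
        sentences.append(current)
    return sentences
```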
All models are based on one of the following cross-lingual representations: (1) cross-lingual word embeddings (W), for which we use MUSE (Conneau et al., 2017), or (2) the pretrained multilingual model XLM-R.

Using Language-Independent Features in Coarse-to-fine Decoding Models
To explore the roles of various language-independent features in zero-shot cross-lingual semantic parsing, we use C2F-BiLSTM as a baseline and compare it to C2F-TreeLSTM and C2F-Bi/TreeLSTM. We also conduct ablation studies on the features used. As shown in Table 1, we find that: (1) UD relation features are crucial for zero-shot cross-lingual semantic parsing. Adding UD relations significantly improves the performance of all three coarse-to-fine decoding models in all three languages, compared to using cross-lingual word embeddings alone. Models using UD relation embeddings alone (D) also perform well, given that performance does not drop much after removing the cross-lingual word embeddings. However, modeling UD structure via tree encoders does not consistently help zero-shot cross-lingual semantic parsing.
(2) UPOS features further boost performance. After adding UPOS (P), all coarse-to-fine models perform even better, reaching state of the art, though the improvement is not as large as that from adding UD relations to cross-lingual word embeddings. Detailed preprocessing and evaluation settings are given in the Appendix.

Using UD Relations in All Models
In Table 1, we also examine whether the large improvement from UD relations is model-agnostic by adding them to all six baseline models in German, Italian and Dutch. We find that UD relations lead to consistent and robust improvements. Results on PMB 2.1 show that adding UD relation features improves performance for all three languages and all six models. Although the XLM-R-Enc-N model performs better than non-pretrained models when only cross-lingual embeddings are used, simple non-pretrained coarse-to-fine models with UD features outperform this large pretrained multilingual model in German and Italian. Considering that non-pretrained models require far fewer parameters and much less training/inference time (Table 2), using UD features in carefully designed non-pretrained models has clear advantages over larger pretrained multilingual models. On PMB 3.0, UD relation features likewise improve performance for all models and languages, and the improvement is significant even for the pretrained multilingual model. The improvement is also consistent across the different evaluation protocols and DRS linearizations used for PMB 2.1 and 3.0.

Error Analysis
We use C2F-BiLSTM, the best cross-lingual model on PMB 2.1, to perform an error analysis assessing the quality of the predictions for operators (i.e. logic operators like "Not" as well as discourse relations like "Contrast"), unary non-lexical predicates (e.g. time(t)), binary non-lexical predicates (e.g. Agent(e,x)), and lexical predicates (e.g. open(e)). Results in Table 3 show that predicting operators and binary predicates is hard compared to the other two categories. Prediction of lexical predicates is relatively good even though most tokens in the test set were never seen during training. This can be attributed to the copying mechanism, which transfers tokens directly from the input during prediction.
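For reference, the sketch below shows a generic pointer-generator style copy distribution over an extended vocabulary of source lemmas; it illustrates the general idea rather than the exact copy mechanisms used by the parsers above, and all names and shapes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CopyGenerator(nn.Module):
    """Mixes a generation distribution over the DRS vocabulary with a copy
    distribution over aligned source lemmas (standard pointer-generator recipe)."""

    def __init__(self, hidden_dim, vocab_size, extended_vocab_size):
        super().__init__()
        self.gen_proj = nn.Linear(hidden_dim, vocab_size)
        self.copy_gate = nn.Linear(hidden_dim, 1)
        self.extended_vocab_size = extended_vocab_size

    def forward(self, dec_state, attn_weights, src_map):
        # dec_state: (batch, hidden); attn_weights: (batch, src_len);
        # src_map: (batch, src_len) indices of source lemmas in the extended vocabulary.
        p_gen = torch.sigmoid(self.copy_gate(dec_state))           # (batch, 1)
        gen_dist = F.softmax(self.gen_proj(dec_state), dim=-1)     # (batch, vocab)
        dist = torch.zeros(dec_state.size(0), self.extended_vocab_size,
                           device=dec_state.device)
        dist[:, :gen_dist.size(1)] = p_gen * gen_dist
        # Scatter attention mass onto the extended-vocabulary positions of the
        # aligned source lemmas (the "copy" part of the distribution).
        dist.scatter_add_(1, src_map, (1 - p_gen) * attn_weights)
        return dist
```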

Related work
Previous work has explored two main methods for cross-lingual semantic understanding. One method requires parallel corpora to extract alignments between source and target languages using machine translation (Padó and Lapata, 2005; Damonte and Cohen, 2017; Zhang et al., 2018), often followed by projection of semantic representations (Reddy et al., 2017). The other method uses parameter-shared models based on cross-lingual representations such as cross-lingual word embeddings (Duong et al., 2017; Susanto and Lu, 2017; Mulcaire et al., 2018; Hershcovich et al., 2019; Cai and Lapata, 2020), pretrained multilingual models (Zhu et al., 2020; Oepen et al., 2020), and universal POS tags (Blloshmi et al., 2020). Recently, Ozaki et al. (2020), Samuel and Straka (2020) and Dou et al. (2020) conducted supervised German DRS parsing with pretrained multilingual models, but they did not explore zero-shot cross-lingual semantic parsing. Moreover, although UD has proven useful in other cross-lingual tasks (Subburathinam et al., 2019), it has been under-explored in cross-lingual semantic parsing.

Conclusion
This work proposes using simple language-independent features for zero-shot cross-lingual semantic parsing. We show that simple UD and UPOS features can significantly improve the performance of cross-lingual semantic parsers based on coarse-to-fine decoding techniques or pretrained multilingual models. In the future, we plan to use such features for other semantic formalisms (e.g. AMR) and other languages (e.g. Chinese).

A Monolingual DRS Parsing
For completeness, along with the results for the cross-lingual task, we also report results for monolingual English semantic parsing in Table 4. Unlike other work on the PMB, Liu et al. (2018) do not deal with presupposition due to constraints in converting DRS meaning representations to tree-based representations. In PMB 2.1, presupposed variables are extracted from a main box and included in a separate one. We revert this process so as to ignore presupposed boxes. Similarly, we also do not deal with sense tags. For a fair comparison, all other models use the same preprocessed meaning representations in PMB 2.1.
In PMB 3.0, presuppositions and sense tags are taken into account; thus, the coarse-to-fine decoding models cannot be used.
Note that lexical predicates in the PMB are in English, even for non-English languages. Since this is not compatible with the copying mechanism in the coarse-to-fine decoding models, LSTM-N, and Transformer-N, in PMB 2.1 we revert predicates to the original language by substituting them with the lemmas of the tokens they are aligned to. In PMB 3.0, we do not perform this reversion; this is compatible with XLM-R-Enc-N, which has no copying mechanism.
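The following sketch illustrates this reversion step; the alignment and vocabulary data structures (`alignments`, `lemmas`, `predicate_vocab`) are hypothetical placeholders for the PMB alignment information.

```python
def revert_predicates(drs_tokens, alignments, lemmas, predicate_vocab):
    """Replace English lexical predicates in a linearized DRS with the lemmas of
    the source tokens they are aligned to (PMB 2.1 setting). `alignments` maps
    DRS token positions to source token indices; all names here are illustrative.
    """
    reverted = []
    for i, tok in enumerate(drs_tokens):
        if tok in predicate_vocab and i in alignments:
            # Substitute the English predicate with the aligned source-language lemma,
            # so the copy mechanism can generate it directly from the input sentence.
            reverted.append(lemmas[alignments[i]])
        else:
            reverted.append(tok)
    return reverted
```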

C Model Details
Coarse-to-fine models are based on the coarse-to-fine encoder-decoder architecture of Liu et al. (2018). In order to be used as input to the parser, Liu et al. (2018) first convert the DRS into tree-based representations, which are subsequently linearized into PTB-style bracketed sequences. We use the same conversion. For further details about the conversion and the model, we refer the reader to the original paper.
LSTM-N, Transformer-N and XLM-R-Enc-N are based on van Noord et al. (2018)'s lossless linearization of DRS meaning representations and the corresponding parser. Clauses in a DRS are represented as a sequence without changing their order, where a special symbol "|||" starts a new clause and variables in clauses are represented as relative indices. Given sentences and such linearized meaning representations, we use word-level encoder-decoder models to generate parses. We refer the reader to the original paper for further details.
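The sketch below gives a simplified illustration of this style of linearization; the variable-indexing scheme shown is a stand-in and differs in detail from the lossless scheme of van Noord et al. (2018).

```python
import re

def linearize_clauses(clauses):
    """Join DRS clauses into a single token sequence separated by '|||' and
    replace variable names (e.g. x1, e2, b3) with relative indices counting back
    to their first introduction. The indexing scheme here is a simplified
    stand-in, not the exact scheme of van Noord et al. (2018).
    """
    seen = []                                       # variables in order of introduction
    out = []
    for clause in clauses:
        toks = []
        for tok in clause:
            if re.fullmatch(r"[bxest]\d+", tok):    # crude variable detector
                if tok in seen:
                    # Relative index: how many variable introductions back it occurred.
                    toks.append(f"@-{len(seen) - seen.index(tok)}")
                else:
                    seen.append(tok)
                    toks.append("@new")
                continue
            toks.append(tok)
        out.append(" ".join(toks))
    return " ||| ".join(out)

# e.g. linearize_clauses([["b1", "REF", "x1"], ["b1", "Agent", "e1", "x1"]])
```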

D Training Details
In all models, UD relation embeddings (D) and UPOS tag embeddings (P) are randomly initialized and updated during training. Cross-lingual word embeddings are kept fixed during training, while XLM-R is fine-tuned.
We adapted OpenNMT (Klein et al., 2017) for the LSTM-N and Transformer-N models, and used fairseq to implement the XLM-R-Enc-N model.
We tune hyper-parameters manually. For the coarse-to-fine LSTM models, we use a two-layer BiLSTM or tree-LSTM on the encoder side, a dropout rate of 0.5, and the Adam optimizer with a learning rate of 5e-4. For LSTM-N, we use a batch size of 12 and the SGD optimizer with an initial learning rate of 0.7, decayed by 0.7 every 1500 steps. For Transformer-N, we use a 6-layer Transformer, a batch size of 512 tokens, and the Adam optimizer with an initial learning rate of 1e-3, decayed by 0.9 every 1000 steps.
In the XLM-R-Enc-N model, we use the XLM-R base model to initialize the encoder and randomly initialize a 12-layer Transformer decoder, optimized with Adam. We use a larger learning rate on the decoder side to address the discrepancy between the pretrained encoder and the non-pretrained decoder. Specifically, on the encoder side we use a polynomial learning rate scheduler with a maximum learning rate of 2e-5 and 5000 warmup steps; on the decoder side we use a polynomial learning rate scheduler with a maximum learning rate of 5e-5 and 2500 warmup steps. Both are trained for a total of 50,000 update steps. We use a label smoothing rate of 0.1 and a dropout rate of 0.1.
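The sketch below shows one way to realize such per-module learning rates and warmup schedules with PyTorch parameter groups; it assumes the encoder parameters are registered under a module named `encoder` and uses a linear (power 1.0) polynomial decay, both of which are illustrative choices.

```python
import torch

def poly_warmup(warmup_steps, total_steps, power=1.0):
    """Multiplicative LR factor: linear warmup followed by polynomial decay to zero."""
    def factor(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return max(0.0, (1.0 - progress) ** power)
    return factor

def build_optimizer(model, total_steps=50000):
    # Separate parameter groups so the pretrained encoder gets a smaller peak
    # learning rate (2e-5, 5000 warmup steps) than the randomly initialized
    # decoder (5e-5, 2500 warmup steps).
    enc_params = [p for n, p in model.named_parameters() if n.startswith("encoder")]
    dec_params = [p for n, p in model.named_parameters() if not n.startswith("encoder")]
    optimizer = torch.optim.Adam([
        {"params": enc_params, "lr": 2e-5},
        {"params": dec_params, "lr": 5e-5},
    ])
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer,
        lr_lambda=[poly_warmup(5000, total_steps), poly_warmup(2500, total_steps)],
    )
    return optimizer, scheduler
```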
We train all models on a GeForce RTX 2080 GPU. Training time for the coarse-to-fine, Transformer-N and LSTM-N models is within 2 hours each, while training XLM-R-Enc-N takes 5-6 hours. The number of parameters in the coarse-to-fine, Transformer-N and LSTM-N models is similar to that of C2F-BiLSTM shown in Table 2; the number of parameters in XLM-R-Enc-N is also shown in Table 2. We use the English development data for model selection, where validation performance is similar to the performance on the English test set reported in Table 1.