XSemPLR: Cross-Lingual Semantic Parsing in Multiple Natural Languages and Meaning Representations

Cross-Lingual Semantic Parsing (CLSP) aims to translate queries in multiple natural languages (NLs) into meaning representations (MRs) such as SQL, lambda calculus, and logic forms. However, existing CLSP models are separately proposed and evaluated on datasets of limited tasks and applications, impeding a comprehensive and unified evaluation of CLSP on a diverse range of NLs and MRs. To this end, we present XSemPLR, a unified benchmark for cross-lingual semantic parsing featuring 22 natural languages and 8 meaning representations, built by examining and selecting 9 existing datasets to cover 5 tasks and 164 domains. We use XSemPLR to conduct a comprehensive benchmark study on a wide range of multilingual language models, including encoder-based models (mBERT, XLM-R), encoder-decoder models (mBART, mT5), and decoder-based models (Codex, BLOOM). We design 6 experiment settings covering various lingual combinations (monolingual, multilingual, cross-lingual) and numbers of learning samples (full dataset, few-shot, and zero-shot). Our experiments show that encoder-decoder models (mT5) achieve the highest performance compared with other popular models, and that multilingual training can further improve the average performance. Notably, multilingual large language models (e.g., BLOOM) are still inadequate for CLSP tasks. We also find that the performance gap between monolingual training and cross-lingual transfer learning remains significant for multilingual models, though it can be mitigated by cross-lingual few-shot training. Our dataset and code are available at https://github.com/psunlpgroup/XSemPLR.


Introduction
Cross-Lingual Semantic Parsing (CLSP) aims to translate queries in multiple natural languages (NLs) into meaning representations (MRs) (Li et al., 2020; Xu et al., 2020a; Dou et al., 2022; Sherborne and Lapata, 2021, 2022). As demonstrated in Figure 1, CLSP covers natural languages for geographically diverse users and various meaning representations, empowering applications such as natural language interfaces to databases, question answering over knowledge graphs, virtual assistants, smart home device control, human-robot interaction, and code generation.
However, current research on CLSP has three drawbacks. First, most existing research focuses on semantic parsing in English (Zelle and Mooney, 1996; Wang et al., 2015; Yu et al., 2018), limiting the development of multilingual information access systems for users in other languages. Second, current datasets have poor coverage of NLs and MRs. Although there are encouraging efforts in developing CLSP models (Li et al., 2020; Dou et al., 2022; Sherborne and Lapata, 2022), their experiments only cover a few NLs and MRs, impeding comprehensive and unified evaluation on a diverse range of tasks. Third, due to the lack of a comprehensive CLSP benchmark, the performance of multilingual language models on CLSP is understudied. Some pretrained language models, such as XLM-R (Conneau et al., 2019) and mT5 (Xue et al., 2020), are proposed to solve cross-lingual tasks, while other large language models, such as Codex (Chen et al., 2021a) and BLOOM (Scao et al., 2022), are designed for code generation. However, little research has focused on evaluating these models on CLSP.
In this paper, we propose XSEMPLR, a unified benchmark for cross-lingual semantic parsing featuring 22 natural languages and 8 meaning representations, as summarized in Table 1. To cover a large variety of languages and meaning representations, we first select 9 high-quality CLSP datasets and then clean and format them in a unified manner. We then conduct a comprehensive benchmarking study on three categories of multilingual language models: pretrained encoder-based models augmented with pointer generators (mBERT, XLM-R), pretrained encoder-decoder models (mBART, mT5), and decoder-based large language models (Codex, BLOOM). To evaluate these models, we design 6 experiment settings covering various lingual combinations and learning sample scales, including Monolingual (and Monolingual Few-shot), Multilingual, and Cross-lingual Zero-shot/Few-shot Transfer.
Our results show that the encoder-decoder model (mT5) yields the best performance on monolingual evaluation compared with other models. We then pick the two models with the best monolingual performance (i.e., mT5 and XLM-R) to conduct few-shot and zero-shot cross-lingual transfer learning from English to other low-resource languages. Results show a significant performance gap between monolingual training (Target NL -> Target NL) and cross-lingual transfer learning (En -> Target NL). Furthermore, we find that this gap can be significantly reduced by few-shot learning on the target NL. We further train these two models in a multilingual setting and find that such training can boost performance in some of the languages, while it usually hurts performance in English. Finally, we test two large language models, Codex (Chen et al., 2021a) and BLOOM (Scao et al., 2022), and find that the cross-lingual transfer performance gap is significant for these two models as well.
Our contributions are summarized as follows: (1) We propose XSEMPLR to unify and benchmark 9 datasets covering 5 tasks, 22 natural languages, and 8 meaning representations for cross-lingual semantic parsing; (2) We perform a holistic evaluation of 3 groups of state-of-the-art multilingual language models on XSEMPLR, demonstrating noticeable performance gaps of cross-lingual transfer models between English and other languages; (3) We show two effective strategies for boosting performance in low-resource languages: multilingual training and cross-lingual transfer learning.

XSEMPLR Benchmark
Figure 2 shows the construction pipeline of XSEMPLR. We first select 9 CLSP datasets according to our design principles. Then, we collect other NLs for the selected datasets. Finally, we clean the datasets by removing outliers and performing alignment between different languages.

Design Principles
We carefully pick 9 datasets from all available semantic parsing datasets to construct XSEMPLR according to two principles. First, the picked datasets need to have high quality: they are either annotated by humans or augmented with careful crafting (Moradshahi et al., 2020), and the translations of user inputs are provided by humans instead of machine translation models. Second, XSEMPLR needs to be comprehensive (Hu et al., 2020), including diverse NLs and MRs for a broad range of tasks and applications.

Data Collection
Table 1 summarizes the characteristics and statistics of the datasets in XSEMPLR. Multilingual ATIS (MATIS) contains user questions for a flight-booking task. We collect the original English questions from ATIS (Price, 1990; Dahl et al., 1994) and add the translations from Xu et al. (2020b). Multilingual GeoQuery (MGeoQuery) contains user questions about US geography. We collect the original English questions from GeoQuery (Zelle and Mooney, 1996) and add other translations (Lu and Ng, 2011; Jones et al., 2012; Susanto and Lu, 2017b). GeoQuery has several MRs available. We collect Prolog and Lambda Calculus from Guo et al. (2020), FunQL from Susanto and Lu (2017b), and SQL from Finegan-Dollak et al. (2018). Multilingual Spider (MSpider) is a human-annotated complex and cross-domain text-to-SQL dataset. We collect Spider (Yu et al., 2018) with English questions and add other NLs from Min et al. (2019) and Nguyen et al. (2020).
Multilingual NLmaps (MNLmaps) is a natural language interface to query the OpenStreetMap database. We collect NLMaps (Lawrence and Riezler, 2016) in English and add translations in German (Haas and Riezler, 2016).
Multilingual Overnight (MOvernight) is a multi-domain semantic parsing dataset in lambda DCS. We include English Overnight (Wang et al., 2015) and add translations from Sherborne et al. (2020). Multilingual Schema2QA (MSchema2QA) is a question answering dataset over Schema.org web data in ThingTalk Query Language. We include training examples with all 11 available languages and pair them with the MR in the corresponding language, following Moradshahi et al. (2020) and Xu et al. (2020a). To make the dataset size comparable to the others, we include 5% of the training set.
MCWQ is a multilingual knowledge-based question answering dataset grounded in Wikidata (Cui et al., 2021). We collect all questions in MCWQ in 4 languages. The split follows maximum compound divergence (MCD) (Keysers et al., 2020), so that the test set contains novel compounds to evaluate compositional generalization ability.
MTOP is a multilingual semantic parsing dataset for task-oriented dialogs, with meaning representations of hierarchical intent and slot annotations (Gupta et al., 2018; Li et al., 2020). We include examples in all 6 languages and pair the translations with the compositional decoupled representation in the corresponding language.
MCoNaLa is a multilingual code generation benchmark for Python extending English CoNaLa (Yin et al., 2018; Wang et al., 2022). We include all 4 languages.

Data Alignment and Unification
We perform data alignment and unification over the 9 datasets to construct a unified high-quality benchmark. Specifically, for the first 6 datasets introduced in Section 2.2, because each of them has multiple parts proposed in different works, we merge these parts by aligning the same user question in different languages to the same meaning representation. For the other 3 datasets, we directly use all samples since no parts need to be merged. We also unify the language of MRs (e.g., adopting a single form of SQL queries and keeping only one English MR when there is more than one in MTOP), and remove a few samples in MATIS and MGeoQuery with no MRs. We provide more details in the Appendix, including examples of each dataset (Table 5), data construction (Appendix A), natural languages (Appendix A), and meaning representations (Appendix A).
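The merging step can be sketched as follows (a minimal illustration, assuming each per-language part stores (example id, question, MR) triples; the real formats vary across the 9 datasets):

```python
from collections import defaultdict

def align(parts):
    """Merge per-language dataset parts into unified records keyed by
    example id: questions sharing an id across languages are aligned to
    the same meaning representation, and samples with no MR are dropped."""
    merged = defaultdict(dict)
    mrs = {}
    for lang, rows in parts.items():
        for ex_id, question, mr in rows:
            merged[ex_id][lang] = question
            if mr:
                mrs[ex_id] = mr
    return [
        {"id": ex_id, "questions": qs, "mr": mrs[ex_id]}
        for ex_id, qs in merged.items()
        if ex_id in mrs  # drop samples with no MR
    ]
```

Each output record then pairs one MR with questions in every available language.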

Evaluation Metrics
We evaluate the predicted results using various automatic metrics. For the Spider dataset, we follow Yu et al. (2018) and use their proposed evaluation tool. For the other datasets, we use exact matching, i.e., token-by-token string comparison, to check whether the prediction is the same as the ground-truth label. For a fair comparison with state-of-the-art models, we also use the metrics proposed by those models, including Execution Score, Denotation Accuracy, and CodeBLEU (Section 4.2).
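A minimal sketch of the exact-matching metric (whitespace tokenization is an assumption for illustration):

```python
def exact_match(prediction: str, reference: str) -> bool:
    """Token-by-token comparison after whitespace normalization."""
    return prediction.split() == reference.split()

def exact_match_accuracy(predictions, references):
    """Fraction of predictions that exactly match their references."""
    matches = [exact_match(p, r) for p, r in zip(predictions, references)]
    return sum(matches) / len(matches)
```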

Data Analysis
Natural Languages XSEMPLR contains diverse and abundant natural languages in both high-resource and low-resource groups: 22 languages belonging to 15 language families (Appendix A). Most state-of-the-art performances are achieved in English and a few other high-resource languages, and the lack of evaluation on low-resource languages leaves open questions about model generalization. Therefore, both types of languages are included in XSEMPLR to form a unified cross-lingual dataset for semantic parsing. Among the 22 languages, English is the most resourced, with many popular semantic parsing datasets. Some languages spoken in Western Europe, such as German and Spanish, are also relatively high-resource. We also include many low-resource languages, such as Vietnamese and Thai.
Meaning Representations XSEMPLR includes 8 meaning representations for different applications: Prolog, Lambda Calculus, Functional Query Language (FunQL), SQL, ThingTalk Query Language, SPARQL, Python, and hierarchical intent and slot. All of them can be executed against underlying databases or knowledge graphs, except for the last one, which is designed for complex compositional requests in task-oriented dialogues. The first four are domain-specific because they contain specific predicates defined for a given domain, while the last four are considered open-domain and open-ontology (Guo et al., 2020). It is also worth noting that these MRs are not equivalent in expressiveness. For example, the ThingTalk Query Language is a subset of SQL in expressiveness (Moradshahi et al., 2020), and FunQL is less expressive than Lambda Calculus, partially due to the lack of variables and quantifiers.

Experiment Setup
We describe our evaluation settings and models for a comprehensive benchmark study on XSEMPLR.

Evaluation Settings
We consider the following 6 settings for training and testing.
Translate-Test. We train a model on the English training data and translate target-NL test data to English using the public Google NMT system (Wu et al., 2016). This setting uses one semantic parsing model trained on English but relies on machine translation models being available for other languages. It serves as a strong yet practical baseline for the other settings.
Monolingual. We train a monolingual model on each target NL's training data. This setting creates one model per target NL. In addition to benchmarking them, we design this setting for two reasons: (

Models
We evaluate three different groups of multilingual language models on XSEMPLR.
Multilingual Pretrained Encoders with Pointer-based Decoders (Enc-PTR). The first group is multilingual pretrained encoders with decoders augmented with pointers. Both encoders and decoders use Transformers (Vaswani et al., 2017). The decoder uses pointers to copy entities from natural language inputs when generating meaning representations (Rongali et al., 2020; Prakash et al., 2020). We use two types of multilingual pretrained encoders, mBERT (Devlin et al., 2018) and XLM-R (Conneau et al., 2019), both trained on web data covering over 100 languages.

Multilingual Pretrained Encoder-Decoder Models (Enc-Dec). The second group is multilingual pretrained encoder-decoder models, mBART (Liu et al., 2020) and mT5 (Xue et al., 2020), which are pretrained on large multilingual corpora and directly generate meaning representations from natural language inputs.

Multilingual Large Language Models (LLMs).
The third group is multilingual large language models based on GPT (Brown et al., 2020), including Codex (Chen et al., 2021a) and BLOOM (Scao et al., 2022). Codex is fine-tuned on publicly available code from GitHub. While it is not trained on a multilingual corpus, it has shown cross-lingual semantic parsing capabilities (Shi et al., 2022b). BLOOM is a 176B-parameter multilingual language model pretrained on 46 natural and 13 programming languages from the ROOTS corpus (Laurençon et al., 2022). We mainly use these models to evaluate few-shot in-context learning without any further fine-tuning. Specifically, we append 8 samples and the test query to predict the MR. For Monolingual Few-shot, the samples and the query are in the same NL, while for Cross-lingual Zero-shot Transfer, the samples are in English and the query is in the target NL.
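A minimal sketch of this prompting setup (the "question:"/"meaning representation:" template and function names are illustrative assumptions, not the exact format used in the experiments):

```python
def build_prompt(demonstrations, query, n_shots=8):
    """Assemble an in-context learning prompt: n_shots demonstration
    (question, MR) pairs followed by the test question, leaving the
    final meaning representation for the model to complete."""
    lines = []
    for question, mr in demonstrations[:n_shots]:
        lines.append(f"question: {question}")
        lines.append(f"meaning representation: {mr}")
    lines.append(f"question: {query}")
    lines.append("meaning representation:")
    return "\n".join(lines)
```

For the cross-lingual zero-shot setting, the demonstrations would be English pairs while the query is in the target NL.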

Results and Analysis
Table 2 shows the performance of all 6 models under the 6 settings. Our results and analysis aim to answer the following research questions:

Analysis of Monolingual
We obtain the following main finding in the Monolingual setting: Enc-Dec (mT5) obtains the best performance.
Next, we evaluate the mT5 model in the Translate-Test setting. As shown in the table, mT5 in the Monolingual setting outperforms Translate-Test by a large margin (58.16 vs. 36.74). This shows that multilingual language models are more effective than translate-test methods; in other words, it is worth training a multilingual model even when a high-quality translation system is available.

Comparison with SOTA
Table 3 compares the performance of mT5 in the Monolingual setting with the previous state of the art. Some previous works use denotation accuracy and execution accuracy, which differ from the exact match we use. To make our results comparable, we apply the evaluation tools of previous works to XSEMPLR. As shown in the table, Enc-Dec (mT5) outperforms previous work on all NLs of the MSchema2QA, MCWQ, MNLmaps, and MATIS datasets and obtains comparable results on the others.

Analysis of Codex and BLOOM
We evaluate Codex and BLOOM to test the in-context learning performance of large language models. As shown in

Comparison between Few-shot Settings
We also test the Enc-Dec (mT5) and Enc-PTR (XLM-R) models on two types of few-shot experiments, including Monolingual and Cross-lingual Few-Shot.
As can be seen, mT5 with cross-lingual few-shot training outperforms monolingual few-shot training by a large margin of 22.21 exact match points (excluding MCoNaLa), while XLM-R has a smaller gain of 15.12. We can summarize two observations: 1) pretraining on the English NL can significantly boost few-shot performance on target NLs (En + Target Few-shot -> Target NL), and 2) the model with higher cross-lingual capability gains more improvement, e.g., mT5 gains more than XLM-R. Both observations demonstrate the capability of cross-lingual models to transfer knowledge from the source to the target NLs.

Analysis of Multilingual Training
We compare the performance of the Monolingual and Multilingual settings and further explore the reasons for the performance drop in multilingual training. As shown in Figure 3, most of the major NLs obtain a performance gain, except that English performance drops in 7 datasets and gains in only 3. This is known as the "Curse of Multilinguality" (Pfeiffer et al., 2022). Similarly, in CLSP, the performance of English (a high-resource NL) is more likely to drop in multilingual training.

Cross-lingual Performance Gap
To examine the transfer ability of cross-lingual models, we investigate the performance difference between the Monolingual and Cross-lingual Few/Zero-shot settings for each dataset using mT5. As shown in Figure 4, by examining the distance between the green and orange lines, we find that for the zero-shot setting, the cross-lingual transfer performance gap is significant, even larger than 50% on the NLmaps dataset, demonstrating the limitation of current cross-lingual models. However, by examining the difference between the orange and blue lines, we also find that using even 10% of the samples in the target data shrinks the transfer gap rapidly. The few-shot gap usually shrinks to around half of the zero-shot gap, e.g., on the Schema2QA dataset. For MATIS, the gap even shrinks to around 5 points, which is very close to the performance of the monolingual setting. Chinese/German has the largest/smallest performance loss for transfer learning. Additionally, performance and pretraining data size show no evident correlation.

Analysis over Natural Languages
We pick the best model, mT5, and analyze its performance in the zero-shot setting in Figure 5. Results show that the performance gap between Chinese transfer learning (En -> Zh) and English monolingual training (En -> En) is usually the largest compared with transfer learning on other NLs. On the other hand, German usually has the smallest transfer performance loss. There are probably two reasons. First, mT5's pretraining data contains less Chinese than German. Second, English is closer to German in language family (IE: Germanic) than to Chinese (Sino-Tibetan). This phenomenon is discussed in Hu et al. (2020), and we find that the conclusion also holds for cross-lingual semantic parsing tasks.

Analysis over Meaning Representations
Table 4 shows the performance of mT5 on various MRs in MGeoQuery. In almost all languages, FunQL outperforms the other three meaning representations, and SQL obtains the worst performance. This is consistent with the observation of Guo et al. (2020). We speculate two possible reasons: (1) the grammar of SQL is more complex than the others, while FunQL enjoys a much simpler grammar (Li et al., 2022), and (2) FunQL contains a number of brackets that provide structural information to the models (Shu et al., 2021).


Related Work
Semantic Parsing Datasets Many semantic parsing datasets have been proposed, such as GeoQuery (Zelle and Mooney, 1996), ATIS (Finegan-Dollak et al., 2018), Overnight (Wang et al., 2015), and Spider (Yu et al., 2018). Cross-lingual semantic parsing datasets are usually constructed by translating the English user questions into other languages (Dou et al., 2022; Athiwaratkun et al., 2022), e.g., the Chinese version of GeoQuery created by Lu and Ng (2011). However, existing CLSP datasets follow different formats and are independently studied as separate efforts. We aim to provide a unified benchmark and modeling framework to facilitate systematic evaluation and generalizable methodology.
Multilingual Language Models There has been significant progress in multilingual language models. MUSE (Conneau et al., 2017) aligns monolingual word embeddings in an unsupervised way without using any parallel corpora. XLM (Lample and Conneau, 2019) is a pretrained language model based on RoBERTa (Liu et al., 2019) which offers cross-lingual contextualized word representations. Similarly, mBERT is developed as the multilingual version of BERT (Devlin et al., 2018). XLM-R (Conneau et al., 2019) outperforms mBERT and XLM on sequence labeling, classification, and question answering. Focusing on sequence-to-sequence tasks such as machine translation, mBART (Liu et al., 2020) extends BART by introducing multilingual denoising pretraining, and mT5 (Xue et al., 2020) extends T5 by pretraining on the multilingual dataset mC4. Multilingual large language models such as BLOOM (Scao et al., 2022) and XGLM (Lin et al., 2022) have also been proposed. From multilingual embeddings to multilingual large language models, representations have become more effective and more languages have been covered (Srivastava et al., 2022). We aim to systematically evaluate these models on CLSP, which is understudied by existing work.
Cross-lingual NLP Benchmarks Cross-lingual benchmarks have been established for many NLP tasks. XNLI is a large-scale corpus aimed at providing a standardized evaluation set (Conneau et al., 2018). Hu et al. (2020) developed XTREME to evaluate how well multilingual representations in 40 languages generalize. XGLUE is another benchmark for evaluation on various cross-lingual tasks (Liang et al., 2020). MLQA (Lewis et al., 2019), XQuAD (Artetxe et al., 2019), and XOR QA (Asai et al., 2020) are three evaluation frameworks for cross-lingual question answering. For cross-lingual information retrieval (Zbib et al., 2019; Oard et al., 2019; Zhang et al., 2019; Shi et al., 2021; Chen et al., 2021b), Sun and Duh (2020) introduce CLIRMatrix by collecting multilingual datasets from Wikipedia. For cross-lingual summarization, NCLS was built by Zhu et al. (2019) to jointly address summarization and translation. Nonetheless, there is no standardized and unified benchmark for CLSP, and thus we are unable to calibrate the performance of popular multilingual language models on CLSP.

Conclusion
We build XSEMPLR, a unified benchmark for cross-lingual semantic parsing with multiple natural languages and meaning representations. We conduct a comprehensive benchmark study on three representative types of multilingual language models. Our results show that mT5 with monolingual training yields the best performance, while, notably, multilingual LLMs are still inadequate for cross-lingual semantic parsing tasks. Moreover, the performance gap between monolingual training and cross-lingual transfer learning is still significant. These findings call for both improved semantic parsing capabilities of multilingual LLMs and stronger cross-lingual transfer learning techniques for semantic parsing.

Limitations
While we cover a wide range of factors in cross-lingual semantic parsing (e.g., tasks, datasets, natural languages, meaning representations, domains), we cannot include all possible dimensions along these aspects. Furthermore, we focus on the linguistic generalization ability for semantic parsing because the questions are translated from English datasets. In the future, we will explore questions raised by native speakers of each language to study model ability under variations in cultural backgrounds and information-seeking needs.

A Data Construction Details
In this section, we introduce the details of data collection, natural languages, meaning representations, and dataset statistics.

A.1 Data Collection
Multilingual ATIS ATIS (Price, 1990; Dahl et al., 1994) contains user questions for a flight-booking task. The original user questions are in English. We add the translations in Spanish, German, French, Portuguese, Japanese, and Chinese from Xu et al. (2020b). Furthermore, Upadhyay et al. (2018) provide translations in Hindi and Turkish, but only for a subset of utterances. Susanto and Lu (2017a) provide translations in Indonesian and Chinese, and Sherborne et al. (2020) provide translations in Chinese and German, but neither is available through LDC; therefore, we do not include these. For meaning representations, we focus on the task of natural language interfaces to databases and thus collect SQL from Iyer et al. (2017) and Finegan-Dollak et al. (2018), while other formats are available, such as logical forms (Zettlemoyer and Collins, 2012) and BIO tags for slots and intents (Upadhyay et al., 2018). To unify SQL formats across datasets, we rewrite the SQL queries following the format of Spider (Yu et al., 2018). We follow the question splits from Finegan-Dollak et al. (2018). Through manual inspection, we discard 52 examples which do not have aligned translations from Xu et al. (2020b). This gives 5228 examples with 4303 training, 481 dev, and 444 test.
Multilingual GeoQuery GeoQuery (Zelle and Mooney, 1996) contains user questions about US geography. The original user questions are in English. One of the earliest works on cross-lingual semantic parsing is the Chinese version of GeoQuery created by Lu and Ng (2011). Later, Jones et al. (2012) create German, Greek, and Thai translations, and Susanto and Lu (2017b) create Indonesian, Swedish, and Farsi translations. We include all these 8 languages. Furthermore, GeoQuery has several meaning representations available. To include multiple meaning representations, we collect Prolog and Lambda Calculus from Guo et al. (2020), FunQL from Susanto and Lu (2017b), and SQL from Finegan-Dollak et al. (2018). To unify SQL formats across datasets, we rewrite the SQL queries following the format of Spider (Yu et al., 2018). We follow the question splits from Finegan-Dollak et al. (2018). Through manual inspection, we discard 3 examples that do not have corresponding FunQL representations. This gives 874 examples with 548 training, 49 dev, and 277 test.
Multilingual Spider Spider (Yu et al., 2018) is a human-annotated complex and cross-domain text-to-SQL dataset. The original Spider uses English utterances and database schemas. To include utterances in other languages, we add the Chinese version (Min et al., 2019) and the syllable-level Vietnamese version (Nguyen et al., 2020). In this way, each SQL query is paired with a database schema in English and an utterance in three languages. Because the test set is not public, we include only the training and dev sets. We also exclude GeoQuery examples from its training set because we use the full version of GeoQuery separately. This gives 8095 training examples and 1034 dev examples, following the original splits (Yu et al., 2018).

Multilingual NLmaps NLMaps (Lawrence and Riezler, 2016) is a natural language interface to query the OpenStreetMap database about geographical facts. The original questions are in English, and Haas and Riezler (2016) later provide translations in German. The meaning representation is the Functional Query Language designed for OpenStreetMap, which is similar to the FunQL of GeoQuery. We follow the original split with 1500 training and 880 test examples.
Multilingual Overnight Overnight (Wang et al., 2015) is a multi-domain semantic parsing dataset with lambda DCS logical forms executable in SEMPRE (Berant et al., 2013). The questions cover 8 domains: Calendar, Blocks, Housing, Restaurants, Recipes, Publications, Social, and Basketball. The original dataset is in English, and Sherborne et al. (2020) provide translations in German and Chinese. They use machine translation for the training set and human translation for the dev and test sets. We include the Baidu translations for Chinese and the Google translations for German. We merge all the domains into a single dataset and follow the original split with 8754 training, 2188 dev, and 2740 test examples.
MCWQ MCWQ (Cui et al., 2021) is a multilingual knowledge-based question answering dataset grounded in Wikidata. It is created by adapting the CFQ (Compositional Freebase Questions) dataset (Keysers et al., 2019), translating the queries into SPARQL for Wikidata. The questions are in four languages: Hebrew, Kannada, Chinese, and English. The split follows maximum compound divergence (MCD), so that the test set contains novel compounds to test compositional generalization ability. We follow the MCD3 splits with 4006 training, 733 dev, and 648 test examples.
Multilingual Schema2QA Schema2QA (Xu et al., 2020a) is an open-ontology question answering dataset over scraped Schema.org web data with meaning representations in ThingTalk Query Language. Moradshahi et al. (2020) extend the original dataset with utterances in English, Arabic, German, Spanish, Farsi, Finnish, Italian, Japanese, Polish, Turkish, and Chinese. The questions cover 2 domains: hotels and restaurants. The training examples are automatically generated based on template-based synthesis, crowdsourced paraphrasing, and machine translation. The test examples are crowdsourced and manually annotated by an expert with human translations. We include training examples in all 11 available languages and pair the translations with the query in the corresponding language. To make the dataset size comparable to the others, we include 5% of the training set. This gives 8932 training examples and 971 test examples. We also include a no-value version of the query, because the entities in the translated utterances are localized to the new languages and thus do not align well with the values in English queries.
MTOP MTOP (Li et al., 2020) is a multilingual task-oriented semantic parsing dataset with meaning representations based on hierarchical intent and slot annotations (Gupta et al., 2018). It covers 11 domains: Alarm, Calling, Event, Messaging, Music, News, People, Recipes, Reminder, Timer, and Weather. It includes 6 languages: English, German, French, Spanish, Hindi, and Thai. We include examples in all 6 languages and pair the translations with the compositional decoupled representation in the corresponding language. This gives 5446 training, 863 dev, and 1245 test examples.
MCoNaLa MCoNaLa (Wang et al., 2022) is a code generation benchmark which requires generating Python code. It collects English examples from the CoNaLa (Code/Natural Language Challenge; Yin et al., 2018) dataset and further annotates a total of 896 NL-code pairs in three languages: Spanish, Japanese, and Russian. The training and dev sets contain 1903 and 476 English examples, respectively.

A.3 Meaning Representation Details
Prolog uses first-order logic augmented with higher-order predicates for quantification and aggregation. Lambda Calculus is a formal system for computation; it represents all first-order logic and naturally supports higher-order functions with constants, quantifiers, logical connectors, and lambda abstraction. FunQL is a variable-free language that encodes compositionality using nested function-argument structures. SQL is the query language based on relational algebra for handling relations among entities and variables in databases. ThingTalk Query Language (Xu et al., 2020a) and hierarchical intent and slot (Gupta et al., 2018) are recently proposed for question answering on the Web and task-oriented dialogue state tracking, respectively. Python is a high-level, general-purpose programming language whose design philosophy emphasizes code readability through significant indentation.
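As a schematic illustration only (these forms are simplified sketches, not taken verbatim from the datasets), a GeoQuery-style question such as "What is the capital of Texas?" might look as follows in four of the MRs:

```
Prolog:           answer(C, (capital(S, C), const(S, stateid(texas))))
Lambda Calculus:  (lambda $0 (capital texas:s $0))
FunQL:            answer(capital(stateid('texas')))
SQL:              SELECT capital FROM state WHERE state_name = 'texas'
```

The contrast shows, for instance, FunQL's variable-free nested function-argument style versus the explicit variables of Prolog and Lambda Calculus.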

A.4 Dataset Statistics
Figure 6 shows the statistics of the dataset. As can be seen, the top 3 NLs with the most samples in XSEMPLR are English, Chinese, and German, while the top 3 MRs are Lambda Calculus, SQL, and ThingTalk.

B Experiment Details
We introduce the training settings and input/output format for all experiments and settings in this section.

B.1 Training Settings
Table 5: Examples of each dataset in XSEMPLR, illustrating its diverse languages and meaning representations. ATIS: Portuguese-SQL, Geoquery: Farsi-Prolog, Spider: Vietnamese-SQL, NLmaps: German-FunQL, Overnight: English-Lambda Calculus, MCWQ: Hebrew-SPARQL, Schema2QA: Arabic-ThingTalk Query Language, MTOP: Hindi-Hierarchical Intent and Slot, MCoNaLa: Japanese-Python (the query タプルdataを空白区切りで表示する, "display the tuple data separated by whitespace," paired with the code for i in data: print(' '.join(str(j) for j in i))).

For experiments on the LSTM model (Table 7), we use OpenNMT as the implementation. For Transformer-PTR models, we use PyTorch. For the Codex and BLOOM models, we use the OpenAI API and the Huggingface API, respectively, and for the mT5 and mBART models, we use Huggingface. For each model, we train 300 epochs on MGeoQuery, due to its smaller number of training instances, and 100 epochs on the rest of the datasets. The learning rate is chosen from {1e-5, 3e-5, 5e-5, 1e-4} according to a parameter search on the dev set.
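The grid search over learning rates can be sketched as follows (a minimal illustration; `train_and_eval` is a hypothetical stand-in for a full training run that returns the dev-set score):

```python
def pick_learning_rate(train_and_eval, candidates=(1e-5, 3e-5, 5e-5, 1e-4)):
    """Return the candidate learning rate with the highest dev-set score.

    `train_and_eval` is a hypothetical callable that trains the model with
    a given learning rate and returns its dev-set score.
    """
    best_lr, best_score = None, float("-inf")
    for lr in candidates:
        score = train_and_eval(lr)
        if score > best_score:
            best_lr, best_score = lr, score
    return best_lr

# Mock dev-set scores for illustration only:
mock_scores = {1e-5: 52.1, 3e-5: 58.4, 5e-5: 57.0, 1e-4: 49.3}
best = pick_learning_rate(lambda lr: mock_scores[lr])  # best == 3e-05
```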
For Codex and BLOOM, the maximum length of the generated sequence is set to 256 tokens. For Codex, the temperature is set to 0. For BLOOM, if the generated result does not contain a complete MR, we append the generated result to the input and generate again, repeating this process until the generated result is complete. However, we allow at most 5 API calls per sample; after 5 calls, we use the generated result as the final result. We use default settings for the rest of the parameters.
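The iterative completion procedure for BLOOM can be sketched as follows (a simplified sketch; `generate` and `is_complete_mr` are hypothetical stand-ins for the Huggingface API call and the MR completeness check):

```python
def generate_until_complete(generate, is_complete_mr, prompt, max_calls=5):
    """Keep calling the generation API, feeding partial output back into
    the prompt, until the result contains a complete MR or the call
    budget (5 calls in our experiments) is exhausted."""
    text, result = prompt, ""
    for _ in range(max_calls):
        chunk = generate(text)      # one API call; returns newly generated text
        result += chunk
        if is_complete_mr(result):
            break
        text += chunk               # append the partial output and retry
    return result                   # after max_calls, keep whatever we have

# Example with a mock two-step generation:
parts = iter(["SELECT name", " FROM city ;"])
result = generate_until_complete(
    lambda t: next(parts),
    lambda s: s.strip().endswith(";"),
    "query: show all city names\n",
)
```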
We run the models on 8 RTX A6000 GPUs; training takes from hours to several days depending on the data size. The model architectures from Huggingface are mT5-large, mBART-large, and mBERT-base. For Codex and BLOOM, we use code-davinci-002 (which has since been deprecated) and bigscience/bloom. The batch size is set to 16 for training mT5/mBART and 32 for training Transformer-PTR models.

B.2 Input/Output Format
For input to the Transformer-PTR models, we directly feed the query into the model. For MSpider, we append the schema to the end of the sequence in the format "[CLS] Query [SEP] Schema name [SEP] ...". As for the output, we scan the tokens in the label and replace those that appear in the source text with "@ptrN", where "N" is a natural number giving the index of the token in the source text. We remove the "FROM" clause in SQL. In this way, the pointer network can easily identify which tokens are copied from the source, for instance "[CLS] select count ( @ptr19 )". The expected output is the MR with a starting symbol "#".
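The pointer substitution can be sketched as follows (a minimal sketch over whitespace tokens; the actual implementation operates on the tokenizer's subword units):

```python
def pointerize(source_tokens, label_tokens):
    """Replace each label token that also appears in the source with
    "@ptrN", where N is the token's index in the source sequence."""
    first_index = {}
    for i, tok in enumerate(source_tokens):
        first_index.setdefault(tok, i)   # keep the first occurrence
    return [f"@ptr{first_index[t]}" if t in first_index else t
            for t in label_tokens]

src = "[CLS] how many cities are in texas [SEP]".split()
out = pointerize(src, "select count ( city ) where state = texas".split())
# "texas" appears at source index 6, so it becomes "@ptr6"
```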

B.3 Experiment Path
The experiments are done in the following order: we first evaluate 2 Enc-PTR and 2 Enc-Dec baseline models in the Monolingual setting. Then, we pick the two of them with the best performance and evaluate them in all the other settings. Finally, we evaluate LLMs using in-context learning in the two finetuning-free settings.

C Results and Discussions
This section lists the results for each NL and MR, compares with SOTA, discusses training data size and few-shot learning, and presents an error analysis.

C.1 Results for Each NL and MR
We list some of the results of our models on various datasets and languages and compare the gold MRs with the predictions generated by mT5. We classify the errors into 4 types:
• Syntax error: the prediction contains a syntax error; in other words, the SQL engine cannot parse the prediction because of grammar issues.
• Token error: one of the two types of semantic errors. The prediction contains wrong column names, values (such as strings and numbers), or operators (not including keywords).
• Structure error: the other type of semantic error. The prediction has a wrong structure, meaning that some SQL keywords are incorrect or missing.
• Incorrect Exact Match: although exact match shows the prediction differs from the gold one, the execution results are the same.
As shown in Table 6, most of the errors are semantic errors (64.27%), in which structure errors are around twice as frequent as token errors (41.42% vs. 22.85%). Syntax errors and incorrect exact matches each account for around 18% of the errors.

Figure 1 :
Figure 1: Overview of Cross-Lingual Semantic Parsing over various natural languages and meaning representations.

Figure 3 :
Figure 3: Effect of multilingual training with mT5 on different NLs. The X-axis lists the NLs included in at least two datasets. The Y-axis is the number of datasets on which the performance of that NL increases/decreases after multilingual training. The performance of English (a high-resource NL) is more likely to drop under multilingual training.

Figure 4 :
Figure 4: The performance of cross-lingual Few/Zero-shot (mT5) on different datasets and languages. MGeoQuery/* indicates a single MR; MGeoQuery is the score averaged across 4 MRs. Adjacent grey circles differ by 10 points, and the center of the circle indicates a score of 0. The cross-lingual transfer performance gap is significant in the zero-shot setting; however, few-shot training shrinks this gap greatly.

Figure 5 :
Figure 5: Left vertical axis: the performance of cross-lingual zero-shot mT5 models on different datasets over different languages. Larger dots indicate higher accuracy. Right vertical axis: the red line indicates the percentage of each language in the mT5 pretraining data. Chinese/German has the largest/smallest performance loss under transfer learning. Additionally, performance and pretraining data size show no evident correlation.
translate GeoQuery English queries to create a Chinese version. Min et al. (2019) and Nguyen et al. (2020) create Chinese and Vietnamese translations of Spider, respectively.

Figure 6 :
Figure 6: Distribution of 22 natural languages and 8 meaning representations. Each bar shows the number of samples summed across all datasets.

(Fragment of an in-context prompt example:)
# the types of the policy used by the customer named " Dayana Robel ".
# The information of tables : ---- 6 Tables Ignored ----
# Translation results are as follows :

Figure 7 :
Figure 7: Exact Matching (EM) scores on the MGeoQuery dataset using mT5 as a monolingual learner.

Table 1 :
Datasets in XSEMPLR. We assemble 9 datasets in various domains for 5 semantic parsing tasks, covering 8 meaning representations. The questions cover 22 languages in 15 language families. The Train/Dev/Test columns indicate the number of MRs, each paired with multiple NLs.

Table 2 :
Results on XSEMPLR. We consider 6 settings: 2 Monolingual, 1 Multilingual, 2 Cross-lingual, and 1 Translate-Test. Each number is averaged across the different languages in that dataset. † Codex/BLOOM are evaluated in only two settings, as we apply 8-shot in-context learning without finetuning the model parameters. ‡ Two settings are not applicable to MCoNaLa because it has no training set for NLs other than English. ⋆ Translate-Test performance on MSchema2QA and MTOP is especially low because the MRs of these datasets also contain tokens in the target languages.

As shown in Table 2, the LLMs (Codex and BLOOM) are outperformed by the mT5 model by a large margin in both the Few-shot (11.79/24.60) and Zero-shot (15.13/28.39) settings. This suggests that multilingual LLMs are still inadequate for cross-lingual semantic parsing tasks.

(Continuation of the MSpider input format: "... [SEP] Table 1 [SEP] Table 2 ...", where each table is represented by "table name.column name". We add "table name.*" to each table to represent all columns.)
(Han et al., 2022). Specifically, we concatenate 8 pairs of examples and a query as the input. For MSpider, we additionally list the schema information, including the table names and column names, for each example. It is worth noting that the number of in-context examples for BLOOM decreases to 4 on the MATIS dataset and to 1 on the MSpider dataset because the number of tokens would otherwise exceed the input limit. An example of MSpider input is listed as follows:

Table name is : Songs . The table columns are as follows : SongId , Title
# 1. Table name is : Albums . The table columns are as follows : AId , Title , Year , Label , Type
# 2. Table name is : Band . The table columns are ---- 3 Tables Ignored ----
# 6. Table name is : Vocals . The table columns are as follows : SongId , Bandmate , Type
# Translation results are as follows :
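Assembling such an in-context prompt can be sketched as follows (a hypothetical reconstruction of the format; the wording of the real prompts may differ):

```python
def build_prompt(examples, query, schema=None):
    """Concatenate (NL, MR) example pairs, optional schema information,
    and the test query into a single in-context learning prompt."""
    lines = []
    for nl, mr in examples:
        lines.append(f"# {nl}")
        lines.append(mr)
    if schema is not None:  # MSpider: list table and column names
        for i, (table, columns) in enumerate(schema, start=1):
            lines.append(f"# {i}. Table name is : {table} . "
                         f"The table columns are as follows : {' , '.join(columns)}")
    lines.append(f"# {query}")
    lines.append("# Translation results are as follows :")
    return "\n".join(lines)

prompt = build_prompt(
    [("show the names of all singers", "SELECT name FROM singer")],
    "how many albums are there ?",
    schema=[("Songs", ["SongId", "Title"]), ("Albums", ["AId", "Title"])],
)
```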

Table 6: Error analysis on the MGeoQuery English test set. The MR is SQL.

Tables 7, 8, 9, 11, and 10 show the Monolingual performance of LSTM, mBERT+PTR, XLM-R+PTR, mBART, and mT5, respectively. Tables 12, 13, 14, and 15 show the Monolingual Few-Shot performance of XLM-R+PTR, mT5, Codex, and BLOOM. Tables 16 and 17 show the Multilingual performance of XLM-R+PTR and mT5. Tables 18, 19, 20, and 21 show the Cross-lingual Zero-Shot Transfer performance of XLM-R+PTR, mT5, Codex, and BLOOM. Tables 22 and 19 show the Cross-lingual Few-Shot Transfer performance of XLM-R+PTR and mT5.

C.2 Training Data Size and Few-shot Learning
Figure 7 displays the Exact Matching (EM) scores averaged across all languages on the MGeoQuery dataset, where each line represents a meaning representation and each dot on a line represents a few-shot experiment using that meaning representation. The X-axis is the percentage of data used to train the model. The results show that performance is largely influenced by the number of samples in the training set: performance can be as high as 70% given sufficient data, while training on 10% of the training data may lead to a score of 0. Moreover, among all four MRs, the performance of FunQL increases most steadily, showing its robustness.

C.3 Error Analysis
We conduct an error analysis on the MGeoQuery dataset. First, we select the English split with the SQL MR.