DAMP: Doubly Aligned Multilingual Parser for Task-Oriented Dialogue

Modern virtual assistants use internal semantic parsing engines to convert user utterances to actionable commands. However, prior work has demonstrated multilingual models are less robust for semantic parsing compared to other tasks. In global markets such as India and Latin America, robust multilingual semantic parsing is critical as codeswitching between languages is prevalent for bilingual users. In this work we dramatically improve the zero-shot performance of a multilingual and codeswitched semantic parsing system using two stages of multilingual alignment. First, we show that contrastive alignment pretraining improves both English performance and transfer efficiency. We then introduce a constrained optimization approach for hyperparameter-free adversarial alignment during finetuning. Our Doubly Aligned Multilingual Parser (DAMP) improves mBERT transfer performance by 3x, 6x, and 81x on the Spanglish, Hinglish and Multilingual Task Oriented Parsing benchmarks respectively and outperforms XLM-R and mT5-Large using 3.2x fewer parameters.


Introduction
Task-oriented dialogue systems are the backbone of virtual assistants, an increasingly common direct interaction between users and Natural Language Processing (NLP) technology. Semantic parsing converts unstructured text to structured representations grounded in task actions. Due to the conversational nature of the interaction between users and task-oriented dialogue systems, speakers often use casual register with regional variation. Such variation is an essential challenge for the inclusiveness and reach of virtual assistants which aim to serve a global and diverse userbase (Liu et al., 2021).
In this work, we are motivated by a common form of variation for bilingual speakers (Dogruöz et al., 2021): codeswitching. Codeswitching occurs in two forms which both affect task-oriented dialogue. Inter-sentential codeswitching appears through multilingual requests made by the same user during a dialogue:

"Play all rap music on my iTunes"
"Toca toda la música rap en mi iTunes"

Intra-sentential codeswitching appears when the user switches languages during a single query:

"Play toda la rap music en mi iTunes"

Both forms are used by bilingual speakers (Joshi, 1982; Dey and Fung, 2014) and cause location, primary language preference, and even language identification to be unreliable mechanisms for routing requests to an appropriate monolingual system (Barman et al., 2014). This makes zero-shot codeswitching performance a key feature of system robustness rather than simply a mechanism to reduce annotation costs.
However, zero-shot structured prediction and parsing remain a challenge for state-of-the-art multilingual models (Ruder et al., 2021), highlighting the need for improved methods beyond scale to achieve this goal. Fortunately, as a fundamental property of the task, these linguistically diverse inputs are grounded in a shared semantic output space: each of the queries above corresponds to the same structured parse. The grounded nature of semantic parsing makes cross-lingual alignment natural for the task.
We propose using both contrastive alignment pretraining and a novel constrained adversarial finetuning method to perform double alignment, shown in Figure 1. Our Doubly Aligned Multilingual Parser (DAMP) achieves strong zero-shot performance on both multilingual (inter-sentential) and intra-sentential codeswitched data, making it a robust model for bilingual users without harming English performance. We contribute the following:

1. Alignment Pretraining Effectiveness: We show that while multilingual BERT (mBERT) is ineffective for both categories of codeswitched data, contrastive alignment pretraining with sentence-aligned monolingual data dramatically improves English, multilingual, and intra-sentential codeswitched semantic parsing performance.

2. Constrained Adversarial Alignment: We propose utilizing domain adversarial training to further improve alignment and transferability without labeled or aligned data. We introduce a novel constrained optimization method and demonstrate that it improves over prior domain adversarial training algorithms (Sherborne and Lapata, 2022) and regularization baselines (Li et al., 2018; Wu and Dredze, 2019) without hyperparameter tuning. We also find that alignment causes accidental translation with pretrained decoders, highlighting the advantages of pointer-generator networks.

3. Interpreting Alignment Improvements: We find that the improved parsing ability of DAMP is driven by a 6x improvement in prediction accuracy of the initial intent. Finally, we measure improvements in alignment using a post-hoc linear probe on language prediction, in addition to qualitative analysis of embedding visualizations.
Related Work

Massively multilingual transformers (MMTs) are remarkably robust on multilingual and intra-sentential codeswitching benchmarks (Aguilar et al., 2020; Hu et al., 2020; Ruder et al., 2021). However, the gap between performance on the training language and zero-shot targets is larger in task-oriented parsing benchmarks (Li et al., 2021; Agarwal et al., 2022; Einolghozati et al., 2021), similar to the large discrepancy observed for other syntactically intensive tasks (Hu et al., 2020).
Our work applies the pretraining regime from Hu et al. (2021), which adds multiple explicit alignment objectives to traditional MMT pretraining. We show that this technique is effective both for semantic parsing, a new task, and intra-sentential codeswitching, a new linguistic domain.
Domain Adversarial Training The concept of using an adversary to remove undesired features has been discovered and applied separately in transfer learning (Ganin et al., 2016), privacy preservation (Mirjalili et al., 2020), and algorithmic fairness (Zhang et al., 2018a). When applying this technique to transfer learning, Ganin et al. (2016) term it domain adversarial training.
Due to its effectiveness in domain transfer learning, multiple works have studied applications of domain adversarial learning to cross-lingual transfer (Guzman-Nateras et al., 2022; Lange et al., 2020; Joty et al., 2017). Most relevant, Sherborne and Lapata (2022) combine a multi-class language discriminator with a translation loss to improve cross-lingual transfer.
In this space, we contribute the following four novel findings. First, we show that binary discrimination is more effective than multi-class discrimination and provide intuitive reasoning for this surprising phenomenon. Second, we show that adversarial alignment can increase the accidental translation phenomenon (Xue et al., 2021) in models with pretrained decoders. Third, we show that token-level adversarial discrimination improves transfer to intra-sentential codeswitching. Finally, we remove the challenge of zero-shot hyperparameter search with a novel constrained optimization technique that can be configured a priori based on our alignment goals.
Preventing Multilingual Forgetting Beyond adversarial techniques, prior work has used regularization to maintain multilingual knowledge learned only during pretraining. Li et al. (2018) show that penalizing distance from a pretrained model is a simple and effective technique to improve transfer. Using a much stronger inductive bias, Wu and Dredze (2019) freeze the early layers of multilingual models to preserve multilingual knowledge, leaving later layers unconstrained for task-specific data. We show that DAMP outperforms these baselines, providing the first comparison of traditional regularization to adversarial cross-lingual transfer.

Methods
We utilize two separate stages of alignment to improve zero-shot transfer in DAMP. During pretraining, we use contrastive learning to improve alignment among pretrained representations. During finetuning, we add a second stage of alignment through domain adversarial training with a binary language discriminator and a constrained optimization approach. We apply these improvements to the encoder of a pointer-generator network that copies and generates tags to produce a parse.

Baseline Architecture
Following Rongali et al. (2020), we use a pointer-generator network to generate semantic parses. We tokenize words $[w_0, w_1, \ldots, w_m]$ into sub-words $[s_{0,w_0}, \ldots, s_{n,w_0}, s_{0,w_1}, \ldots, s_{n,w_m}]$ and retrieve hidden states $[h_{0,w_0}, \ldots, h_{n,w_0}, h_{0,w_1}, \ldots, h_{n,w_m}]$ from our encoder. We use the hidden state of the first subword of each word as its word-level hidden state:

$$h_{w_i} = h_{0,w_i} \quad (1)$$

Using these word-level states as a prefix, we use a randomly initialized auto-regressive decoder to produce representations. At each action-step $a$, we produce a generation logit vector $g_a$ using a perceptron over the vocabulary of intents and slot types, and a copy logit vector $c_a$ for the arguments from the original query using similarity with Eq. 1. Finally, we produce a probability distribution $p_a$ across both generation and copying by applying the softmax to the concatenation of our logits, and optimize the negative log-likelihood of the correct prediction $a^*$:

$$p_a = \operatorname{softmax}([g_a; c_a]), \qquad L_s = -\log p_a(a^*)$$

Since our double alignment procedure removes language-specific information from the hidden state passed to the decoder, this copy mechanism is essential to generating parses with multilingual tokens, as shown in Section 5.3.
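To make the generate-vs-copy decision concrete, the following is a minimal sketch of a single decoding step under the formulation above; the tensor shapes, dot-product similarity, and names are our assumptions rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def decoder_step(dec_state, word_states, gen_head):
    """One action-step of a pointer-generator decoder (minimal sketch).

    dec_state:   (hidden,)           decoder representation at this step
    word_states: (num_words, hidden) word-level encoder states (Eq. 1)
    gen_head:    nn.Linear from hidden size to |intents| + |slot types|
    """
    g = gen_head(dec_state)              # generation logits over the tag vocabulary
    c = word_states @ dec_state          # copy logits: similarity with each source word
    return F.softmax(torch.cat([g, c]), dim=-1)  # one distribution over generate/copy

hidden = 768
gen_head = torch.nn.Linear(hidden, 117 + 78)   # e.g. MTOP: 117 intents, 78 slot types
p_a = decoder_step(torch.randn(hidden), torch.randn(12, hidden), gen_head)
# Training minimizes the negative log-likelihood of the gold action a*:
# loss = -torch.log(p_a[a_star])
```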

Alignment Pretraining
We evaluate AMBER, the contrastive pretraining process introduced by Hu et al. (2021), for semantic parsing. AMBER combines three explicit alignment objectives: translation language modeling, sentence alignment, and word alignment via attention symmetry.
Translation language modeling was originally proposed by Conneau and Lample (2019). The technique is simply traditional masked language modeling, but with parallel sentence pairs as input and tokens masked in each language. Since masked words can often be unmasked from the parallel sentence, this encourages the model to align word- and phrase-level representations so that they can be used interchangeably across languages.
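As a rough illustration of the idea (not AMBER's actual data pipeline; the whitespace tokenization and 15% masking rate are assumptions), a translation language modeling example can be built by concatenating a translation pair and masking on both sides:

```python
import random

MASK, SEP = "[MASK]", "[SEP]"

def make_tlm_example(src_tokens, tgt_tokens, mask_prob=0.15, seed=0):
    """Concatenate a parallel pair and mask tokens in both languages.

    Because a masked word is often unmasked in its translation, the model
    is pushed to use the other language's context, aligning representations.
    """
    rng = random.Random(seed)
    pair = src_tokens + [SEP] + tgt_tokens
    labels = [None] * len(pair)
    for i, tok in enumerate(pair):
        if tok != SEP and rng.random() < mask_prob:
            labels[i] = tok      # predict the original token at this position
            pair[i] = MASK
    return pair, labels

inputs, labels = make_tlm_example(
    ["remind", "me", "to", "sleep", "early"],
    ["recuérdame", "ir", "a", "dormir", "temprano"],
)
```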
Sentence alignment (Conneau et al., 2018) directly optimizes similarity of representations across languages using a siamese network training process. Given an English sentence with pooled representation $e_i$, the model minimizes the negative log-likelihood of the true translation $t_i$ compared to a batch of possible translations $B$:

$$L_{SA} = -\log \frac{\exp(\mathrm{sim}(e_i, t_i))}{\sum_{t' \in B} \exp(\mathrm{sim}(e_i, t'))}$$

Finally, AMBER encourages word-level alignment by optimizing an attention symmetry loss (Cohn et al., 2016). For each attention head $h \in H$, a sentence in language $S$, and its translation in language $T$, we maximize the similarity of the cross-attention matrices $A^h_{S \to T}$ and $A^h_{T \to S}$.
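A minimal in-batch version of the sentence-alignment objective might look as follows, assuming dot-product similarity and that each English sentence's translation sits at the same batch index; this is a sketch, not the AMBER training code.

```python
import torch
import torch.nn.functional as F

def sentence_alignment_loss(eng_pooled, trans_pooled):
    """Contrastive sentence alignment: each English sentence should score
    its own translation higher than every other translation in the batch.

    eng_pooled, trans_pooled: (batch, hidden) pooled sentence representations
    """
    sims = eng_pooled @ trans_pooled.T        # (batch, batch) similarity matrix
    targets = torch.arange(sims.size(0))      # true pairs lie on the diagonal
    return F.cross_entropy(sims, targets)     # NLL of the true translation

loss = sentence_alignment_loss(torch.randn(8, 768), torch.randn(8, 768))
```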

Cross-Lingual Adversarial Alignment
We build on the domain adversarial training process of Ganin et al. (2016). First, we use a token-level language discriminator to get aligned representations at the word level. Unlike prior work, we propose and justify a binary scheme that treats all languages not found in the training data as a single class rather than distinguishing each language separately. Finally, we introduce a general constrained optimization approach for adversarial training and apply it to cross-lingual alignment.
Token-Level Discriminator Similar to Ganin et al. (2016), we train a discriminator to distinguish between in-domain training data and unlabeled out-of-domain data. Our method assumes access to labeled training queries in one language, in this case English, and unlabeled queries in multiple other languages which target the same intents and slots. Data is sampled evenly from all languages to create an adversarial dataset with equal amounts of each language. We use a two-layer perceptron to predict the probability $p = P(\text{En} \mid h_{0,w_n})$ that a token with true label $y$ is English, given the word-level hidden representations from Eq. 1. Our discriminator loss is the traditional binary cross-entropy:

$$L_d = -\left[y \log p + (1 - y) \log(1 - p)\right]$$

This varies from prior work using domain adversarial training for multilingual robustness (Lange et al., 2020; Sherborne and Lapata, 2022), which performs multi-class classification across all languages and uses the negative log-likelihood of the correct class as the loss function. While this loss function is intuitively correct for the discriminator, it allows the generator to optimize towards maxima which do not benefit multilingual transfer.
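A minimal sketch of such a discriminator and its loss is below; the layer sizes and names are assumptions, not the paper's released code.

```python
import torch
import torch.nn as nn

class BinaryLanguageDiscriminator(nn.Module):
    """Two-layer perceptron predicting P(English | token representation)."""

    def __init__(self, hidden=768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, token_states):                # (num_tokens, hidden)
        return self.net(token_states).squeeze(-1)   # logits for "is English"

disc = BinaryLanguageDiscriminator()
loss_fn = nn.BCEWithLogitsLoss()                    # binary cross-entropy
tokens = torch.randn(32, 768)                       # toy token representations
is_english = torch.randint(0, 2, (32,)).float()     # 1 = English, 0 = Non-English
loss_d = loss_fn(disc(tokens), is_english)
```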
To see why, suppose we have labeled data in English and unlabeled data in both Spanish and French. The goal of the multi-class adversary is to predict English, Spanish, or French for each token, while the encoder aims to minimize the adversary's ability to recover the correct language. Even before adversarial training, the adversary is likely to struggle with tokens that are already well aligned across languages.
For example, "dormir" in the Spanish sentence "recuérdame ir a dormir temprano (remind me to go to sleep early)" will be well aligned between French and Spanish since "dormir" translates to "to sleep" in both languages. This means the encoder can simply maintain alignment for the token "dormir" across French and Spanish, making it impossible for the adversary to recover the correct language. Doing so maximizes the multi-class adversarial loss but does not improve alignment between "dormir" and the English "to sleep" in our labeled data. This extreme example highlights that the multi-class adversarial loss can be maximized without improving transferability from English.
Using a binary "English" vs. "Non-English" classifier removes such suboptimal solutions. Since both Spanish and French are now labeled "Non-English", the generator receives no reward for alignment between the two. This means the model with a binary adversary must align both French and Spanish to English, rather than to one another.
Constrained Optimization Traditionally, domain adversarial training uses a gradient reversal layer (Ganin et al., 2016) to let the generator maximize the adversary loss $L_d$, weighted by a hyperparameter $\lambda$, while minimizing the task loss $L_s$. For the generator, this is effectively equivalent to optimizing a linear combination of the two terms:

$$L = L_s - \lambda L_d$$

Selecting a schedule for $\lambda$ presents a challenge in the zero-shot setting. Since the reverse validation procedure used to select the $\lambda$ schedule by Ganin et al. (2016) assumes only one target domain, multilingual works such as Sherborne and Lapata (2022) opt to simply perform a linear search using the in-domain development set. This approach ignores transfer performance entirely when weighing the adversary loss. Instead, we propose a novel constrained optimization method which balances adversarial and task loss automatically, using a constraint derived from first principles.
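For reference before turning to the constrained variant, the gradient reversal layer can be sketched in a few lines; this is the standard construction from Ganin et al. (2016), not code from this paper.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; scales gradients by -lam on the backward
    pass, so the encoder beneath it ascends the discriminator loss."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    # Inserted between the encoder and the discriminator, e.g.:
    # disc_logits = discriminator(grad_reverse(token_states, lam))
    return GradReverse.apply(x, lam)
```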
Our goal is to obtain token representations that are exactly aligned across languages. Any well-fit adversary will predict English with $P = 0.5$ on such data and receive a loss of 0.3, since it cannot perform better than chance. In equilibrium, the generator cannot increase the loss above 0.3, since the adversary can simply predict $P = 0.5$ for all inputs regardless of the ground-truth labels.
This reasoning provides a clear constraint: in alignment, $L_d$ should be no less than 0.3, a value we call $\epsilon$. We then optimize the task loss $L_s$ while enforcing this constraint. We do so with minimal additional computation cost, using backpropagation alone, via the differential method of multipliers (Platt and Barr, 1987). The differential method of multipliers first relaxes the constrained problem to its Lagrangian dual:

$$\max_{\lambda \ge 0} \min_{\theta} \; L_s + \lambda(\epsilon - L_d)$$

Unlike Sherborne and Lapata (2022), this lets us treat $\lambda$ as a learnable parameter and optimize it to maximize the value of $\lambda(\epsilon - L_d)$ with stochastic gradient ascent. In plain terms, our optimization increases the value of $\lambda$ when $\epsilon > L_d$ and decreases it when $\epsilon < L_d$. This produces a schedule for $\lambda$ which weights the adversarial penalty only when the adversary is accurate. In Appendix 3, we show how $\lambda$ evolves throughout training to maintain the constraint.
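The $\lambda$ update itself is only a few lines on top of a standard training loop. The sketch below is one reading of the method, not the paper's code; $\epsilon = 0.3$ follows the text above (note that for PyTorch's natural-log binary cross-entropy, the chance-level value would instead be $\ln 2 \approx 0.693$).

```python
import torch

EPSILON = 0.3  # constraint on the adversary loss, as set in the paper

# lambda becomes a learnable Lagrange multiplier, not a tuned hyperparameter
lam = torch.zeros(1, requires_grad=True)
lam_opt = torch.optim.SGD([lam], lr=1e-2)

def constrained_step(task_loss, disc_loss, model_opt):
    """One generator update via the differential method of multipliers.

    The encoder minimizes L_s + lam * (EPSILON - L_d), while lam performs
    gradient *ascent* on lam * (EPSILON - L_d): it grows while the adversary
    beats the constraint (L_d < EPSILON) and shrinks once alignment holds.
    The discriminator itself is trained separately to minimize L_d.
    """
    lagrangian = task_loss + lam * (EPSILON - disc_loss)
    model_opt.zero_grad()
    lam_opt.zero_grad()
    lagrangian.backward()
    lam.grad = -lam.grad            # flip sign: ascent for the multiplier
    lam_opt.step()
    model_opt.step()
    with torch.no_grad():
        lam.clamp_(min=0.0)         # keep the multiplier non-negative
```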

Experiments
We evaluate the effects of our techniques on three benchmarks for task-oriented semantic parsing with hierarchical parse structures. Two of these datasets evaluate robustness to intra-sentential codeswitching (Einolghozati et al., 2021;Agarwal et al., 2022) and the third uses multilingual data to evaluate robustness to inter-sentential codeswitching (Li et al., 2021). Examples are divided as originally released into training, evaluation, and test data at a ratio of 70/10/20.

Datasets
Multilingual Task Oriented Parsing (MTOP) Li et al. (2021) introduced this benchmark to evaluate multilingual transfer for a difficult compositional parse structure. The benchmark contains queries in English, French, Spanish, German, Hindi, and Thai. Zero-shot performance on this benchmark is used to evaluate inter-sentential codeswitching robustness. Each language has approximately 15,000 total queries covering 11 domains with 117 intents and 78 slot types.

Hindi-English Task Oriented Parsing (CST5) Agarwal et al. (2022) construct a benchmark of Hindi-English intra-sentential codeswitching data using the same label space as the second version of the English Task Oriented Parsing benchmark.

Spanish-English Task Oriented Parsing (CSTOP) Einolghozati et al. (2021) construct a benchmark of Spanish-English codeswitching data. While the dataset was released with a corresponding English dataset in the same label space, that data is now unavailable. Therefore, we construct an artificial dataset in the same label space by applying Google Translate to each segment of the structured Spanish-English training data. While the resulting English data is noisy, it provides an estimate of zero-shot transfer from English to Spanish-English codeswitching. The resulting dataset has 5,803 queries in both English and Spanish-English, covering 2 domains with 19 intents and 10 slot types.

Results
We use the same hyperparameter configuration for all settings. The encoder uses the mBERT architecture (Pires et al., 2019). The decoder is a randomly initialized 4-layer, 8-head vanilla transformer, for comparison with the 4-layer decoder structure used in Li et al. (2021). We use AdamW and optimize for 1.2 million training steps with early stopping, using a learning rate of 2e-5 and a batch size of 16, decaying the learning rate to 0 over the course of training. We train on a Cloud TPU v3 Pod for approximately 4 hours per dataset. For all adversarial experiments, we use the unlabeled queries from MTOP as training data for our discriminator and a loss constraint of 0.3, as justified in Section 3.3. The English data from each benchmark is used for training and early-stopping evaluation. We report Exact Match (EM) accuracy on all test splits. In all tables, results that significantly (p = 0.05) improve over all others are marked with a †, using the bootstrap confidence interval (Dror et al., 2018).

MTOP In Table 1, we report the results of our architecture with mBERT, AMBER, and DAMP compared to existing baselines from prior work: XLM-R with a pointer-generator network and mT5 models at several scales. The AMBER pretraining process significantly improves accuracy for all languages, to an average of 23.6. Average accuracy across the 5 Non-English languages improves by 47x. English accuracy also improves, to 84.2 from 78.6, instead of suffering negative transfer (Wang et al., 2020). DAMP further improves accuracy over AMBER by 1.8x to 42.2, outperforming both similarly sized models (byT5-Base, +34.2; mT5-Base, +16.1) and models three times its size (mT5-Large, +10.8; XLM-R, +3.4). mT5-XXL maintains state-of-the-art performance of 55.1 but requires 33x more parameters and multiple GPUs for inference, which increases latency and compute cost.
Adversarial alignment improves performance in each language by at least 10 points, with Hindi and Thai, the testing languages most distant from English, showing the largest improvements of +20.7 and +26.5 respectively. DAMP improves over the mBERT baseline by 84x without architecture changes or additional inference cost.

Codeswitching In Table 2, we report the results on both intra-sentential codeswitching benchmarks. For Hindi-English, we compare against mT5 baselines of various sizes. AMBER again leads to a performance improvement for both CST5 and CSTOP, across English (+1.4, +5.5) and codeswitched (+12.9, +52.4) data. DAMP further improves transfer results (+3.8, +1.0) at the cost of small losses in English performance (-0.2, -0.7). DAMP achieves a new state-of-the-art of 20.5 on zero-shot transfer for CST5, outperforming even mT5-XXL (20.3). Since both alignment stages have word-level objectives, we hypothesize that the word-level inductive bias provides benefits for intra-sentential codeswitching despite lacking explicit codeswitching supervision.

Adversary Ablation
In Table 3, we isolate the effects of our contributions to domain adversarial training with an ablation study. While all adversarial variants improve transfer results, we see that using a binary adversary and our constrained optimization technique are both mutually and independently beneficial to adversarial alignment. Notably, DAMP improves over the unconstrained multi-class adversarial technique used in Sherborne and Lapata (2022) by 9.9, 6.4, and 0.9 EM accuracy points on MTOP, CST5, and CSTOP respectively.

Regularization Comparison
We also compare adversarial training to regularization techniques used in cross-lingual learning. We experiment with freezing the first 8 layers of the encoder (Wu and Dredze, 2019) and with $L_1$ and $L_2$ norm penalties (Li et al., 2018). Adversarial learning outperforms these baselines on MTOP and CSTOP, while model freezing and $L_2$ norm penalization outperform adversarial learning on CST5. However, adversarial learning is the only method that improves across all benchmarks.

Pretrained Decoder Comparison
Finally, we evaluate whether our constrained adversarial alignment technique offers similar benefits to models with pretrained decoders, given their natural advantage in generation tasks. We find that adversarial training does worse than the plain mT5 model (-9.6), which we attribute to accidental translation in the pretrained decoder. With a copy mechanism added, adversarial alignment again improves performance in this variant for MTOP and CSTOP. However, the mT5 decoder struggles to adapt to this task, making overall performance worse than DAMP.

Improvement Analysis
Since exact match accuracy is a strict metric, we analyze our improvements through qualitative analysis. We examine examples that DAMP predicts correctly but AMBER and mBERT do not, randomly sampling 20 examples from each language for manual evaluation.
Improvements in intent prediction are a large portion of the gain. If intent prediction fails, the rest of the auto-regressive decoding goes awry as the decoder attempts to generate valid slot types for that intent. We report intent prediction results across the test dataset in Table 4.
In general, these improvements follow a trend from nonsensical errors to reasonable errors to correct predictions. For example, given the French phrase "S'il te plait appelle Adam.", meaning "Please call Adam.", mBERT predicts the intent QUESTION_MUSIC, AMBER predicts GET_INFO_CONTACT, and DAMP predicts the correct CREATE_CALL.
Within the slots themselves, the primary improvement in DAMP is more accurate placement of articles and prepositions such as "du", "a", "el", and "la" inside slot boundaries, which is of arguable downstream importance.
We present the full sample of examples used for this analysis in Tables 5-9 in the Appendix.

Alignment Analysis
We analyze how well our alignment goals are met using two methods, shown in Figure 1. First, we use a two-dimensional projection of the resulting encoder embeddings to provide visual intuition for alignment. Then, we more reliably quantify alignment using a post-hoc linear probe.

Embedding Space Visualization
In Figure 1, we visualize the embedding spaces of each model variant on each MTOP test set using Uniform Manifold Approximation and Projection (UMAP) (McInnes et al., 2018). Our visualization of mBERT provides a strong intuition for its poor results, as English and Non-English data form linearly separable clusters even within this reduced embedding space. With AMBER, this global clustering behavior is removed and replaced by small local clusters of English and Non-English data. Finally, DAMP produces an embedding space with no clear visual clusters of Non-English data separate from English data.
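For readers reproducing this kind of view, a minimal sketch with the umap-learn package is below; the random stand-in embeddings and language labels are placeholders for real encoder outputs, not artifacts from the paper.

```python
import numpy as np
import umap  # pip install umap-learn
import matplotlib.pyplot as plt

token_states = np.random.randn(1000, 768)           # stand-in encoder states
languages = np.random.choice(["en", "es", "hi"], 1000)

# Project to two dimensions and color points by language
coords = umap.UMAP(n_components=2, random_state=0).fit_transform(token_states)
for lang in np.unique(languages):
    mask = languages == lang
    plt.scatter(coords[mask, 0], coords[mask, 1], s=4, label=lang)
plt.legend()
plt.show()
```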

Post-Hoc Probing
We evaluate improvements to alignment quantitatively. While Sherborne and Lapata (2022) report the performance of the training adversary as evidence of successful training, this method has been shown to be insufficient due to mode collapse during training (Elazar and Goldberg, 2018; Ravfogel et al., 2022). Therefore, for each variant we train a linear probe on the frozen model after training, using 10-fold cross-validation.
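Such a probe is straightforward with scikit-learn; this minimal sketch uses stand-in arrays in place of frozen encoder states and language labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

token_states = np.random.randn(2000, 768)    # frozen encoder states (stand-in)
is_english = np.random.randint(0, 2, 2000)   # 1 = English, 0 = Non-English

probe = LogisticRegression(max_iter=1000)
scores = cross_val_score(probe, token_states, is_english, cv=10)
print(f"probe accuracy: {scores.mean():.2%}")  # high accuracy = poor alignment
```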
Supporting the visual intuition, probe performance decreases with each stage of alignment. On mBERT, the discriminator achieves 98.07% accuracy, indicating poor alignment. AMBER helps, but the discriminator still achieves 93.15% accuracy, indicating the need for further removal of language information. DAMP results in a 23.62-point drop in discriminator accuracy, to 69.53%. This is still far above chance, despite our training adversary converging to close-to-random accuracy, which indicates both the need for post-hoc probing and the possibility of further alignment improvements.

Conclusions and Future Work
In this work, we introduce a Doubly Aligned Multilingual Parser (DAMP), a semantic parsing training regime that uses contrastive alignment pretraining and adversarial alignment during finetuning with a novel constrained optimization approach.
We demonstrate that both of these stages of alignment benefit transfer learning in semantic parsing to both inter-sentential (multilingual) and intra-sentential codeswitched data, outperforming both similarly sized and larger models. We analyze the effects of our adversarial alignment method, comparing it broadly to both prior adversarial techniques and regularization baselines, and evaluate its generalizability through applications to pretrained decoders. Finally, we interpret the impacts of both stages of alignment through qualitative improvement analysis and quantitative probing.

Limitations
This work only carries out experiments using English as the base training language for domain adversarial transfer. It is possible that domain adversarial transfer has a variable effect depending on the training language from which labeled data is drawn. Additionally, while typologically and regionally diverse, all but one of the languages used in our evaluation are of Indo-European origin.