Cross-lingual Intermediate Fine-tuning improves Dialogue State Tracking

Recent progress in task-oriented neural dialogue systems is largely focused on a handful of languages, as annotation of training data is tedious and expensive. Machine translation has been used to make systems multilingual, but this can introduce a pipeline of errors. Another promising solution is using cross-lingual transfer learning through pretrained multilingual models. Existing methods train multilingual models with additional code-mixed task data or refine the cross-lingual representations through parallel ontologies. In this work, we enhance the transfer learning process by intermediate fine-tuning of pretrained multilingual models, where the multilingual models are fine-tuned with different but related data and/or tasks. Specifically, we use parallel and conversational movie subtitles datasets to design cross-lingual intermediate tasks suitable for downstream dialogue tasks. We use only 200K lines of parallel data for intermediate fine-tuning which is already available for 1782 language pairs. We test our approach on the cross-lingual dialogue state tracking task for the parallel MultiWoZ (English -> Chinese, Chinese -> English) and Multilingual WoZ (English -> German, English -> Italian) datasets. We achieve impressive improvements (> 20% on joint goal accuracy) on the parallel MultiWoZ dataset and the Multilingual WoZ dataset over the vanilla baseline with only 10% of the target language task data and zero-shot setup respectively.


Introduction
In recent years, task-oriented dialogue systems have achieved remarkable success by leveraging huge amounts of labelled data. This technology is thus limited to a handful of languages, as collecting and annotating training dialogue data for different languages is expensive and requires supervision from native speakers.
To avoid having to create large annotated datasets for every new language, recent works focus on transfer learning methods which use neural machine translation systems (Schuster et al., 2019), code-mixed data augmentation (Qin et al., 2020) or large multilingual models (Lin and Chen, 2021). Neural machine translation models incur the additional overhead of training on millions of parallel sentences that may not be available for all language pairs. Code-mixed data augmentation methods replace individual words from the source language with words from the target language by using parallel word pairs found in a dictionary. However, a simple synonym replacement may not be sufficient as the tasks become more complicated. In this paper, we focus on transfer learning via large multilingual models, which will allow us to extend models to languages with limited labelled training data.
In techniques that use multilingual models, a task-specific architecture uses the pretrained model as one of its components and is then trained with task data from a high-resource language (see Fig. 1). It is then evaluated directly or with some labelled examples in a different language. Intermediate fine-tuning, which is fine-tuning a large language model with different but related data and/or tasks before fine-tuning it for the target task, has shown considerable improvements for both monolingual and cross-lingual natural language understanding tasks (Gururangan et al., 2020). However, it is relatively under-explored for multilingual dialogue systems. In this work, we demonstrate the effectiveness of cross-lingual intermediate fine-tuning of multilingual pretrained models to facilitate the development of multilingual conversation systems. Specifically, we look at cross-lingual dialogue state tracking tasks, as they are an indispensable part of task-oriented dialogue systems. In this task, a model needs to map the user's goals and intents in a given conversation to a set of slots and values, known as a "dialogue state", based on a pre-defined ontology. Our intermediate tasks are based on interaction between the source and target languages and interaction between the dialogue history and response. These tasks involve the prediction of missing words in different conversational settings, which include monolingual conversations, concatenated parallel bilingual conversations, and cross-lingual conversations. Further, we also introduce a task as a proxy for generating a response in a cross-lingual setup.
Figure 1: Pipeline of our work. A pretrained language model is fine-tuned with the task of predicting masked words on parallel movie subtitles data. A dialogue state tracker is then trained with this new multilingual model and evaluated for cross-lingual dialogue state tracking.
Our intermediate tasks only use 200K lines of parallel data, which is available for 1782 language pairs. Using parallel data for intermediate fine-tuning is also an important addition to the intermediate fine-tuning literature, which has largely focused on related monolingual tasks. Our best method leads to an impressive performance on the standard benchmark of the Multilingual WoZ 2.0 dataset (Mrkšić et al., 2017b).

Related Work
Intermediate fine-tuning of large language models: Training deep neural networks on large unlabelled text data to learn meaningful representations has shown remarkable success on several downstream tasks. These representations can be monolingual (Qiu et al., 2020) or multilingual (Devlin et al., 2019; Conneau and Lample, 2019; Artetxe and Schwenk, 2019) depending on the underlying training data. These representations are further refined to suit the downstream task by fine-tuning the pretrained model on related data and/or tasks. This "intermediate" fine-tuning is done before fine-tuning the task-specific architecture on the downstream task.
In adaptive intermediate fine-tuning, a pretrained model is fine-tuned with the same objectives used during pretraining on data that is closer to the distribution of the target task. This is referred to as task adaptive pretraining (TAPT) if the unlabelled text of the task dataset is used (Gururangan et al., 2020; Howard and Ruder, 2018; Mehri et al., 2019) and domain adaptive pretraining if unlabelled data of the target domain is used (Gururangan et al., 2020; Han and Eisenstein, 2019). Closer to our problem, Lin and Chen (2021) also use TAPT for generative dialogue state tracking. Another popular method is intermediate task training. Instead of fine-tuning with the objectives used during pretraining of the model, the pretrained model is fine-tuned with single or multiple related tasks as an intermediate step (Phang et al., 2019; Glavaš and Vulić, 2021). We use the umbrella term of intermediate fine-tuning while discussing our methods.
Our work uses OpenSubtitles (Lison and Tiedemann, 2016), a parallel movie subtitle corpus, as the unlabelled target domain resource. Instead of using the pretraining objectives of the underlying language model directly, we experiment with existing and new objectives to leverage the conversational and cross-lingual nature of the parallel data. As training data for dialogue tasks is scarce across different languages, instead of relying on related task datasets to perform intermediate fine-tuning, we leverage the dialogue data available through OpenSubtitles (see Table 1).
Cross-lingual dialogue state tracking: Dialogue state tracking (DST) is one of the most studied problems in task-oriented conversational systems (Mrkšić et al., 2017a; Ren et al., 2018). The goal of the dialogue state tracker is to accurately identify the user's goals and requests at each turn of the dialogue. These goals and requests are stored in a dialogue state which is predefined based on the ontology of the given domain. For example, the restaurant reservation domain will consist of slot-names like "price-range" and values like "cheap". Dialogue state tracking has been explored extensively for the monolingual setup, but there are limited works for a multilingual setting. A popular benchmark for cross-lingual dialogue state tracking is the Multilingual WoZ 2.0 dataset (Mrkšić et al., 2017b), where a dialogue state tracker is trained only on English data and is evaluated directly for German and Italian dialogue state tracking. XL-NBT, the first neural cross-lingual dialogue state tracker, uses a teacher-student network where the teacher network has access to task labelled data in the source language. The teacher also has access to parallel data, which allows it to transfer knowledge to the student network trained in the target language. A couple of recent works resort to code-mixed data augmentation to enhance transfer learning.
In Attention-Informed Mixed Language Training (AMLT), a dialogue state tracker (Mrkšić et al., 2017a) is first trained with English state tracking data. Code-mixed training data is then obtained by taking the words that receive the highest attention in a given utterance during source language training and replacing them with their respective synonyms in the target language. Another method, dubbed Cross-Lingual Code Switched Augmentation (CLCSA) (Qin et al., 2020), focuses on the dynamic replacement of source language words with target language words during training. In this method, sentences within a batch are chosen randomly, and then randomly chosen words within these sentences are replaced with their synonyms from the target language. This method is state-of-the-art for the Multilingual WoZ dataset.
Another recent benchmark is the parallel MultiWoZ 2.1 dataset released as a part of the Ninth Dialogue Systems and Technologies Challenge (DSTC-9) (Gunasekara et al., 2020). Both the ontology of the dialogue states and the dialogues were translated from English to Chinese using Google Translate and then corrected manually by expert annotators. Similarly, CrossWoZ (Zhu et al., 2020a), a Chinese dialogue state tracking dataset, was translated into English. The challenge was designed to treat the source dataset as a resource-rich dataset and build a cross-lingual dialogue state tracker which would be evaluated on the low-resource target dataset. Instead, all the submissions in the shared task used the translated version of the dataset and treated the problem as a monolingual dialogue state tracking setup.
We use the Multilingual WoZ dataset and the parallel MultiWoZ dataset to demonstrate the effectiveness of our methods. As there are no existing benchmarks for cross-lingual dialogue state tracking for the parallel MultiWoZ dataset, we use the slot-utterance matching belief tracker (SUMBT) as our baseline, which was the state-of-the-art for the English MultiWoZ 2.1 dataset (Eric et al., 2020). The SUMBT model uses a BERT encoder to obtain contextual semantic vectors for the utterances, slot-names, and slot-values. It then uses a multi-head attention network to learn the relationship between slot-names and slot-values appearing in the text to predict the dialogue states.

Intermediate fine-tuning for dialogue tasks
In this section, we will provide details about the training data used for different intermediate tasks, explain existing and proposed intermediate tasks, and detail their integration into the end task.

Adaptive data extraction
Pretrained language models are often trained on news text or Wikipedia, which is different from human conversations (Wolf et al., 2019). We choose the OpenSubtitles corpus (Lison and Tiedemann, 2016) as its characteristics are suitable for our end task. The corpus is huge (beyond 3.2G sentences) and contains parallel movie dialogue data across different language pairs, allowing us to design cross-lingual tasks as well. We extract 200K parallel subtitles for every language pair. These are extracted without modifying the sequence of their occurrence in a particular film, as we intend to work on conversations and not sentences in isolation.

Tasks for intermediate fine-tuning
After extracting the task-related data, we experiment with existing and new intermediate tasks to continue fine-tuning the underlying multilingual representation for the dialogue tasks. These tasks are variants of the Cloze task (Taylor, 1953), where missing words are predicted for a given sentence/context. This task is also known as Masked Language Modelling (MLM) (Devlin et al., 2019). We introduce extensions to masked language modelling which are more suitable for the dialogue task. Our task designs are based on (i) interaction between the source and target languages and (ii) interaction between the dialogue history and response. In the rest of the work, the word "context" refers to the dialogue history.

Monolingual dialogue modelling (MonoDM):
Dialogue history is an important component of any dialogue task. We select K continuous subtitles from the monolingual subtitles data, where K is chosen randomly between 2 and 15 for every example. By choosing a random K, we ensure that the examples contain dialogues of varied length, as will be the case for any dialogue-related task. These examples are created for both the source and the target language, and 15% of the words in each example are masked.
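The construction of a MonoDM example can be illustrated with a minimal sketch. It assumes whitespace tokenization, a literal `[MASK]` placeholder, and masking of exactly 15% of the words (rounded, with at least one mask); the function names `mask_tokens` and `monodm_example` are hypothetical and are not taken from the released code.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, rate=0.15, rng=random):
    """Replace about `rate` of the tokens with MASK; return masked tokens and labels."""
    n_mask = max(1, round(len(tokens) * rate))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK if i in positions else t for i, t in enumerate(tokens)]
    labels = {i: tokens[i] for i in positions}  # position -> original word
    return masked, labels

def monodm_example(subtitles, rng=random, k_min=2, k_max=15, rate=0.15):
    """Pick K consecutive subtitle lines (K random in [k_min, k_max]) and mask words.

    Assumes `subtitles` is an ordered list with at least `k_min` non-empty lines.
    """
    k = rng.randint(k_min, min(k_max, len(subtitles)))
    start = rng.randint(0, len(subtitles) - k)  # keep the original line order
    tokens = " ".join(subtitles[start:start + k]).split()
    masked, labels = mask_tokens(tokens, rate, rng)
    return {"k": k, "start": start, "input": masked, "labels": labels}
```

Sampling a contiguous window rather than individual lines preserves the order in which the subtitles occur in the film, which is what makes each example a short conversation.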
We now look at cross-lingual intermediate tasks that leverage the parallel data in OpenSubtitles. The following tasks are designed to exploit the contextual information from the dialogue history as well as cross-lingual information through the parallel data. Please see Table 1 for examples.
Translation language modelling (TLM): Translation language modelling (TLM) was introduced while designing the Cross-Lingual Language Model (XLM) (Conneau and Lample, 2019). In TLM, parallel sentences are concatenated and words are masked across them. We further explore the importance of longer context in modelling cross-lingual embedding spaces for the conversational setting by concatenating parallel dialogues with K utterances and then masking words randomly on this concatenated text. The hypothesis is that by predicting masked words in different languages simultaneously, the model improves the alignment in its cross-lingual representation space. For the example in Table 1, the model may learn to align "bat" with "Fledermaus".
Cross-lingual dialogue modelling (XDM): This task focuses on improving the cross-lingual context-response representation space. In TLM, it is difficult to identify if the predicted word used its monolingual context or the bilingual dialogue history. To encourage a cross-lingual interaction between the dialogue history and the response, we concatenate a conversation context (K utterances) from one language and then append the reply to that conversation in the second language. The words are then randomly masked across this chat.
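The difference between the TLM and XDM inputs can be made concrete with a small sketch. It assumes whitespace tokenization, a `[SEP]` token between the two language segments, and a 15% masking rate; the helper names and the English/German sentences are invented for illustration and do not come from Table 1 or the released code.

```python
import random

MASK = "[MASK]"
SEP = "[SEP]"

def mask_random(tokens, rate, rng):
    """Mask about `rate` of the non-separator tokens, chosen uniformly at random."""
    candidates = [i for i, t in enumerate(tokens) if t != SEP]
    chosen = set(rng.sample(candidates, max(1, round(len(candidates) * rate))))
    return [MASK if i in chosen else t for i, t in enumerate(tokens)]

def tlm_example(src_turns, tgt_turns, rng, rate=0.15):
    """TLM: K source turns followed by the parallel K target turns, masked across both."""
    tokens = " ".join(src_turns).split() + [SEP] + " ".join(tgt_turns).split()
    return mask_random(tokens, rate, rng)

def xdm_example(src_turns, tgt_turns, rng, rate=0.15):
    """XDM: context (all but the last turn) in one language, reply (last turn) in the other."""
    context = " ".join(src_turns[:-1]).split()
    reply = tgt_turns[-1].split()
    return mask_random(context + [SEP] + reply, rate, rng)

# Invented parallel dialogue, aligned line by line.
rng = random.Random(0)
en = ["did you see that bat ?", "yes , it flew away ."]
de = ["hast du die Fledermaus gesehen ?", "ja , sie ist weggeflogen ."]
tlm = tlm_example(en, de, rng)  # both languages cover the whole dialogue
xdm = xdm_example(en, de, rng)  # the reply is only visible in the second language
```

In the XDM input the reply never appears in the language of the context, so a masked reply word can only be recovered through the cross-lingual dialogue history, whereas in TLM the same turn is present in both languages.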
Response masking (RM): We also experiment with a setup that acts as a proxy for generating a response in a cross-lingual setting. The context of the conversation is provided in one language and the task is to predict the words in the response independently in another language. This is a harder task than predicting randomly masked words.
Both XDM and RM are new designs for intermediate tasks, tailored for cross-lingual dialogue tasks. We also experimented with combining monolingual and cross-lingual objectives, but our pilot experiments did not show any considerable improvement over the individual objectives. The tasks for which combining multiple objectives has worked required higher reasoning and inference capabilities, like coreference resolution or question answering (Aghajanyan et al., 2021). Such highly specific task data is not available for all languages and is even more limited for conversational tasks. We will explore this direction in the future. Similarly, our initial experiments suggested that simply combining data from multiple languages for a multilingual intermediate task has lower performance than individual cross-lingual intermediate tasks. Thus, designing multilingual intermediate tasks is far from trivial, and we will also explore this in the future.
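The response masking (RM) input described above can be sketched as follows. This is one reading of the task, assuming that every word of the response is masked and predicted independently while the context stays fully visible; the `[SEP]` separator, whitespace tokenization, and the name `rm_example` are assumptions made for illustration.

```python
MASK = "[MASK]"
SEP = "[SEP]"

def rm_example(context_turns, response):
    """Keep the context (language A) intact and mask every word of the response (language B)."""
    context = " ".join(context_turns).split()
    target = response.split()
    inputs = context + [SEP] + [MASK] * len(target)
    # Labels are None at positions that are not predicted (context and separator).
    labels = [None] * (len(context) + 1) + target
    return inputs, labels

# Invented example: English context, German response.
inputs, labels = rm_example(["did you see that bat ?"], "ja , sie ist weggeflogen .")
```

Unlike random masking, no response word is left visible as a hint, which is consistent with RM being the harder task.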

Using intermediate fine-tuning for dialogue state tracking
We create 100K examples for each of the above intermediate tasks for the respective language pairs. We use the mBERT (Devlin et al., 2019) model as our starting point and continue training it with each of the above tasks separately. Thus, all of our reported experiments follow a two-step pipeline where (i) mBERT is fine-tuned with one of the tasks listed above and then (ii) a dialogue state tracking model that uses the new mBERT model is trained with source language training data, with or without additional training data in the target language. Finally, the trained dialogue state tracking model is evaluated on the target language. Please see Fig. 1 for an illustration.

Experiments
We experiment with the recently released parallel MultiWoZ dataset (Gunasekara et al., 2020) and the Multilingual WoZ dataset (Mrkšić et al., 2017b). As the datasets vary in difficulty and languages, we choose a different amount of target training data and a different dialogue state tracking architecture for each of them. We briefly describe them and discuss the results obtained with our methods.

Task description
Parallel  German dialogue states, to compare with other approaches in the literature. We use the state tracker in Qin et al. (2020), which treats the problem as a collection of binary prediction tasks, one task for each slot-value combination. The current utterance and the previous dialogue act are concatenated and passed through the pretrained multilingual encoder. Each slot-value pair is also passed through the encoder to obtain its representation. These representations are then fed into a classification layer. We do not use SUMBT for this dataset, as its cross-lingual state tracking performance was not as competitive as other models in the literature. The training details are listed in Appendix A.

Metrics
The metrics used for dialogue state tracking tasks are turn-level and generally include Slot Accuracy, Slot F1, and Joint Goal Accuracy (JGA). Their descriptions are as follows:
Slot Accuracy: Proportion of the correct slots predicted across all utterances.
Slot F1: Macro-average of the F1 score computed over the individual slot-types and slot-values for every turn.
Joint Goal Accuracy: Proportion of examples (dialogue turns) where the predicted dialogue state exactly matches the ground truth dialogue state.
We report Slot F1 and Joint Goal Accuracy for the parallel MultiWoZ dataset. The En state has 135 slot types, while the average number of slot types per utterance is 5. Slot accuracy also counts as correct every slot that is absent from both the prediction and the ground truth. Consider a turn with 130 such unpredicted slots, 3 correct slots and 2 incorrect slots. By the definition of accuracy, the score is 133/135 ≈ 0.985, which obscures the two incorrect slots. Thus, we do not report slot accuracy, as it is the least informative indicator of improvement.
We report Joint Goal Accuracy for the Multilingual WoZ dataset, where the state only consists of informable slots. Slot Accuracy for informable slots and Request Accuracy for requestable slots are also reported, in line with the literature for this task.
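The contrast between slot accuracy and joint goal accuracy in this example can be reproduced with a short sketch. It assumes that a dialogue state is a dictionary from slot names to values, that slot accuracy is computed for a single turn, and that the slot names are placeholders rather than entries from the MultiWoZ ontology.

```python
def slot_accuracy(pred, gold, all_slots):
    """Fraction of ontology slots whose predicted value (or absence) matches the gold state."""
    correct = sum(pred.get(s) == gold.get(s) for s in all_slots)
    return correct / len(all_slots)

def joint_goal_accuracy(pred_states, gold_states):
    """Fraction of turns whose predicted state equals the gold state exactly."""
    matches = sum(p == g for p, g in zip(pred_states, gold_states))
    return matches / len(gold_states)

# 135 slot types; 5 are filled in this turn, the other 130 are absent from both states.
all_slots = [f"slot-{i}" for i in range(135)]
gold = {f"slot-{i}": "a" for i in range(5)}
pred = dict(gold)
pred["slot-3"] = "b"  # incorrect value
pred["slot-4"] = "b"  # incorrect value

acc = slot_accuracy(pred, gold, all_slots)  # 133 of 135 slots agree
jga = joint_goal_accuracy([pred], [gold])   # the turn is not an exact match
```

Here `acc` is 133/135 even though two of the five active slots are wrong, while `jga` for the same turn is 0, which is why joint goal accuracy is the stricter signal.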

Results
We report the results of models with and without intermediate task learning for the parallel MultiWoZ dataset in Table 2 and the Multilingual WoZ dataset in Table 3. We compare the performance of our intermediate fine-tuning methods with task-adaptive pretraining (TAPT) to distinguish the design of our intermediate tasks from simply using the task training data. We also compare our methods on Multilingual WoZ with XL-NBT, Attention-Informed Mixed Language Training and CLCSA (Qin et al., 2020).
Our results show that intermediate fine-tuning of a language model is indeed helpful for dialogue state tracking. Further, the cross-lingual objectives (XDM, RM, TLM) are superior to task adaptive pretraining (TAPT) and competitive with the monolingual objective (MonoDM), with TLM consistently performing better than the other cross-lingual objectives in target language state tracking. This also suggests that the use of bilingual dialogue history (TLM) is superior to the use of cross-lingual context (XDM) or a harder response generation task (RM) for these datasets.
Table 2 shows improvements over the vanilla baseline on joint goal accuracy for the target languages Zh and En. The best intermediate task (TLM) has an improvement of 20.4% and 24.3% on joint goal accuracy for En → Zh and Zh → En respectively. The Slot F1 score shows similar trends to the joint goal accuracy. Intermediate fine-tuning helps to improve the performance for source language state tracking as well, with monolingual objectives (TAPT, MonoDM) exhibiting superior performance as they are trained with monolingual task data.
Comparison with machine translation: As there are no other baselines available for MultiWoZ, we also compare our approach to translation-based methods in Table 2. We follow the setup for In-language training, Translate-train, and Translate-test as described in Hu et al. (2020). In In-language training, we fine-tune the mBERT model directly with target language training data. For the Translate-train models, we first translate the source language training data of the dialogue task into the target language and then train a dialogue state tracking model with mBERT on the translated target language data. For Translate-test, the dialogue state tracking model is trained with the source language data on a source language BERT. At test time, the target language instances are translated into the source language to predict the dialogue states for these instances. Our machine translation models are large transformer models (Vaswani et al., 2017) trained on Paracrawl data (Bañón et al., 2020) for En → Zh and Zh → En respectively. Our setup improves over the Translate-test approach, which uses these additional translation models and monolingual BERT models. We also find that Translate-train and In-language training struggle in this setup, as the model has to map a target language utterance to a source language state instead of a target language state. Further, following guidelines from Hu et al. (2020), these models are trained with multilingual BERT which is trained on 108 languages, leading to a noisier representation space than a monolingual BERT. Overall, we find that the scores are higher for Zh → En than En → Zh. We speculate this trend is due to the presence of translationese when using Zh as the source language, as the dataset was originally in English and then translated to Chinese, in line with observations from the neural machine translation literature (Edunov et al., 2020).
Additive effect of TLM with CLCSA: In Table 3, we find that TLM has 27.5% and 24.3% improvement over the vanilla baseline on joint goal accuracy for De and It respectively. It also outperforms the baselines from the literature except for the CLCSA method. The CLCSA method uses dynamic code-mixed data for training the state tracker. We observe that using TLM with the CLCSA model has an additive effect, improving over a CLCSA model that does not use TLM as an intermediate fine-tuning task. Please note that our experiments for both CLCSA and CLCSA + TLM used the uncased version of multilingual BERT, as opposed to the cased version used in the original CLCSA results, as the uncased version performed better. We also find that RM is not well suited for this task, suggesting that response prediction is not a suitable intermediate task for the simple scenarios of the WoZ dataset.

Analysis
We analyse the outputs from the state tracker and design choices for the intermediate tasks. We also provide insights into the difficulty of conducting zero-shot transfer learning using the SUMBT architecture for the MultiWoZ dataset.

Qualitative analysis
We manually analyzed the predicted dialogue states for 200 chats from these models for the MultiWoZ dataset. Overall, we found that models trained with intermediate tasks improve over the vanilla baselines in detecting cuisine names, names of restaurants, and time periods for booking (taxi/restaurant). All models show some confusion in detecting whether a location corresponds to arrival or departure. We observe that predicting a dialogue state incorrectly at an earlier stage has a cascading effect on the later dialogue states. For the Multilingual WoZ dataset, the baseline models struggled to identify less frequent cuisines. There was also confusion between predicting "cheap" and "moderate" in the target languages. These errors were reduced with intermediate fine-tuning.
Please see examples in Appendix C.

Investigating zero-shot transfer for MultiWoZ dataset
We make a case for using 10% of training data in the target language and retaining the language of the source state for the MultiWoZ dataset. We illustrate different training data choices in Table 4, focusing on the En → Zh setup. The zero-shot setup is difficult for the models: with the vanilla baseline model, it seems nearly impossible to learn a dialogue state tracker for Chinese. Even with TLM, while there is an improvement in the multilingual representation space, it is not adequate for a generalized transfer across languages. However, when a pretrained model that has been fine-tuned with a cross-lingual objective is trained with as little as 1% labelled target language training data (84 chats), we observe a 19.3% improvement in joint goal accuracy for the target language over the zero-shot vanilla baseline. This also indicates the data efficiency of cross-lingual intermediate fine-tuning. As the amount of target training data increases, the performance for the target language improves while the source language performance degrades.
We also found that, with the SUMBT model, using target language states during evaluation gives lower performance than using source language dialogue states for this dataset. A dialogue state tracker trained with TLM in the zero-shot setup had a joint goal accuracy of 1%. We recommend mapping the dialogue states from the source language to the target language directly for use cases that require the dialogue state to be predicted in the target language.

Domain of adaptive task data
We considered the parallel document-level data released for the WMT'19 challenge (Bojar et al., 2019). We look at the En-Zh parallel data consisting of news articles that are aligned by paragraphs. We fine-tune the mBERT model with the TLM task for parallel paragraphs. We report our results for the MultiWoZ dataset in Table 5. We find that using dialogue data has a slight advantage over using parallel news text. This suggests that cross-lingual alignment itself, rather than the domain of the intermediate task data, is largely responsible for the increase in joint goal accuracy over the baseline. Nevertheless, we recommend the use of OpenSubtitles for intermediate task data, as it not only performs better but is also available for 1782 language pairs.

Amount of intermediate task data
We used a fixed number of examples for the intermediate fine-tuning. We now vary the amount of intermediate task data and study its effect on the downstream task. As seen from   .
As OpenSubtitles is available for 1782 language pairs, we speculate that using these cross-lingual intermediate tasks will be effective for languages where collecting large training datasets for dialogue tasks is not feasible. We speculate that this setup can be useful for cross-lingual domain transfer too, once such a benchmark becomes available for dialogue tasks. We hope that our method can serve as a strong baseline for future work in multilingual dialogue.

A Reproducibility Details
Hyperparameters: All the intermediate fine-tuning models were trained with HuggingFace's transformers library (Wolf et al., 2020). We followed the guidelines from  to select the hyperparameters. The fine-tuning was carried out for 20 epochs. The batch size was chosen from {4, 8}. The rest of the configuration was kept at the library defaults. For the SUMBT model, the LSTM size was varied over {100, 300}, the learning rate over {1e-4, 1e-5, 5e-5}, and the batch size over {3, 4, 12}. The remaining hyperparameters were kept at the defaults of the original work. The final configurations were chosen based on the joint goal accuracy on the development set. The training was carried out for 100 epochs by default, with a patience of 10 epochs. For the Multilingual WoZ experiments, we followed the hyperparameters listed in Qin et al. (2020). All of our hyperparameters for all the experiments will be made available as config files. We use code from Zhu et al. (2020b) for the SUMBT model and Qin et al. (2020) for the CLCSA model.
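The SUMBT search space and stopping rule above can be written down as a small sketch. It assumes an exhaustive search over the listed values, which the text does not state explicitly, and a simple patience rule on development-set joint goal accuracy; the names `GRID`, `configurations`, and `best_epoch` are hypothetical and are not part of the SUMBT or CLCSA code.

```python
from itertools import product

# Values taken from the description above; a full Cartesian search is an assumption.
GRID = {
    "lstm_size": [100, 300],
    "learning_rate": [1e-4, 1e-5, 5e-5],
    "batch_size": [3, 4, 12],
}

def configurations(grid):
    """Yield one dict per point in the Cartesian product of the grid."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

def best_epoch(dev_jga, max_epochs=100, patience=10):
    """Return (epoch, score) of the best dev JGA, stopping after `patience` epochs without improvement."""
    best, best_at = float("-inf"), -1
    for epoch, score in enumerate(dev_jga[:max_epochs]):
        if score > best:
            best, best_at = score, epoch
        elif epoch - best_at >= patience:
            break  # no improvement for `patience` consecutive epochs
    return best_at, best
```

Under these assumptions the grid contains 2 × 3 × 3 = 18 configurations, each of which would be trained for at most 100 epochs.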
Training details: Intermediate fine-tuning takes approximately 14 hours on an RTX 2080 Ti, training a SUMBT model takes approximately six hours, and the base architecture for Multilingual WoZ takes around three hours. The training hours on a different GPU may vary. The inference time for the SUMBT model on the MultiWoZ dataset is 4 minutes, while that of the Multilingual WoZ model is a minute per language. Intermediate fine-tuning and SUMBT take up the entire memory of the RTX 2080 Ti (approximately 11 GB), and the Multilingual WoZ experiments occupy 7 GB. All the experiments require a single GPU. The mBERT model has approximately 178M parameters. The dialogue state trackers without the mBERT model have approximately 5.2M and 0.1M parameters for the MultiWoZ dataset and Multilingual WoZ dataset respectively.
Dataset details: The dialogue state tracking datasets are available at the code repositories of Zhu et al. (2020b) and Qin et al. (2020) respectively. The OpenSubtitles corpus can be obtained from the corpus website, which is based on the subtitles website. We will release the extracted examples and their variants as well. Please see Table 8 for statistics. While creating the 10% of the labelled target language data, all the domains in the MultiWoZ data were included according to their proportion in the original training data.
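The proportional selection of the 10% target language subset can be sketched as stratified sampling. The sketch assumes each dialogue carries a single `domain` label, which simplifies MultiWoZ where a dialogue may span several domains, and the function name `sample_by_domain` is hypothetical.

```python
import random
from collections import defaultdict

def sample_by_domain(dialogues, fraction, rng):
    """Sample `fraction` of the dialogues from each domain so domain proportions are preserved."""
    by_domain = defaultdict(list)
    for d in dialogues:
        by_domain[d["domain"]].append(d)
    subset = []
    for domain in sorted(by_domain):  # fixed order keeps the sampling reproducible
        group = by_domain[domain]
        n = max(1, round(len(group) * fraction))  # keep at least one dialogue per domain
        subset.extend(rng.sample(group, n))
    return subset

# Toy data: 60 restaurant, 30 hotel and 10 taxi dialogues.
toy = (
    [{"id": i, "domain": "restaurant"} for i in range(60)]
    + [{"id": 60 + i, "domain": "hotel"} for i in range(30)]
    + [{"id": 90 + i, "domain": "taxi"} for i in range(10)]
)
subset = sample_by_domain(toy, 0.10, random.Random(0))
```

Sampling within each domain rather than over the whole pool guarantees that small domains are still represented in the reduced training set.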

B Utterance vs. Dialogue history for Multilingual WoZ
We report the importance of using dialogue history in Table 9.

C Qualitative Examples
In Table 10, the first example demonstrates how TLM can identify named entities such as names of restaurants that the baseline could not predict. Similarly, the baseline has a higher error rate detecting the dialogue states with numbers, as seen in examples one and two. The third example is a continuation of the conversation in the second example. Note that the baseline model is now capable of predicting all the new dialogue states in this example. But it is penalized as it could not predict the train-arriveby state at the start of the conversation leading to cascading of errors.