Non-Parametric Unsupervised Domain Adaptation for Neural Machine Translation

Recently, $k$NN-MT has shown the promising capability of directly combining a pre-trained neural machine translation (NMT) model with domain-specific token-level $k$-nearest-neighbor ($k$NN) retrieval to achieve domain adaptation without retraining. Despite being conceptually attractive, it heavily relies on high-quality in-domain parallel corpora, limiting its applicability to unsupervised domain adaptation, where in-domain parallel corpora are scarce or nonexistent. In this paper, we propose a novel framework that directly uses in-domain monolingual sentences in the target language to construct an effective datastore for $k$-nearest-neighbor retrieval. To this end, we first introduce an autoencoder task based on the target language, and then insert lightweight adapters into the original NMT model to map the token-level representations of this task to the ideal representations of the translation task. Experiments on multi-domain datasets demonstrate that our proposed approach significantly improves translation accuracy with target-side monolingual data, while achieving performance comparable to back-translation.


Introduction
Non-parametric methods (Gu et al., 2018; Zhang et al., 2018a; Bapna and Firat, 2019a; Khandelwal et al., 2020; Zheng et al., 2021) have recently been successfully applied to neural machine translation (NMT). These approaches complement advanced NMT models (Sutskever et al., 2014; Bahdanau et al., 2015; Vaswani et al., 2017; Hassan et al., 2018) with external memory to alleviate the performance degradation when translating out-of-domain sentences, rare words (Koehn and Knowles, 2017), etc. Among them, kNN-MT (Khandelwal et al., 2020) is a simple yet effective non-parametric method using nearest neighbor retrieval. More specifically, kNN-MT equips a pre-trained NMT model with a kNN classifier over a provided datastore of cached context representations and corresponding target tokens to improve translation accuracy without retraining. This promising ability to access any provided datastore or external knowledge during inference makes it expressive, adaptable, and interpretable.
Despite the potential benefits, kNN-MT requires large-scale in-domain parallel corpora to achieve domain adaptation. In practice, however, it is not realistic to collect large amounts of high-quality parallel data in every domain we are interested in. Since monolingual in-domain data is usually abundant and easy to obtain, it is essential to explore the capability of kNN-MT in the unsupervised domain adaptation scenario, which utilizes large amounts of monolingual in-domain data. One straightforward and effective solution for unsupervised domain adaptation is to build in-domain synthetic parallel data via back-translation of monolingual target sentences (Sennrich et al., 2016a; Zhang et al., 2018b; Dou et al., 2019). Although this approach has proven effective in exploiting monolingual data, applying it to kNN-MT requires an additional reverse model and brings the extra cost of generating back-translations, making the adaptation of kNN-MT more complicated and time-consuming in practice.

In this paper, we propose a novel Unsupervised Domain Adaptation framework based on kNN-MT (UDA-kNN). UDA-kNN aims at directly leveraging monolingual target-side data to generate the corresponding datastore, and encouraging it to play a similar role to real bilingual in-domain data, through a carefully designed architecture and loss function. Specifically, we introduce an autoencoder task based on the target language to enable datastore construction with monolingual data.
Then we incorporate lightweight adapters into the encoder part of the pre-trained NMT model to make the decoder's representation in the autoencoder task close to the corresponding representation in the translation task. In this way, the adapter module implicitly learns the semantic mapping from the target language to the source language in feature space to construct an effective in-domain datastore, while saving the extra cost of generating synthetic data via back-translation.
We evaluate the proposed approach on multi-domain datasets covering the IT, Medical, Koran, and Law domains. Experimental results show that when using target-side monolingual data, our proposed approach obtains a 4.9 BLEU improvement on average and even achieves performance similar to back-translation.

Background
In this section, we give a brief introduction to the domain adaptation of kNN-MT. In general, the process includes two steps: creating an in-domain datastore and decoding with retrieval on it.
In-domain Datastore Creation. Given a pre-trained general-domain NMT model and an in-domain parallel corpus, kNN-MT utilizes the model to forward pass the corpus to create a datastore. Formally, for each bilingual sentence pair in the corpus $(x, y) \in (\mathcal{X}, \mathcal{Y})$, the NMT model generates a context representation $h(x, y_{<t})$ for each target-side token $y_t$. Then, the datastore is constructed by collecting the representations and corresponding tokens as keys and values respectively:

$$(\mathcal{K}, \mathcal{V}) = \{(h(x, y_{<t}),\, y_t), \ \forall y_t \in y \mid (x, y) \in (\mathcal{X}, \mathcal{Y})\} \quad (1)$$

Decoding with Retrieval. At each decoding step $t$, the NMT model first generates a representation $h(x, \hat{y}_{<t})$ for the current translation context, which consists of the source side $x$ and the generated target-side tokens $\hat{y}_{<t}$. Then, the representation is used to query the in-domain datastore for the $k$ nearest neighbors, denoted as $\mathcal{N} = \{(h_i, v_i),\, i \in \{1, 2, ..., k\}\}$. These neighbors are utilized to form a distribution over the vocabulary:

$$p_{\text{kNN}}(y_t \mid x, \hat{y}_{<t}) \propto \sum_{(h_i, v_i) \in \mathcal{N}} \mathbb{1}_{y_t = v_i} \exp\left(\frac{-d\big(h_i,\, h(x, \hat{y}_{<t})\big)}{T}\right) \quad (2)$$

where $T$ is the temperature and $d(\cdot, \cdot)$ indicates the squared Euclidean distance. The final probability of predicting the next token $y_t$ is an interpolation of two distributions with a hyper-parameter $\lambda$:

$$p(y_t \mid x, \hat{y}_{<t}) = \lambda\, p_{\text{kNN}}(y_t \mid x, \hat{y}_{<t}) + (1 - \lambda)\, p_{\text{NMT}}(y_t \mid x, \hat{y}_{<t}) \quad (3)$$

where $p_{\text{NMT}}$ indicates the general-domain NMT prediction and $p_{\text{kNN}}$ represents the in-domain retrieval-based prediction.
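For concreteness, the sketch below shows one decoding step of the retrieval-and-interpolation procedure in Equations (2) and (3). It assumes the datastore keys are already indexed with Faiss and that the values are stored as an aligned tensor of target tokens; all names and the tensor layout are illustrative assumptions, not the authors' released implementation.

```python
import torch

def knn_interpolated_probs(h, index, values, p_nmt, k=16, T=10.0, lam=0.5):
    """One decoding step of kNN-MT retrieval and interpolation (Eqs. (2)-(3)).

    h:      [dim] float32 decoder context representation h(x, y_<t)
    index:  a Faiss index over the datastore keys (squared-L2 metric)
    values: LongTensor [datastore_size] of target tokens aligned with the keys
    p_nmt:  [vocab_size] probabilities from the general-domain NMT model
    """
    # Retrieve the k nearest neighbors of the current context representation.
    dists, ids = index.search(h.cpu().numpy().reshape(1, -1), k)  # squared L2 distances
    dists = torch.from_numpy(dists[0])                  # [k]
    neighbor_tokens = values[torch.from_numpy(ids[0])]  # [k]

    # Equation (2): softmax over negative distances scaled by temperature T,
    # aggregating the mass of neighbors that share the same target token.
    weights = torch.softmax(-dists / T, dim=-1)
    p_knn = torch.zeros_like(p_nmt)
    p_knn.scatter_add_(0, neighbor_tokens, weights)

    # Equation (3): interpolate retrieval-based and model-based distributions.
    return lam * p_knn + (1.0 - lam) * p_nmt
```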

Unsupervised Domain Adaptation with kNN-MT
Although Khandelwal et al. (2020) have shown the capability of kNN-MT for domain adaptation, the datastore creation heavily relies on high-quality in-domain parallel data, which cannot always be satisfied in practice. As in-domain monolingual data is usually abundant and easy to obtain, it is essential to extend the capability of kNN-MT to unsupervised domain adaptation that merely uses large amounts of in-domain target sentences.

In this paper, we design a novel non-parametric Unsupervised Domain Adaptation framework based on kNN-MT (UDA-kNN) to fully leverage in-domain target-side monolingual data. The overall framework of UDA-kNN is illustrated in Figure 1. UDA-kNN starts with an autoencoder task based on the target language y, where the target side is simply copied to the source side to generate the pair (y, y). Based on that, UDA-kNN aims to make the decoder's representation in the autoencoder task close to the ideal representation in the translation task. In this way, we can directly leverage the autoencoder structure and in-domain target sentences to construct a datastore for k-nearest-neighbor retrieval that is similar to one built from real in-domain bilingual data. Next, we introduce the architecture and training objective of our proposed method in detail.
Architecture. We insert lightweight adapter layers (Houlsby et al., 2019; Guo et al., 2020) into the source embedding layer and each encoder layer of the pre-trained NMT model to perform the autoencoder task, so that our method adds only a small number of parameters. Specifically, we simply construct the adapter layer with layer normalization and two feed-forward networks with a non-linearity between them:

$$H_O = H + W_2\, \mathrm{ReLU}(W_1\, \mathrm{LN}(H)) \quad (4)$$

where $H$ and $H_O$ are the input and output hidden states of the adapter layer respectively, $\mathrm{LN}$ indicates layer normalization, and $W_1$ and $W_2$ are the parameters of the two feed-forward networks.

Training. UDA-kNN is designed to leverage monolingual target-side data to generate a datastore that plays a similar role to one built from real in-domain bilingual data. We achieve this by leveraging out-of-domain bilingual data $(\mathcal{X}, \mathcal{Y})$. More specifically, given a bilingual sentence pair $(x, y) \in (\mathcal{X}, \mathcal{Y})$, the original NMT model generates a decoder representation $h(x, y_{<t})$ for each target token $y_t$. Meanwhile, with the target-copied pair $(y, y)$, the NMT model with adapters generates another representation for each $y_t$, denoted as $h_{\theta}(y, y_{<t})$. We take the end-to-end paradigm and directly optimize the adapter layers by minimizing the squared Euclidean distance between the two sets of decoder representations:

$$\mathcal{L}(\theta) = \sum_{(x, y) \in (\mathcal{X}, \mathcal{Y})} \sum_{t=1}^{|y|} \big\| h(x, y_{<t}) - h_{\theta}(y, y_{<t}) \big\|_2^2 \quad (5)$$

where $\theta$ denotes the parameters of all adapter layers. Note that we keep the original parameters of the pre-trained NMT model fixed during training to avoid performance degradation of the NMT model at inference time.
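As a rough illustration of the architecture and training objective above, the PyTorch sketch below implements an adapter layer (layer normalization followed by two feed-forward layers with a ReLU in between, plus a residual connection assumed here) and the representation-matching loss of Equation (5); the module structure and names are our assumptions rather than the authors' code.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Lightweight adapter: LayerNorm + two feed-forward layers with a non-linearity."""
    def __init__(self, d_model=1024, d_hidden=1024):
        super().__init__()
        self.ln = nn.LayerNorm(d_model)
        self.w1 = nn.Linear(d_model, d_hidden)
        self.w2 = nn.Linear(d_hidden, d_model)

    def forward(self, h):
        # Residual connection around the bottleneck (an assumption; the text only
        # states LN and two FFNs with a non-linearity between them).
        return h + self.w2(torch.relu(self.w1(self.ln(h))))


def representation_matching_loss(h_translation, h_autoencoder):
    """Squared Euclidean distance between decoder representations of the
    translation task (from (x, y), kept fixed) and the autoencoder task with
    adapters (from (y, y)). Only the adapter parameters receive gradients."""
    return ((h_translation.detach() - h_autoencoder) ** 2).sum(dim=-1).mean()
```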
Prediction. For unsupervised domain adaptation, given the domain-specific target-side monolingual data, we first copy the target sentences to the source side to generate synthetic bilingual pairs. Then the pre-trained NMT model with adapter layers forward passes these pairs to create an in-domain datastore. When translating in-domain sentences, we utilize the original NMT model and kNN retrieval on the in-domain datastore to perform online domain adaptation following Equation (3).
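A minimal sketch of this datastore-construction step is given below, assuming a model interface that returns per-token decoder states for a given (source, target) pair; the function and argument names are hypothetical.

```python
import torch

@torch.no_grad()
def build_monolingual_datastore(adapted_model, target_sentences, encode_fn):
    """Build (key, value) pairs from in-domain target-side monolingual data.

    Each target sentence y is copied to the source side to form a synthetic pair
    (y, y); the NMT model with adapters forward-passes the pair, and the decoder
    context representation before each target token becomes a key, with the token
    itself as the corresponding value.

    adapted_model is assumed to be a callable returning per-token decoder states
    of shape [len, dim]; in a fairseq-style model, prev_output_tokens would be the
    target shifted right with BOS (omitted here for brevity).
    """
    keys, values = [], []
    for sentence in target_sentences:
        tokens = encode_fn(sentence)                 # LongTensor [len] of target token ids
        decoder_states = adapted_model(src_tokens=tokens,
                                       prev_output_tokens=tokens)  # [len, dim]
        keys.append(decoder_states)
        values.append(tokens)
    return torch.cat(keys, dim=0), torch.cat(values, dim=0)
```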

Setup
Datasets and Evaluation Metric. We use the same multi-domain dataset as Aharoni and Goldberg (2020) to evaluate the effectiveness of our proposed model and consider the IT, Medical, Koran, and Law domains in our experiments. We extract the target-side data of the training sets to perform unsupervised domain adaptation, while keeping the dev and test sets unchanged. Besides, WMT'19 News data 1 is used for training the adapters in our method as well as the reverse translation model for back-translation. The sentence statistics of all datasets are shown in Table 1. The Moses toolkit 2 is used to tokenize the sentences, and we split the words into subword units (Sennrich et al., 2016b) with the codes provided by the pre-trained model (Ng et al., 2019). We use SacreBLEU 3 to measure all results with case-sensitive detokenized BLEU (Papineni et al., 2002).

Methods. We compare our proposed approach with several baselines:
• Basic NMT: the general-domain model is directly evaluated on the in-domain test sets.
• Empty-kNN: the source side of the synthetic bilingual data is always set to the <EOS> token.
• Copy-kNN: each target sentence is copied to the source side to produce synthetic bilingual data; this is a special case of our method without model training.
• BT-kNN: a reverse translation model is applied to produce synthetic bilingual data, which are then used to generate the in-domain datastore.
• Parallel-kNN: ground-truth parallel data is used to generate the in-domain datastore, which can be regarded as the upper bound of the kNN-retrieval-based methods.
Implementation Details. We use the WMT'19 German-English news translation task winner model (Ng et al., 2019) as our general-domain model. For the introduced adapters, the hidden size is set to 1024, adding only about 6% of the parameters of the original model. Adam (Kingma and Ba, 2015) is used to update the parameters of the adapters. During training, we collect about 40,000 tokens for each batch and schedule the learning rate with the inverse square root decay scheme, in which the warm-up step is set to 4000 and the maximum learning rate is set to 7e-4. Faiss 4 is used to build the in-domain datastore and carry out fast nearest neighbor search. We utilize Faiss to learn 4096 cluster centroids for each domain and search 32 clusters for each target token during decoding. At inference time, we retrieve the 16 nearest neighbors from the datastore. We set the hyper-parameter T to 4 for IT, Medical, and Law, and to 40 for Koran. The λ is tuned on the in-domain dev sets for different methods.
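The Faiss index described above could be built roughly as follows; the exact index type and quantization are not specified here, so an IVF-Flat index is assumed purely for illustration.

```python
import faiss
import numpy as np

def build_faiss_index(keys: np.ndarray, n_centroids=4096, n_probe=32):
    """Build an inverted-file Faiss index over the datastore keys.

    keys: float32 array of shape [datastore_size, dim] (decoder representations).
    Only the 4096 learned centroids and 32 probed clusters are given in the text;
    the IVF-Flat index type is an assumption for this sketch.
    """
    dim = keys.shape[1]
    quantizer = faiss.IndexFlatL2(dim)                              # coarse quantizer (squared L2)
    index = faiss.IndexIVFFlat(quantizer, dim, n_centroids, faiss.METRIC_L2)
    index.train(keys)                                               # learn the cluster centroids
    index.add(keys)
    index.nprobe = n_probe                                          # clusters searched per query
    return index

# At decoding time, retrieve the 16 nearest neighbors for a batch of queries:
# distances, ids = index.search(queries, 16)
```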

Main Results
The adaptation performance of the different methods is listed in Table 2. Our method significantly improves translation accuracy on the in-domain test sets compared to basic NMT, while Empty-kNN and Copy-kNN cannot. This demonstrates the effectiveness of our proposed method in creating an in-domain datastore from monolingual data alone. Besides, we observe that our method achieves performance comparable to BT-kNN, while completely avoiding the reverse model and the extra time cost of generating synthetic data, making the adaptation much faster and simpler.

Analysis
In this section, we would like to further explore the reasons behind the success of our approach.
Similarity Measurement. We measure the cosine similarity and squared Euclidean distance between the synthetic representations generated by our method and the ideal representations generated using ground-truth parallel data. Table 3 also lists the results of BT-kNN and Copy-kNN for comparison. We observe that even without the source-language information, our UDA-kNN generates representations that are as close to the ideal ones as those of BT-kNN, leading to effective in-domain retrieval for kNN-MT.
It also verifies the effectiveness of the adapter layers in directly learning the semantic mapping from the target language to the source language in feature space.

Table 3: Cosine similarity / squared Euclidean distance between the ground-truth representations and those generated by different methods.
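The measurement itself is straightforward; a small sketch, assuming both sets of representations are stored as token-aligned tensors:

```python
import torch
import torch.nn.functional as F

def compare_representations(h_synthetic, h_ideal):
    """Average cosine similarity and squared Euclidean distance between synthetic
    datastore keys and the aligned ground-truth keys.
    Both inputs: [num_tokens, dim], aligned token by token."""
    cos = F.cosine_similarity(h_synthetic, h_ideal, dim=-1).mean().item()
    l2 = ((h_synthetic - h_ideal) ** 2).sum(dim=-1).mean().item()
    return cos, l2
```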
Visualization. We also collect and visualize the representations associated with the same target tokens in different datastores to give intuitive insights into the impact of the adapters. Specifically, we select three common words in the IT domain and show the results in Figure 2. The representations generated with Copy-kNN tend to gather in small areas, which leads to retrieval collapse when encountering diverse translation contexts. With the adapters, however, the distribution of representations for the same label in the datastore becomes closer to that generated with bilingual pairs, improving retrieval effectiveness.
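A visualization like Figure 2 can be produced with a standard dimensionality-reduction toolkit; the sketch below assumes the per-token representations have already been extracted from each datastore and uses t-SNE, which is an assumption about the projection method.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_token_representations(reps_by_method: dict, token: str):
    """Project the datastore representations of one target token (e.g. a common
    IT-domain word) from several datastores into 2-D and overlay them.
    reps_by_method maps a method name (e.g. 'Copy-kNN', 'UDA-kNN', 'Parallel-kNN')
    to a float32 array of shape [num_occurrences, dim]."""
    all_reps = np.concatenate(list(reps_by_method.values()), axis=0)
    points = TSNE(n_components=2, init="pca").fit_transform(all_reps)

    start = 0
    for name, reps in reps_by_method.items():
        end = start + len(reps)
        plt.scatter(points[start:end, 0], points[start:end, 1], s=4, label=name)
        start = end
    plt.legend()
    plt.title(f"Representations of '{token}' in different datastores")
    plt.show()
```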
Effect of Adapter Position. In our proposed method, we only insert adapters into the encoder side, as we would like to modify the encoding function of y so that y is encoded into the same feature space as the semantically identical x. We also compare our choice with the common practice (Bapna and Firat, 2019b; Guo et al., 2020) of inserting adapters into the decoder side, with results shown in Table 4. We observe that adapters on the decoder side play only a very limited role, which supports the motivation of our approach.

Comparison with Fine-tuning Strategy. We compare our method with BT-FT, where the back-translation data is used to fine-tune the full NMT model. As shown in Table 5, the fine-tuning method easily causes the catastrophic forgetting problem (Thompson et al., 2019) and results in performance degradation, especially when the data contains noise.

Conclusion
In this paper, we present UDA-kNN, a simple yet effective framework that directly utilizes monolingual data to construct an in-domain datastore for unsupervised domain adaptation of kNN-MT. Experimental results verify that our method obtains significant improvements with target-side monolingual data. Our approach also achieves performance comparable to the BT-based method, while saving the extra cost of generating back-translations.