Neural Machine Translation with Monolingual Translation Memory

Prior work has shown that Translation Memory (TM) can boost the performance of Neural Machine Translation (NMT). In contrast to existing work that uses a bilingual corpus as TM and employs source-side similarity search for memory retrieval, we propose a new framework that uses monolingual memory and performs learnable memory retrieval in a cross-lingual manner. Our framework has two unique advantages. First, the cross-lingual memory retriever allows abundant monolingual data to serve as TM. Second, the memory retriever and NMT model can be jointly optimized for the ultimate translation goal. Experiments show that the proposed method obtains substantial improvements. Remarkably, it even outperforms strong TM-augmented NMT baselines using bilingual TM. Owing to the ability to leverage monolingual data, our model also demonstrates effectiveness in low-resource and domain adaptation scenarios.


Introduction
Augmenting parametric neural network models with non-parametric memory (Khandelwal et al., 2019; Guu et al., 2020; Lewis et al., 2020a,b) has recently emerged as a promising direction to relieve the demand for ever-larger model size (Devlin et al., 2019; Radford et al., 2019; Brown et al., 2020). For the task of Machine Translation (MT), inspired by the Computer-Aided Translation (CAT) tools that professional human translators have used for decades to increase productivity (Yamada, 2011), the usefulness of Translation Memory (TM) has long been recognized. In general, TM is a database that stores pairs of source texts and their corresponding translations. As in human translation, early work (Koehn and Senellart, 2010; He et al., 2010; Utiyama et al., 2011; Wang et al., 2013, inter alia) presents translations for similar source input to statistical translation models as additional cues.
Recent work has confirmed that TM can help Neural Machine Translation (NMT) models as well. In a similar spirit to prior work, TM-augmented NMT models do not discard the training corpus after training but keep exploiting it at test time. These models perform translation in two stages. In the retrieval stage, a retriever searches for nearest neighbors (i.e., source-target pairs) from the training corpus based on source-side similarity such as lexical overlaps (Gu et al., 2018; Zhang et al., 2018; Xia et al., 2019), embedding-based matches (Cao and Xiong, 2018), or a hybrid (Bulte and Tezcan, 2019; Xu et al., 2020). In the generation stage, the retrieved translations are injected into a standard NMT model by attending over them with sophisticated memory networks (Gu et al., 2018; Cao and Xiong, 2018; Xia et al., 2019; He et al., 2021), by directly concatenating them to the source input (Bulte and Tezcan, 2019; Xu et al., 2020), or by biasing the word distribution during decoding (Zhang et al., 2018). Most recently, Khandelwal et al. (2020) propose a token-level nearest neighbor search using complete translation context, i.e., both the source-side input and target-side prefix.
Despite their differences, we identify two major limitations in previous research. First, the translation memory has to be a bilingual corpus consisting of aligned source-target pairs. This requirement limits the memory bank to bilingual pairs and precludes the use of abundant monolingual data, which can be especially helpful for low-resource scenarios. Second, the memory retriever is non-learnable, not end-to-end optimized, and lacks the ability to adapt to specific downstream NMT models. Concretely, current retrieval mechanisms (e.g., BM25) are generic similarity searches that adopt a simple heuristic: the more a source sentence overlaps with the input sentence, the more likely its target-side translation pieces will appear in the correct translation. Although this observation is true, the most similar sentence is not necessarily the most useful one for NMT models. Ideally, the retrieval metric would be learned from the data in a task-dependent way: we wish to consider a memory only if it can indeed boost the quality of the final translation.
In this work, we propose to augment NMT models with monolingual TM and a learnable cross-lingual memory retriever. Specifically, we align source-side sentences and the corresponding target-side translations in a latent vector space using a simple dual-encoder framework (Bromley et al., 1993), such that the distance in the latent space yields a score function for retrieval. As a result, our memory retriever directly connects the dots between the source-side input and target-side translations, enabling monolingual data in the target language to be used alone as TM. Before running each translation, the memory retriever selects the highest-scored memories from a large collection of monolingual sentences (TM), which may include but are not limited to the target side of the training corpus, and then the downstream NMT model attends over those memories to help inform its translation. We design the memory retriever with differentiable neural networks. To unify the memory retriever and its downstream NMT model into a learnable whole, the retrieval scores are used to bias the attention scores toward the most useful retrieved memories. In this way, our memory retrieval can be end-to-end optimized for the translation objective: a retrieval that improves the golden translation's likelihood is helpful and should be rewarded, while an uninformative retrieval should be penalized.
One challenge in training our proposed framework is that, when starting from random initialization, the retrieved memories will likely be totally unrelated to the input. Since the memory retriever then exerts no positive influence on the NMT model's performance, it cannot receive a meaningful gradient and improve. This in turn causes the NMT model to learn to ignore all retrieved memories. To avoid this cold-start problem, we propose to warm-start the retrieval model using two cross-alignment tasks.
Experiments show that (1) Our model leads to significant improvements over the non-TM baseline NMT model, even outperforming strong TM-augmented baselines. This is remarkable given that previous TM-augmented models rely on bilingual TM while our model only exploits the target side.
(2) Our model can substantially boost translation quality in low-resource scenarios by utilizing extra monolingual TM that is not present in training pairs.
(3) Our model gains a strong cross-domain transferability by hot-swapping domain-specific monolingual memory.

Related Work
TM-augmented NMT This work contributes primarily to the research line of Translation Memory (TM) augmented Neural Machine Translation (NMT). Feng et al. (2017) augmented NMT with a bilingual dictionary to tackle infrequent word translation. Gu et al. (2018) proposed a model that retrieves examples similar to the test source sentence and encodes retrieved source-target pairs with key-value memory networks. Cao and Xiong (2018) and Cao et al. (2019) used a gating mechanism to balance the impact of the translation memory. Zhang et al. (2018) proposed guiding models by retrieving n-grams and up-weighting the probabilities of retrieved n-grams. Bulte and Tezcan (2019) and Xu et al. (2020) used fuzzy matching with translation memories and augmented source sequences with retrieved source-target pairs. Xia et al. (2019) directly ignored the source side of a TM and packed the target side into a compact graph. Khandelwal et al. (2020) ran an existing translation model on large bi-text corpora and recorded all hidden states for later nearest neighbor search at each decoding step, which is very compute-intensive. Our work differs from prior work in two ways: (1) The TM in our framework is a collection of monolingual sentences rather than bilingual sentence pairs; (2) We use learnable task-specific retrieval rather than generic retrieval mechanisms.
Retrieval for Text Generation Discrete retrieval as an intermediate step has been shown to be beneficial to a variety of natural language processing tasks. One typical use is to retrieve supporting evidence for open-domain question answering (e.g., Chen et al., 2017; Karpukhin et al., 2020). Recently, retrieval-guided generation has gained increasing interest in a wide range of text generation tasks such as language modeling (Khandelwal et al., 2019; Guu et al., 2020), dialogue response generation (Weston et al., 2018;  2019a,b), code generation  and other knowledge-intensive generation (Lewis et al., 2020b). There is a visible shift from using off-the-shelf search engines to learning task-specific retrievers. Our work draws inspiration from this line of research. However, retrieval-guided generation has so far been mainly investigated for knowledge retrieval in the same language. The memory retrieval in this work is more challenging due to the cross-lingual setting.
NMT using Monolingual Data To our knowledge, the integration of monolingual data for NMT was first investigated by Gulcehre et al. (2015), who separately trained target-side language models using monolingual data, and then integrated them during decoding either through re-scoring the beam, or by feeding the hidden state of the language model to the NMT model. Jean et al. (2015) also explored re-ranking the NMT output with an n-gram language model. Another successful method for leveraging monolingual data in NMT is back-translation (Sennrich et al., 2016; Fadaee et al., 2017; Edunov et al., 2018; He et al., 2016), where a reverse translation model is used to translate monolingual sentences from the target language to the source language to generate synthetic parallel sentences. Recent studies (Jiao et al., 2021; He et al., 2019) showed that self-training, where the synthetic parallel sentences are created by translating monolingual sentences in the source language, is also helpful. Our method is orthogonal to previous work and bears a unique feature: it can use more monolingual data without re-training (see §4.3).

Proposed Approach
We start by formalizing the translation task as a retrieve-then-generate process in §3.1. Then in §3.2, we describe the model design for the cross-lingual memory retrieval model. In §3.3, we describe the model design for the memory-augmented translation model. Lastly, we show how to optimize the two components jointly using standard maximum likelihood training in §3.4 and therein we address the cold-start problem via cross-alignment pre-training.

Overview
Our approach decomposes the whole translation process into two steps: retrieve, then generate. The overall framework is illustrated in Figure 1. The Translation Memory (TM) in our approach is a collection $Z$ of sentences in the target language. Given an input $x$ in the source language, the retrieval model first selects a number of possibly helpful sentences $\{z_i\}_{i=1}^{M}$ from $Z$, where $M \ll |Z|$, according to a relevance function $f(x, z_i)$. Then, the translation model conditions on both the retrieved set $\{(z_i, f(x, z_i))\}_{i=1}^{M}$ and the original input $x$ to generate the output $y$ using a probabilistic model $p(y \mid x, z_1, f(x, z_1), \ldots, z_M, f(x, z_M))$. Note that the relevance scores $\{f(x, z_i)\}_{i=1}^{M}$ are also part of the input to the translation model, encouraging the translation model to focus more on more relevant sentences. During training, maximizing the likelihood of the translation references improves both the translation model and the retrieval model.

Retrieval Model
The retrieval model is responsible for selecting the most relevant sentences for a source sentence from a large monolingual TM. This could involve measuring the relevance scores between the source sentence and millions of candidate target sentences, which poses a serious computational challenge. To address this, we implement the retrieval model using a simple dual-encoder framework (Bromley et al., 1993) such that the selection of the most relevant sentences can be reduced to Maximum Inner Product Search (MIPS). With performant data structures and search algorithms (e.g., Shrivastava and Li, 2014;Malkov and Yashunin, 2018), the retrieval can be done efficiently.
Specifically, we define the relevance score $f(x, z)$ between the source sentence $x$ and the candidate sentence $z$ as the dot product of their dense vector representations:
$$f(x, z) = E_{src}(x)^{\top} E_{tgt}(z),$$
where $E_{src}$ and $E_{tgt}$ are the source sentence encoder and the target sentence encoder that map $x$ and $z$ to $d$-dimensional vectors respectively. We implement the two sentence encoders using two independent Transformers (Vaswani et al., 2017). For an input sentence, we prepend the [BOS] token to its token sequence and then feed it into a Transformer. We take the representation at the [BOS] token as the output (denoted $\mathrm{Trans}_{\{src,tgt\}}(\{x, z\})$), and perform a linear projection ($W_{\{src,tgt\}}$) to reduce the dimensionality of the vector. Finally, we normalize the vectors to regulate the range of relevance scores:
$$E_{src}(x) = \mathrm{normalize}(W_{src}\,\mathrm{Trans}_{src}(x)), \qquad E_{tgt}(z) = \mathrm{normalize}(W_{tgt}\,\mathrm{Trans}_{tgt}(z)).$$
The normalized vectors have zero means and unit lengths. Therefore, the relevance scores always fall in the interval [−1, 1]. We let θ denote all parameters associated with the retrieval model.
In practice, the dense representations of all sentences in TM can be pre-computed and indexed using FAISS (Johnson et al., 2019), an open-source toolkit for efficient vector search. Given a source sentence x in hand, we compute the vector representation v x = E src (x) and retrieve the top M target sentences with vectors closest to v x .
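The retrieval step described above can be illustrated with a small sketch. It is not the authors' implementation: the toy vectors stand in for the encoder outputs E src (x) and E tgt (z), the function names are hypothetical, the normalization here only enforces unit length, and the exhaustive scan replaces the approximate FAISS index purely for readability.

```python
import math


def normalize(v):
    """Scale a vector to unit length (assumption: no mean-centering here)."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def relevance(src_vec, tgt_vec):
    """Dot product of two unit vectors, so the score lies in [-1, 1]."""
    return sum(a * b for a, b in zip(src_vec, tgt_vec))


def retrieve_top_m(src_vec, tm_vecs, m):
    """Exhaustive maximum inner product search over pre-computed TM vectors.

    Returns (index, score) pairs sorted by decreasing score.
    """
    scored = [(relevance(src_vec, z), i) for i, z in enumerate(tm_vecs)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(i, s) for s, i in scored[:m]]


# Toy stand-ins for the pre-computed target-side vectors of four TM sentences.
tm_vecs = [normalize(v) for v in ([1.0, 0.0], [0.6, 0.8], [-1.0, 0.0], [0.0, 1.0])]
# Toy stand-in for the source-side vector of the input sentence.
query = normalize([0.9, 0.1])
top = retrieve_top_m(query, tm_vecs, m=2)
```

Because the TM vectors are computed once and only the query vector changes per input, swapping in a different list of TM vectors changes what can be retrieved without touching the scoring function.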

Translation Model
Given a source sentence $x$, a small set of relevant TM sentences $\{z_i\}_{i=1}^{M}$, and relevance scores $\{f(x, z_i)\}_{i=1}^{M}$, the translation model defines the conditional probability $p(y \mid x, z_1, f(x, z_1), \ldots, z_M, f(x, z_M))$.
Our translation model is built upon the standard encoder-decoder NMT model (Bahdanau et al., 2015; Vaswani et al., 2017): the (source) encoder transforms the source sentence $x$ into dense vector representations. The decoder generates an output sequence $y$ in an auto-regressive fashion. At each time step $t$, the decoder attends over both the previously generated sequence $y_{1:t-1}$ and the output of the source encoder, generating a hidden state $h_t$. The hidden state $h_t$ is then converted to next-token probabilities through a linear projection followed by a softmax function, i.e.,
$$P_v = \mathrm{softmax}(W_v h_t + b_v),$$
where $W_v$ and $b_v$ are trainable parameters.
To accommodate the extra memory input, we extend the standard encoder-decoder NMT framework with a memory encoder and allow cross-attention from the decoder to the memory encoder. Specifically, the memory encoder encodes each TM sentence $z_i$ individually, resulting in a set of contextualized token embeddings $\{z_{i,k}\}_{k=1}^{L_i}$, where $L_i$ is the length of the token sequence $z_i$. We compute a cross attention over all TM sentences:
$$\alpha_{ij} = \frac{\exp(h_t^{\top} W_m z_{i,j})}{\sum_{i'=1}^{M} \sum_{k=1}^{L_{i'}} \exp(h_t^{\top} W_m z_{i',k})}, \qquad (1)$$
$$c_t = W_c \sum_{i=1}^{M} \sum_{j=1}^{L_i} \alpha_{ij} z_{i,j},$$
where $\alpha_{ij}$ is the attention score of the $j$-th token in $z_i$, $c_t$ is a weighted combination of memory embeddings, and $W_m$ and $W_c$ are trainable matrices.
The cross attention is used twice during decoding. First, the decoder's hidden state $h_t$ is updated by a weighted sum of memory embeddings, i.e., $h_t = h_t + c_t$. Second, we consider each attention score as a probability of copying the corresponding token (Gu et al., 2016; See et al., 2017). Formally, the next-token probabilities are computed as:
$$p(y_t \mid \cdot) = (1 - \lambda_t)\, P_v(y_t) + \lambda_t \sum_{i=1}^{M} \sum_{j=1}^{L_i} \alpha_{ij}\, \mathbb{1}_{z_{i,j} = y_t},$$
where $\mathbb{1}$ is the indicator function and $\lambda_t$ is a gating variable computed by another feed-forward network, $\lambda_t = g(h_t, c_t)$.
Inspired by Lewis et al. (2020a), to enable the gradient flow from the translation output to the retrieval model, we bias the attention scores with the relevance scores, rewriting Eq. (1) as:
$$\alpha_{ij} = \frac{\exp(h_t^{\top} W_m z_{i,j} + \beta f(x, z_i))}{\sum_{i'=1}^{M} \sum_{k=1}^{L_{i'}} \exp(h_t^{\top} W_m z_{i',k} + \beta f(x, z_{i'}))}, \qquad (2)$$
where $\beta$ is a trainable scalar that controls the weight of the relevance scores.
We let φ denote all parameters associated with the translation model.
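To make the role of the relevance scores concrete, the following sketch illustrates the two mechanisms described above: attention over memory tokens biased by β times the sentence-level relevance score, and a gated mixture of generating from the vocabulary and copying from memory. It is an illustration under stated assumptions, not the authors' code: the raw attention logits (standing in for the bilinear term between h t and each memory token embedding), the gate value, and the function names are all hypothetical inputs chosen for the toy example.

```python
import math


def biased_attention(logits, relevance, beta):
    """Softmax over every token of every retrieved sentence.

    logits[i][j] stands in for the decoder-to-memory score of token j in
    sentence i; relevance[i] shifts all tokens of sentence i by beta * f.
    """
    shifted = [[l + beta * relevance[i] for l in row] for i, row in enumerate(logits)]
    peak = max(max(row) for row in shifted)  # subtract max for numerical stability
    exps = [[math.exp(s - peak) for s in row] for row in shifted]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]


def next_token_probs(p_vocab, alpha, memory_tokens, gate):
    """Mix the generation distribution with the copy distribution.

    gate plays the role of the gating variable: weight `gate` goes to
    copying, weight `1 - gate` to the ordinary vocabulary softmax.
    """
    mixed = {w: (1.0 - gate) * p for w, p in p_vocab.items()}
    for i, row in enumerate(alpha):
        for j, a in enumerate(row):
            tok = memory_tokens[i][j]
            mixed[tok] = mixed.get(tok, 0.0) + gate * a
    return mixed


# Two retrieved sentences with identical raw logits but different relevance.
memory_tokens = [["the", "law"], ["a", "cat"]]
logits = [[0.0, 0.0], [0.0, 0.0]]
relevance = [0.9, 0.1]
alpha = biased_attention(logits, relevance, beta=2.0)
p_vocab = {"the": 0.5, "law": 0.2, "cat": 0.3}
probs = next_token_probs(p_vocab, alpha, memory_tokens, gate=0.4)
```

With equal raw logits, the tokens of the more relevant sentence receive more attention mass, which is how a gradient on the translation loss can reach the relevance scores and hence the retriever.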

Training
We optimize the model parameters $\theta$ and $\phi$ using stochastic gradient descent on the negative log-likelihood loss function $-\log p(y^{*} \mid x, z_1, f(x, z_1), \ldots, z_M, f(x, z_M))$, where $y^{*}$ refers to the reference translation. As implied by Eq. (2), TM sentences that improve the likelihood of reference translations should receive higher attention scores and higher relevance scores, so gradient descent on the loss function will improve the quality of the retrieval model as well.
Cross-alignment Pre-training However, if the retrieval model starts from random initialization, all top TM sentences $z_i$ will likely be unrelated to $x$ (or equally useless). As a result, the retrieval model cannot receive meaningful gradients and improve, and the translation model will learn to completely ignore the TM input. To avoid this cold-start problem, we propose two cross-alignment tasks to warm-start the retrieval model.
The first task is sentence-level cross-alignment. This task aims to find the right translation for a source sentence given a set of other translations, which is directly related to our retrieval function. Concretely, we sample $B$ source-target pairs from the training corpus at each training step. Let $X$ and $Z$ be the $(B \times d)$ matrices of the source and target vectors encoded by $E_{src}$ and $E_{tgt}$ respectively. $S = XZ^{\top}$ is a $(B \times B)$ matrix of relevance scores, where each row corresponds to a source sentence and each column corresponds to a target sentence. Any $(X_i, Z_j)$ pair should be aligned when $i = j$, and should not be otherwise. The objective is to maximize the scores along the diagonal of the matrix and thereby reduce the values in other entries. The loss function can be written as:
$$\mathcal{L}_{snt}^{(i)} = \frac{-\exp(S_{ii})}{\exp(S_{ii}) + \sum_{j \neq i} \exp(S_{ij})}.$$
The second task is token-level cross-alignment, which aims to predict the tokens in the target language given the source sentence representation and vice versa. Formally, we use bag-of-words losses:
$$\mathcal{L}_{tok}^{(i)} = -\sum_{w_y \in \mathcal{Y}_i} \log p(w_y \mid X_i) - \sum_{w_x \in \mathcal{X}_i} \log p(w_x \mid Z_i),$$
where $\mathcal{X}_i$ ($\mathcal{Y}_i$) represents the set of tokens in the $i$-th source (target) sentence and the token probabilities are computed by a linear projection followed by the softmax function. The joint loss for pre-training is $\frac{1}{B} \sum_{i=1}^{B} \left(\mathcal{L}_{snt}^{(i)} + \mathcal{L}_{tok}^{(i)}\right)$. In practice, we find that both the sentence-level and token-level objectives are crucial for achieving superior performance.
Asynchronous Index Refresh To employ fast MIPS, we must pre-compute $E_{tgt}(z)$ for every $z \in Z$ and build an index. However, the index cannot remain consistent with the running model during training as $\theta$ will be updated over time. One straightforward solution is to fix the parameters of $E_{tgt}$ after the pre-training described above and only fine-tune the parameters of $E_{src}$. However, this may hurt performance since $E_{tgt}$ cannot adapt to the translation objective. Another solution is to asynchronously refresh the index by re-computing and re-indexing all TM sentences at regular intervals. The index is slightly outdated between refreshes; however, we use the fresh $E_{tgt}$ in gradient estimation. We explore both options in our experiments.
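The sentence-level cross-alignment objective described above (raise the diagonal of the in-batch score matrix relative to the off-diagonal entries) can be sketched as follows. This is one plausible instantiation, assumed for illustration: each row of the score matrix is passed through a softmax and the loss is the negative probability assigned to the diagonal entry, averaged over the batch. A negative log-probability variant is also common in contrastive learning. The function names and toy vectors are hypothetical.

```python
import math


def score_matrix(X, Z):
    """Relevance scores S = X Z^T for lists of source and target row vectors."""
    return [[sum(a * b for a, b in zip(x, z)) for z in Z] for x in X]


def sentence_alignment_loss(S):
    """Batch mean of the negative softmax probability on the diagonal.

    Row i is normalized over all targets in the batch, so lowering the
    loss raises S[i][i] relative to the other entries of row i.
    """
    losses = []
    for i, row in enumerate(S):
        peak = max(row)  # subtract max for numerical stability
        exps = [math.exp(s - peak) for s in row]
        losses.append(-exps[i] / sum(exps))
    return sum(losses) / len(losses)


# A batch of two pairs whose source and target vectors agree ...
sources = [[1.0, 0.0], [0.0, 1.0]]
aligned = sentence_alignment_loss(score_matrix(sources, [[1.0, 0.0], [0.0, 1.0]]))
# ... versus the same batch with the targets swapped.
shuffled = sentence_alignment_loss(score_matrix(sources, [[0.0, 1.0], [1.0, 0.0]]))
```

In this toy batch the correctly paired encodings yield a lower loss than the swapped ones, which is the signal that warm-starts the retriever before joint training.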

Experiments
We experiment with the proposed approach in three settings: (1) the conventional setting where the available TM is limited to the bilingual training corpus, (2) the low-resource setting where bilingual training pairs are scarce but extra monolingual data is exploited as additional TM, and (3) non-parametric domain adaptation using monolingual TM. Note that existing TM-augmented NMT models are only applicable to the first setting; the last two settings only become possible with our proposed model. We use BLEU score (Papineni et al., 2002) as the evaluation metric.

Implementation Details
We build our model using Transformer blocks with the same configuration as Transformer Base (Vaswani et al., 2017) (8 attention heads, 512-dimensional hidden state, and 2048-dimensional feed-forward state). The number of Transformer blocks is 3 for the retrieval model, 4 for the memory encoder in the translation model, and 6 for the encoder-decoder architecture in the translation model. We retrieve the top 5 TM sentences. The FAISS index code is "IVF1024_HNSW32,SQ8" and the search depth is 64. We follow the learning rate schedule, dropout and label smoothing settings described in Vaswani et al. (2017). We use the Adam optimizer (Kingma and Ba, 2014) and train models with up to 100K

Conventional Experiments
Following prior work in TM-augmented NMT, we first conduct experiments in a setting where the bilingual training corpus is the only source for TM.
Data We use the JRC-Acquis corpus (Steinberger et al., 2006) for our experiments. The JRC-Acquis corpus contains the total body of European Union (EU) law applicable to the EU member states. This corpus was also used by Gu et al. (2018); Zhang et al. (2018); Xia et al. (2019) and we managed to get the datasets originally preprocessed by Gu et al. (2018), making it possible to fairly compare our results with previously reported BLEU scores. Specifically, we select four translation directions, namely, Spanish⇒English (Es⇒En), En⇒Es, German⇒English (De⇒En), and En⇒De, for evaluation. Detailed data statistics are shown in Table 1.

Models
To study the effect of each model component, we implement a series of model variants (model #1 to #5 in Table 2).
1. NMT without TM. To measure the help from TM, we remove the model components related to TM (including the retrieval model and the memory encoder), and only employ the encoder-decoder architecture for NMT. The resulting model is equivalent to the Transformer Base model (Vaswani et al., 2017).
2. TM-augmented NMT using source similarity search. To isolate the effect of architectural changes in NMT models, we replace our cross-lingual memory retriever with traditional source-side similarity search. Specifically, we use the fuzzy match system used in Xia et al. (2019) and many others, which is based on BM25 and edit distance.
3. TM-augmented NMT using pre-trained cross-lingual retriever. To study the effect of end-to-end task-specific optimization of the retrieval model, we pre-train the retrieval model using the cross-alignment tasks introduced in §3.4 and keep it fixed in the following NMT training.
4. Our full model using a fixed TM index. After pre-training, we fix the parameters of E tgt during NMT training.
5. Our full model trained with asynchronous index refresh.

Results
The results of the above models are presented in Table 2. We have the following observations: (1) Our full model trained with asynchronous index refresh (model #5) delivers the best performance on test sets across all four translation tasks, outperforming the non-TM baseline (model #1) by 3.26 BLEU points on average and up to 3.86 BLEU points (De⇒En). This result confirms that monolingual TM can boost NMT performance; (2) The end-to-end learning of the retrieval model is the key to substantial performance improvement. Using a pre-trained fixed cross-lingual retriever only gives moderate test performance, fine-tuning E src and fixing E tgt significantly boosts the performance, and fine-tuning both E src and E tgt leads to the strongest performance (model #5 > model #4 > model #3); (3) Cross-lingual retrieval (model #4 and model #5) obtains better results than source similarity search (model #2). This is remarkable since the cross-lingual retrieval only requires monolingual TM, while the source similarity search relies on bilingual TM. We attribute the success, again, to the end-to-end adaptability of our cross-lingual retriever. This is manifested by the fact that model #3 even slightly underperforms model #2 on some of the translation tasks.

Contrast to Previous Bilingual TM Systems
We also compare our results with the best previously reported models. We can see that our results significantly outperform prior art. Notably, our best model (model #5) surpasses the best reported model (Xia et al., 2019) by 1.69 BLEU points on average and up to 2.9 BLEU points (De⇒En). This result verifies the effectiveness of our proposed models. In fact, our translation model using traditional similarity search (model #2) already outperforms the best previously reported results, which reveals that the architectural design of our translation model is surprisingly effective despite its simplicity.

Low-Resource Scenarios
A distinctive characteristic of our proposed model is that it uses monolingual TM. This motivates us to conduct experiments in low-resource scenarios, where we use extra monolingual data in the target language to boost translation quality.
Data We create low-resource scenarios by randomly partitioning each training set in the JRC-Acquis corpus into four subsets of equal size. We set up two series of experiments: (1) We only use the bilingual pairs in the first subset and gradually enlarge the TM by including more monolingual data in other subsets.
(2) Similar to (1), but we instead use the bilingual pairs in the first two subsets.
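The data setup in (1) and (2) can be sketched as follows. This is a hypothetical illustration rather than the authors' preprocessing script: the function names, the fixed seed, and the choice to let the TM always contain the target side of the bilingual subsets are assumptions, and any remainder after dividing by four is simply dropped.

```python
import random


def partition(pairs, k=4, seed=0):
    """Randomly split (source, target) pairs into k subsets of equal size."""
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)
    size = len(shuffled) // k  # leftover pairs, if any, are dropped
    return [shuffled[i * size:(i + 1) * size] for i in range(k)]


def build_setting(subsets, n_bilingual, n_tm):
    """Bilingual training pairs come from the first n_bilingual subsets.

    The TM holds only target-side sentences, taken from the first n_tm
    subsets, so subsets beyond n_bilingual contribute monolingual data only.
    """
    train = [pair for subset in subsets[:n_bilingual] for pair in subset]
    tm = [tgt for subset in subsets[:n_tm] for (_, tgt) in subset]
    return train, tm


# Toy corpus of eight sentence pairs.
corpus = [(f"src{i}", f"tgt{i}") for i in range(8)]
subsets = partition(corpus)
# Series (1): one quarter of the bilingual pairs, TM grown to all four quarters.
train, tm = build_setting(subsets, n_bilingual=1, n_tm=4)
```

Varying `n_tm` from `n_bilingual` up to 4 while holding `n_bilingual` fixed reproduces the "gradually enlarge the TM" axis of the experiment.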
Results Figure 2 shows the main results on the test sets. The general patterns are consistent across all experiments: the larger the TM becomes, the better translation performance the model achieves. When using all available monolingual data (4/4), the translation quality is boosted significantly. Interestingly, the performance of models without retraining is comparable to, if not better than, those with re-training. We also observe that when the training pairs are very scarce (only 1/4 bilingual pairs are available), a small size of TM even hurts the model performance. The reason could be overfitting. We speculate that better results would be obtained by tuning the model hyper-parameters according to different TM sizes.
Contrast to Back-Translation We compare our models with back-translation (BT) (Sennrich et al., 2016), a popular way of utilizing monolingual data for NMT. We train a target-to-source Transformer Base model using bilingual pairs and use the resultant model to translate monolingual sentences to obtain additional synthetic parallel data. As shown in Table 3, our method performs better than BT with 2/4 bilingual pairs but performs worse with 1/4 bilingual pairs. Interestingly, the combination of BT and our method yields significant further gains, which demonstrates that our method is not only orthogonal but also complementary to BT.

Non-parametric Domain Adaptation
Lastly, the "plug and play" property of TM further motivates us to study domain adaptation, where we adapt a single general-domain model to a specific domain by using domain-specific monolingual TM.
Data To simulate a diverse multi-domain setting, we use the data splits in Aharoni and Goldberg (2020) originally collected by Koehn and Knowles (2017

Results
The results are presented in Table 4. We can see that when only using the bilingual data, the TM-augmented model obtains higher BLEU scores in domains with less data but slightly lower scores in other domains compared to the non-TM baseline. However, as we switch the TM to domain-specific TM, the translation quality is significantly boosted in all domains, improving over the non-TM baseline by an average of 1.85 BLEU points, with improvements as large as 2.57 BLEU points on Law and 2.51 BLEU points on Medical. We also attempt to combine all domain-specific TMs into one and use it for all domains (the last row in Table 4). However, we do not obtain a noticeable improvement. This suggests that out-of-domain data provides little help, so a smaller in-domain TM is sufficient, which is also confirmed by the fact that about 90.21% of the retrieved sentences come from the corresponding domain in the combined TM.

Running Speed
With the help of a FAISS in-GPU index, search over millions of vectors can be made very efficient (often in tens of milliseconds). In our implementation, the memory search performs even faster than naive BM25. For the results in Table 2, taking the vanilla Transformer Base model (model #1) as the baseline, the inference latency of our models (both model #4 and model #5) is about 1.36 times that of the baseline (all use a single NVIDIA V100 GPU). Note that the corresponding number for the previous state-of-the-art model (Xia et al., 2019) is 1.80. As for training cost, the average time per training step of model #4 and model #5 is 2.62 times and 2.76 times that of the baseline respectively, which is on par with traditional TM-augmented baselines (model #2 is 2.59 times) (all use two NVIDIA V100 GPUs). Table 5 presents the results. In addition, we also observe that memory-augmented models converge much faster than vanilla models in terms of training steps.

Conclusion
We introduced an effective approach that augments NMT models with monolingual TM. We show that a task-specific cross-lingual memory retriever can be learned by end-to-end MT training. Our approach achieves new state-of-the-art results on several datasets, leads to large gains in low-resource scenarios where the bilingual data is limited, and can specialize an NMT model for specific domains without further training. Future work should aim to build upon our proposed framework. Two obvious directions are: (1) Even though our experiments validated that the whole framework can be learned from scratch using standard MT corpora, it is possible to initialize each model component in our framework with massively pre-trained models for performance enhancement; and (2) The NMT model can benefit from aggregating over a set of diverse memories, which is not explicitly encouraged in the current design.