Fast and Accurate Neural Machine Translation with Translation Memory

It is generally believed that a translation memory (TM) should be beneficial for machine translation tasks. Unfortunately, existing work demonstrates the superiority of TM-based neural machine translation (NMT) only on TM-specialized translation tasks rather than on general tasks, and with non-negligible computational overhead. In this paper, we propose a fast and accurate approach to TM-based NMT within the Transformer framework: the model architecture is simple and employs a single bilingual sentence as its TM, leading to efficient training and inference, and its parameters are effectively optimized through a novel training criterion. Extensive experiments on six TM-specialized tasks show that the proposed approach substantially surpasses several strong baselines that use multiple TMs, in terms of both BLEU and running time. In particular, the proposed approach also advances the strong baselines on two general tasks (WMT news Zh→En and En→De).


Introduction
A translation memory (TM) is originally collected from the translation history of professional translators, and provides the most similar source-target sentence pairs for the source sentence to be translated (Garcia, 2009; Koehn and Senellart, 2010b; Utiyama et al., 2011; Robinson, 2012; Huang et al., 2021). A TM generally provides valuable translation information, particularly for those input sentences that closely match the source sentences in the TM, and many efforts have been devoted to integrating a TM into statistical machine translation (Simard and Isabelle, 2009; Koehn and Senellart, 2010a; Ma et al., 2011; Wang et al., 2013). Recently, there has been increasing interest in improving neural machine translation (NMT) with a
TM (Li et al., 2016; Farajian et al., 2017; Gu et al., 2018; Xia et al., 2019; Bulte and Tezcan, 2019; Xu et al., 2020). Many notable approaches have been proposed to augment an NMT model with a TM. For example, Zhang et al. (2018) and He et al. (2019) extract scored n-grams from a TM and then reward each partial translation once it matches an extracted n-gram during beam search. Gu et al. (2018) and Xia et al. (2019) use an auxiliary network to encode a TM and then integrate it into the NMT architecture. Bulte and Tezcan (2019) and Xu et al. (2020) employ data augmentation to train an NMT model whose training instances are bilingual sentences augmented with their TMs. Despite their improvements on the TM-specialized translation tasks (i.e., the JRC-Acquis corpora), where a TM is very similar to the test sentences, these approaches incur considerable computational overhead in either training or testing, and in particular it is unclear whether they can deliver gains over standard NMT on general tasks, where a TM is not very similar to the test sentences. Indeed, both Zhang et al. (2018) and Xu et al. (2020) reported failures on WMT news translation tasks.
In this paper, we present a fast and accurate approach to TM-based NMT that can be applied to general translation tasks as well as TM-specialized tasks. We first design a light-weight TM-based NMT model for efficiency: its TM consists of a single bilingual sentence, and we explore several variant ways to encode the TM. The designed model also outperforms strong TM-based baselines. Second, we analyze its translation performance in depth and observe a robustness issue: performance decreases significantly for those input sentences that are not very similar to their TMs, although it improves substantially for other inputs. To address this issue, we propose a novel training criterion for optimizing the parameters of our model, inspired by multiple-task learning (van Dyk and Meng, 2001; Ben-David and Borbely, 2008; Qiu et al., 2013). The loss function includes two terms: the first is induced by the bilingual corpus with a TM, whereas the second is induced by the bilingual corpus without any TM. In this way, the TM-based NMT model gains better performance and is robust when translating input sentences regardless of whether they are similar to their TMs. Additionally, this makes it possible for a single unified model to handle both translation situations (with or without a TM), which is practical for online services.
To validate the effectiveness of the proposed approach, we conduct extensive experiments on eight translation tasks, including both TM-specialized tasks and general tasks (WMT). Our experiments show that the proposed approach is faster than several strong TM-based baselines, and it further delivers substantial gains (up to 4.7 BLEU points) over those baselines on TM-specialized tasks, leading to gains of up to 8.5 BLEU points over standard Transformer-based NMT. In particular, it also outperforms strong baselines on two general translation tasks, with a gain of 0.7 BLEU points on the WMT14 En→De task and 1.0 BLEU point on the WMT17 Zh→En task. This paper makes the following contributions:
• It points out a critical issue about robustness when training TM-based NMT models and provides an elegant method to address this issue.
• It proposes a simple TM-based NMT model that outperforms strong TM-based baselines in terms of both translation quality and speed.
• It verifies that a well-designed TM-based translation model is able to advance strong MT baselines on general translation tasks where a TM is not very similar to input source sentences.

Preliminary on NMT
Suppose x = {x_1, ..., x_n} is a source sentence and y = {y_1, ..., y_m} is the corresponding target sentence. From the probabilistic perspective, NMT models the conditional probability of the target sentence y given the source sentence x. Formally, for a given x, NMT aims to generate the output y according to the conditional probability P(y | x) defined by neural networks:

P(y | x) = ∏_{i=1}^{m} P(y_i | x, y_{<i}),

where y_{<i} = {y_1, ..., y_{i-1}} denotes a prefix of y, and each factor P(y_i | x, y_{<i}) is defined as follows:

P(y_i | x, y_{<i}) = softmax(φ(h_i^{D,L})),

where h_i^{D,L} indicates the i-th hidden unit at the L-th layer in the decoding phase under the encoder-decoder framework (Bahdanau et al., 2016), and φ is a linear network that projects hidden units onto vectors with the dimension of the target vocabulary.
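The chain-rule factorization above can be sketched in a few lines. Here `token_log_prob` is a hypothetical stand-in for the neural factor P(y_i | x, y_<i); it is not an interface from the paper.

```python
import math

def sentence_log_prob(x, y, token_log_prob):
    # Chain rule: log P(y | x) = sum_i log P(y_i | x, y_<i)
    total = 0.0
    for i, token in enumerate(y):
        total += token_log_prob(x, y[:i], token)
    return total

# Toy factor: a uniform distribution over a 4-word target vocabulary.
uniform = lambda x, prefix, token: math.log(1.0 / 4.0)

lp = sentence_log_prob(["hallo", "welt"], ["hello", "world"], uniform)
# Two factors of probability 1/4 each, so lp == log(1/16).
```

In a real system the factor would be computed by the decoder network; the factorization itself is unchanged.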
Recently, self-attention networks have attracted much interest due to their flexibility in parallel computation and in modeling h_i^{D,L}. The state-of-the-art NMT model is Transformer (Vaswani et al., 2017), which uses stacked self-attention and fully connected layers for its encoder and decoder. Self-attention relies on an attention mechanism to compute a representation of a sequence. In Transformer, there are three kinds of attention mechanisms: encoder multi-head attention, decoder masked multi-head attention, and encoder-decoder multi-head attention. Attention with H heads can be calculated by the following equations:

u_j = softmax(φ_j(q) ψ_j(u)^T / √d) ψ_j(u),    Att(q, u) = [u_j]_{j=1}^{H},

where q is a query vector and u is a two-dimensional matrix, [u_j]_{j=1}^{H} denotes the concatenation of all vectors u_j, and φ_j and ψ_j stand for two linear projections from one matrix to another, respectively. The 1/√d term is the scaling factor, and d is the dimension of q. We refer enthusiastic readers to Vaswani et al. (2017) for detailed definitions.
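A minimal single-head sketch of scaled dot-product attention may help; identity projections stand in for the learned per-head projections φ_j and ψ_j, and in the multi-head case the H head outputs would be concatenated. These simplifications are ours, not the paper's.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(v - m) for v in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(q, U):
    """One head of scaled dot-product attention: score the query q against
    every row of U, scale by 1/sqrt(d), softmax-normalise, then return the
    weighted average of the rows of U."""
    d = len(q)
    scores = [sum(qi * ui for qi, ui in zip(q, u)) / math.sqrt(d) for u in U]
    w = softmax(scores)
    return [sum(w[j] * U[j][k] for j in range(len(U))) for k in range(len(q))]

out = attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
# out is a convex combination of the two one-hot rows, leaning toward the first.
```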

Model Architecture
In this section, in order to better bridge TM and NMT, we propose a TM-based NMT architecture within the Transformer framework. To make the proposed model both fast and accurate, we first present a TM configuration that makes the model efficient. Then we explore three different methods to encode the TM into a sequence of vectors in a coarse-to-fine manner. Finally, we propose the architecture that decodes a target word given an input source sentence and its TM representation.

TM Configuration
Following previous works (Gu et al., 2018; Zhang et al., 2018; Xia et al., 2019), for each source sentence x we employ Apache Lucene (Bialecki et al., 2020) to retrieve the top-100 similar bilingual sentences from the training data. Then we adopt the following similarity to re-rank the retrieved bilingual sentences and keep the top-K (K < 100) bilingual sentences as the TM for x:

sim(x, x_tm) = 1 − dist(x, x_tm) / max(|x|, |x_tm|),

where dist denotes the edit distance, and x_tm is a retrieved source sentence from the training data whose reference is y_tm. Previous studies show that the best translation quality is achieved when the size K of the TM is larger than 1. For example, the optimized K is set to 5 in Gu et al. (2018) and Xia et al. (2019), and it is even set to 100 in Zhang et al. (2018). Unfortunately, such a large K significantly decreases the translation speed because the computational complexity is linear in K. To make our inference as efficient as possible, we set K = 1 and employ the most similar bilingual sentence, denoted by ⟨x_tm, y_tm⟩, as the TM for x.¹
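The re-ranking step can be sketched as follows, assuming the common fuzzy-match score sim(x, x_tm) = 1 − dist(x, x_tm)/max(|x|, |x_tm|) over word sequences; the first-stage candidate list would come from a retriever such as Lucene.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance, single-row dynamic programming."""
    dp = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, wb in enumerate(b, 1):
            # dp[j]: previous row; dp[j-1]: current row; prev: diagonal
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (wa != wb))
    return dp[len(b)]

def similarity(x, x_tm):
    # sim(x, x_tm) = 1 - dist(x, x_tm) / max(|x|, |x_tm|)
    return 1.0 - edit_distance(x, x_tm) / max(len(x), len(x_tm))

def rerank(x, candidates, k=1):
    """candidates: (x_tm, y_tm) pairs from a first-stage retriever.
    Keep the top-k by fuzzy-match similarity (k = 1 in the paper)."""
    return sorted(candidates, key=lambda c: similarity(x, c[0]), reverse=True)[:k]
```

With K = 1, `rerank(x, candidates)` returns the single most similar ⟨x_tm, y_tm⟩ pair.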

Encoding TM
In this subsection, we describe how to encode the TM ⟨x_tm, y_tm⟩ into a sequence of vectors m. Three variant methods for encoding a TM are illustrated in the right part of Figure 1.
(Footnote 1: We also ran experiments with K = 2 and K = 4 in our proposed model, but we did not observe significant gains.)
Method 1: sentence (TF-S)  Given ⟨x_tm, y_tm⟩ for x, the first method utilizes the word embedding and position embedding of y_tm to represent m as follows:

m_j = E_w(y_tm^j) + E_p(j),  1 ≤ j ≤ J,

where E_w and E_p are the word embedding and position embedding respectively, J is the length of y_tm, and the symbol + denotes a simple addition operator.
Method 2: sentence with score (TF-SS)  The first method is agnostic to the similarity score. Intuitively, if a TM ⟨x_tm, y_tm⟩ has high similarity, y_tm may be more helpful for predicting a good translation. So the second method takes the similarity score into account and defines m as follows:

m_j = sim(x, x_tm) × (E_w(y_tm^j) + E_p(j)),

where sim(x, x_tm) is the similarity score and the symbol × denotes scalar multiplication.
Method 3: sentence with alignment (TF-SA)  As shown in Figure 1, x_tm consists of parts matched to x (in orange) and unmatched parts (in dark color). Since the words in the TM are not all of the same importance to the source sentence x, we should pay more attention to the words in the matched parts. We therefore obtain the word alignment between x_tm and y_tm with the fast-align toolkit (Dyer et al., 2013). Suppose A_tm is the word alignment between x_tm and y_tm: A_tm^j = 1 denotes that y^j is aligned to some x_i that also appears in x, and A_tm^j = 0 otherwise. The third method then defines m as follows:

m = A_tm ∘ (E_w(y_tm) + E_p),

where the symbol ∘ denotes an operator between a vector and a matrix such that (a ∘ U)_j = a_j × U_j, i.e., the j-th row of U is scaled by a_j.
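The three encoding variants can be sketched as one function over toy embedding tables. The exact form of the alignment operator is our assumption (per-token scaling of the embedding by the alignment indicator); the tables `E_w` and `E_p` stand in for learned parameters.

```python
def encode_tm(y_tm, E_w, E_p, score=None, align=None):
    """TF-S:  m_j = E_w[y_tm[j]] + E_p[j].
    TF-SS: additionally scale every m_j by the similarity score.
    TF-SA: scale m_j by the alignment indicator align[j] (our sketch
    of the vector-matrix operator)."""
    m = []
    for j, w in enumerate(y_tm):
        v = [a + b for a, b in zip(E_w[w], E_p[j])]
        if score is not None:   # TF-SS: scalar similarity weighting
            v = [score * c for c in v]
        if align is not None:   # TF-SA: per-token alignment weighting
            v = [align[j] * c for c in v]
        m.append(v)
    return m

# Toy 2-dimensional embeddings (hypothetical values).
E_w = {"hello": [1.0, 0.0], "world": [0.0, 1.0]}
E_p = [[0.1, 0.1], [0.2, 0.2]]
m = encode_tm(["hello", "world"], E_w, E_p)                 # TF-S
m_ss = encode_tm(["hello", "world"], E_w, E_p, score=0.5)   # TF-SS
```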

TM Augmented NMT
Suppose the encoded TM ⟨x_tm, y_tm⟩ is denoted by m, a sequence of vectors. We aim to build a model P(y_i | x, y_<i, m) for the source sentence x, given m and the prefix translation y_<i at time step i, leading to the entire translation model:

P(y | x, x_tm, y_tm; θ) = ∏_i P(y_i | x, y_<i, m; θ),    (9)

where θ denotes the parameters of our proposed model.

Example Layer  The model architecture of P(y_i | x, y_<i, m) is illustrated in the left part of Figure 1; it is generally similar to the standard Transformer, and the core component is the Example Layer. Specifically, the Example Layer includes two multi-head attention operators: the left multi-head attention (i.e., MH-Att(y_<i, y_<i)) is the same as in Transformer and is defined on the prefix translation y_<i; the right multi-head attention (i.e., MH-Att(y_<i, y_tm)) attempts to capture information from the TM, with its query from y_<i while the key and value are from the TM representation m. After the two parallel attention operators, the two resulting sequences are passed to an Add & Norm operator, and a new sequence is obtained as the query for the next multi-head attention (i.e., MH-Att(y_<i, x)). The following sub-layer is the same as in Transformer, and P(y_i | x, y_<i, m) is obtained similarly to the definition of standard NMT P(y_i | x, y_<i) presented in Section 2. We omit the formal equations for P(y_i | x, y_<i, m) due to space limitations.
In summary The entire model architecture is illustrated in Figure 1: the dashed box in the right part shows the memory encoder, and the left part shows how the memory representation is used in the NMT model similar to the Transformer. In our model architecture, the encoder block contains two sub-layers and the decoder block contains three sub-layers. The core sub-layer in the decoder block is our proposed Example Layer, which consists of multi-head attention and cross attention. By introducing the memory encoder and Example Layer, the parameters in our model are increased only by 8.96% compared to the standard NMT baseline.
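A schematic of one Example Layer pass can make the data flow concrete. This is a sketch under heavy simplifications of ours: a single head, identity projections, no residual connections or layer normalization, and the two parallel attention outputs are simply summed where the model applies Add & Norm.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(v - m) for v in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(queries, memory):
    """Single-head scaled dot-product attention: for each query row,
    compute weights over memory rows, then average the memory rows."""
    d = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        w = softmax([sum(a * b for a, b in zip(q, mm)) / d for mm in memory])
        out.append([sum(w[j] * memory[j][k] for j in range(len(memory)))
                    for k in range(len(memory[0]))])
    return out

def example_layer(y_prefix, x_enc, m):
    self_att = attend(y_prefix, y_prefix)   # MH-Att(y_<i, y_<i)
    tm_att = attend(y_prefix, m)            # MH-Att(y_<i, y_tm): TM branch
    fused = [[a + b for a, b in zip(u, v)]  # stand-in for Add & Norm
             for u, v in zip(self_att, tm_att)]
    return attend(fused, x_enc)             # MH-Att(y_<i, x): cross-attention
```

The fused sequence then feeds the usual feed-forward sub-layer, exactly as in a standard Transformer decoder block.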

Training
Suppose the training corpus is D = {⟨x^i, y^i⟩}_{i=1}^{N}, where each ⟨x^i, y^i⟩ is a bilingual sentence pair, and ⟨x^i_tm, y^i_tm⟩ is the related TM, which consists of a single bilingual sentence. Our goal is to learn the parameters θ of the TM-based NMT model P(y | x, x_tm, y_tm; θ) defined in Eq. (9) using D.
The common wisdom is to optimize the parameters under maximum likelihood estimation (MLE), i.e., standard training. Formally, it minimizes the following criterion:

ℓ_mle(θ) = − Σ_{⟨x, y⟩ ∈ D} log P(y | x, x_tm, y_tm; θ).

Robustness issue  Unfortunately, the model trained with MLE suffers from a robustness issue, even though its overall performance is much better than the standard Transformer and outperforms TM-based baselines on the Es→En task. According to our experiments (see Table 4 later), our proposed model performs worse than the Transformer for those sentences that do not have a similar TM. As a result, it would be dangerous to use the model for online services, because users may provide an input sentence whose TM is not similar to itself. A possible reason for this issue is as follows. On average, the reference y is strongly correlated with its TM target y_tm in the training corpus D; for example, the average similarity score is about 0.58 for the Es→En translation task, according to our statistics. Because of the powerful fitting ability of neural networks, the model parameters are guided to depend heavily on the given TM target y_tm during training. Consequently, if an input source sentence x has high similarity with its given TM, the model outputs high-quality results, as we also observed in Table 5. On the contrary, once an input sentence is provided with a TM ⟨x_tm, y_tm⟩ of low similarity (for instance, a similarity between 0 and 0.3, as shown in Table 4), the translation quality of its output decreases rapidly.
Training criterion  To avoid this TM over-fitting, we propose a simple yet elegant method, inspired by data augmentation (van Dyk and Meng, 2001; Zhong et al., 2020) and multiple-task learning (Ben-David and Borbely, 2008; Qiu et al., 2013; Liu et al., 2016). Specifically, we first construct another corpus D_0 = {⟨x^i, y^i, ⟨null, null⟩⟩}_{i=1}^{N} from D. In the constructed corpus, ⟨null, null⟩ plays the role of a TM, but both the source and target sides of the TM are empty sentences. Then we train the model P(y | x, x_tm, y_tm; θ) using both D and D_0, i.e., joint training, which is similar to multiple-task learning. Formally, we minimize the following joint loss function:

ℓ_joint(θ) = ℓ_D(θ) + λ ℓ_{D_0}(θ),

where λ > 0 is a coefficient that trades off the two loss terms. Intuitively, the first term, induced by D, guides the model to use the information from a TM for prediction, and thereby it will generate accurate translations for those input source sentences whose TM has high similarity. On the other hand, the second term, induced by D_0, teaches the model to output good translations without information from a TM. Additionally, this makes it possible for a single unified model to handle both translation scenarios (with or without a TM), which is practical for online services.
Note that the proposed method is slightly different from standard data augmentation (Sennrich et al., 2016a; Fadaee et al., 2017; Fadaee and Monz, 2018) and multiple-task learning (Dong et al., 2015; Kiperwasser and Ballesteros, 2018; Wang et al., 2020) in NMT research. Data augmentation techniques automatically generate pseudo data based on the original training data and then train a model using both the original and generated data, whereas the dataset D_0 is directly taken from the original D in our scenario. Also, multiple-task learning in those works typically involves different models that share some partial parameters rather than all parameters. In contrast, both terms in our joint loss correspond to the same task, i.e., translation prediction given a source sentence and its TM, and both models are exactly the same.

Algorithm 1: Joint Training Algorithm. Input: mini-batch size b, maximal iteration M, a learning-rate schema η, and the two corpora D and D_0. At each iteration, sample a mini-batch from D and a mini-batch from D_0, compute the joint gradient ∆, and update the parameters: θ = θ − η_t ∆.

The detailed joint training algorithm is presented in Algorithm 1. It follows the standard gradient descent method for optimization. Note that in lines 2 and 3, it samples two mini-batches that do not share the same bilingual sentences, to promote diversity; i.e., the mini-batches from D and D_0 are independently and randomly sampled. In our experiments, we employ Adam (Kingma and Ba, 2014) with default settings as the learning-rate schema.
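The joint training loop can be sketched as follows. `grad_fn` is a hypothetical stand-in for back-propagation through the TM-based NMT model, and the placement of λ on the null-TM term follows our reading of the joint loss; a plain SGD update replaces Adam for brevity.

```python
import random

def joint_train(params, D, steps, batch_size, lr, grad_fn, lam=1.0):
    """Sketch of Algorithm 1: per iteration, draw two independent
    mini-batches; the first keeps its retrieved TM, the second replaces
    the TM with <null, null>; combine the gradients with weight lam."""
    for _ in range(steps):
        batch_tm = random.sample(D, batch_size)            # <x, y, x_tm, y_tm>
        batch_null = [(x, y, None, None)                   # <x, y, null, null>
                      for (x, y, _, _) in random.sample(D, batch_size)]
        grad = [g1 + lam * g2
                for g1, g2 in zip(grad_fn(params, batch_tm),
                                  grad_fn(params, batch_null))]
        params = [p - lr * g for p, g in zip(params, grad)]
    return params
```

Because the null-TM batch is drawn independently, the two mini-batches generally contain different bilingual sentences, matching the diversity argument above.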

Experiments
In this section, we validate the effectiveness of the proposed approach along three dimensions: robustness in handling both translation situations (with or without a TM); running efficiency compared with previous TM-based NMT models; and translation quality on both TM-specialized tasks and general MT tasks. We use the case-insensitive BLEU score (Papineni et al., 2002) as the automatic metric for translation quality evaluation.

Setup
TM-specialized tasks We evaluate our proposed models with the JRC-Acquis corpora, which include three language pairs and lead to six translation tasks in total: English↔German (En↔De),

Baseline systems
We compare our proposed model with the following strong baselines:
• TF (Vaswani et al., 2017): the standard Transformer.
• TF-P (Zhang et al., 2018): re-implemented on top of Transformer by ourselves.
• TF-G (Xia et al., 2019) and TF-SEQ (Gu et al., 2018): TF-SEQ is a mimic implementation over Transformer by Xia et al. (2019). We report the results from Xia et al. (2019), since both systems were implemented over Transformer and are thus directly comparable.
• FM+ (Xu et al., 2020): since Xu et al. (2020) adopt a different split of the JRC corpus, their reported results are not comparable to ours. For a fair comparison, we re-implement the strong model FM+ as a baseline; it uses the same metric as ours to retrieve a TM and is better than the method of Bulte and Tezcan (2019).
Our models In the case of the three methods proposed in this paper, TF-S, TF-SS and TF-SA refer to the method encoding TM by the sentence, sentence with score, and sentence with alignment, respectively. We optimize their parameters through both standard training and joint training. For joint training, the hyperparameter λ is set to be 1 for all translation tasks.
System configuration  For a fair comparison, we employ the same settings to train all baselines and our models, and all models are optimized with Adam using the default hyper-parameters. The details of the settings are shown in Table 2.

Results and Analysis on Es→En Task
Standard training and robustness issue  We first evaluate the proposed models under the standard training criterion. Table 3 shows the comparison among the different TM encoding methods for our models. From this table, we can see that our models achieve substantial improvements over the Transformer (TF), which does not use any TM, even though our models are simple and utilize only a single bilingual sentence as the TM. TF-SA performs better than TF-S and TF-SS thanks to the fine-grained alignment information encoded in the TM. Also, TF-SA outperforms all TM-based baselines by at least 1.0 BLEU point (compare with Table 6). In addition, we examine how the similarity of a TM influences our models. We divide the test dataset into ten subsets according to the similarity score and report the results in Table 4. We find that the gains of our models over the TF baseline come mainly from those sentences whose TMs have relatively high similarity. To our surprise, our models perform worse than TF on the subsets with relatively low similarity, except for the subset with the lowest similarity.⁵ This result demonstrates that our models with standard training are not robust to similarity scores, as explained in depth in the previous section.
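The ten-way split of the test set by TM similarity can be sketched as follows; each example is paired with the fuzzy-match score of its retrieved TM, and BLEU would then be reported per band.

```python
def bucket_by_similarity(examples, n_bins=10):
    """Group test examples into similarity bands [0, 0.1), ..., [0.9, 1.0]
    for a per-band BLEU analysis."""
    bins = [[] for _ in range(n_bins)]
    for ex, score in examples:
        idx = min(int(score * n_bins), n_bins - 1)  # clamp score == 1.0
        bins[idx].append(ex)
    return bins

# Hypothetical (sentence, similarity-score) pairs.
bins = bucket_by_similarity([("s1", 0.05), ("s2", 0.58), ("s3", 1.0)])
```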
Joint training Luckily the robustness issue can be fixed well by joint training, as depicted in the right part of Table 4. We can see that our model is better than the baseline TF on the subset of [0, 0.3), and it substantially outperforms TF on the subset of [0.3, 1). With the help of joint training, TF-SA delivers gains of 1.2 BLEU points over standard training, and gains of 5.7 BLEU points over the strong TF baseline on the entire test set.
Therefore, in the rest of the experiments, we employ joint training to set up all of our models because it is robust to the low similarity of TMs.
Without TM or with Ref as TM  The situation without any TM and the situation with the reference as a TM are two more extreme cases of the robustness issue. As reported in Table 5, if a perfect TM is provided to our models, they yield excellent translation results. Besides, the proposed methods are not inferior to the standard Transformer when no TM is provided. As a result, the proposed model makes it possible for a single unified model to handle both translation situations (with or without a TM), which is practical for online services.
(Footnote 5: We further checked these two exceptional sentences and found that they are very short. In particular, their word alignment results from the fast-align toolkit are very good, which may be beneficial to our proposed model. This might be why our proposed model advances the baseline Transformer on them.)
Noisy TM  To validate whether the model works well with noisy TMs, we conduct a quick experiment by adding noise to the TMs of the test set: we randomly replace words in the target side of each TM with incorrect words. After replacing one and two words, the proposed TF-SA achieves 68.17 and 67.94 BLEU points, respectively. Both results are slightly worse than the noise-free TF-SA (68.49) but still better than the best TM-based baseline (66.21). Note that both results are obtained without retraining the TF-SA model on noisy TMs. This demonstrates that our model is robust even to noisy TMs, and thus it is useful for online TM scenarios.

Comparison with baselines
Table 6 shows the results of the proposed model TF-SA and the baselines. TF-SA clearly surpasses all TM-based baselines by a substantial margin. In detail, TF-SA outperforms TF-P and TF-SEQ by about 3.2 BLEU points, FM+ by about 2.6 BLEU points, and the strong baseline TF-G by about 2.2 BLEU points.

Running time  Since all TM-based models employ the same retrieval metric and their retrieval time is exactly the same, we report the running time of all TM-based NMT models excluding retrieval time in Table 7. As reported there, besides achieving better translation performance, our proposed model saves significant running time over TF-SEQ and TF-G in both training and testing. In addition, although it requires slight overhead in training, its testing is more efficient than TF-P; and our training is faster than FM+.
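The noisy-TM probe described above (randomly replacing target-side TM words at test time, without retraining) can be sketched as follows; the vocabulary and seed are illustrative assumptions.

```python
import random

def corrupt_tm(y_tm, vocab, n_replacements, seed=0):
    """Replace n_replacements randomly chosen positions of the TM target
    with words sampled from vocab, mimicking the test-time perturbation."""
    rng = random.Random(seed)
    y = list(y_tm)
    for pos in rng.sample(range(len(y)), min(n_replacements, len(y))):
        y[pos] = rng.choice(vocab)
    return y
```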

On the TM-specialized Datasets
The experimental results of all systems on the six translation tasks of the TM-specialized datasets are reported in Table 8. Several observations can be made. First, the baselines TF-P and TF-G achieve substantial gains over the strong baseline TF, outperforming it by 1.1 to 4.1 BLEU points. This result is in line with the findings in Zhang et al. (2018) and Xia et al. (2019). Second, on top of that, compared with the strongest baseline TF-G, our proposed TF-S, TF-SS and TF-SA obtain further gains of at least 1.2 and up to 4.9 BLEU points.

On the General WMT Datasets
It is important to mention that all previous TM-based approaches failed to obtain notable improvements on the general WMT datasets. Since Xia et al. (2019) did not conduct experiments on the WMT datasets and their implementation is not released, we compare our models with two baselines: TF and TF-P. Our experimental results on the general WMT datasets show that the method TF-P is only comparable to the baseline NMT, which is in line with the observation in Zhang et al. (2018). In contrast, our models perform well on these tasks: our best model gains about 0.7 BLEU points on the En→De task and 1.0 BLEU point on the Zh→En task over both baselines on average. These results demonstrate that a TM-based translation model can advance strong MT baselines on general translation tasks where a TM is not very similar to the input source sentences. Moreover, as shown in Table 5, our models achieve excellent translation results when a perfect TM is provided. In summary, based on the above extensive experimental results, our proposed models substantially surpass several baselines on TM-specialized tasks and general tasks, in terms of both BLEU and running time.

Related Work
In the statistical machine translation (SMT) paradigm, Koehn and Senellart (2010a) extract bilingual segments from a TM that match the source sentence to be translated, employ a heuristic score to decide whether the extracted segments should be used as decoding constraints, and then hard-constrain SMT to decode only the unmatched parts of the source sentence. Ma et al. (2011) design a fine-grained classifier, rather than the heuristic score, to make more reliable decisions. Simard and Isabelle (2009), Wang et al. (2013) and Wang et al. (2014) add the extracted bilingual segments to the translation table of SMT and then bias the decoder in a soft-constraint manner when decoding the source sentence with the augmented translation table. Liu et al. (2012) use the retrieved bilingual sentences to update the parameters of the log-linear model in SMT.
In recent years, many efforts have been made on neural machine translation (NMT) with a TM. Li et al. (2016) and Farajian et al. (2017) make full use of the retrieved TM sentence pairs to fine-tune the pre-trained NMT model on the fly. The most obvious drawback of fine-tuning is that the delay for test sentences is too long. To avoid the online tuning process, Zhang et al. (2018) and He et al. (2019) dynamically integrate translation pieces, based on n-grams extracted from the matched segments in the TM target, into the beam search stage. This second type of approach is efficient but depends heavily on a global hyper-parameter λ, which is sensitive to the development set, leading to inferior performance.
Recently, there have been notable approaches that further explore TM-based NMT. Bulte and Tezcan (2019) and Xu et al. (2020) propose data augmentation approaches that augment input sentences with a TM without modifying the NMT model architecture. Gu et al. (2018) and Xia et al. (2019) employ an auxiliary network to encode TMs and integrate it into the NMT architecture. Our model architecture is simpler than those of Gu et al. (2018) and Xia et al. (2019): we encode a single TM target sentence and apply simple attention mechanisms to the TM. The architecture is thus more efficient and leads to a faster translation speed than Gu et al. (2018) and Xia et al. (2019). In particular, we propose a novel training criterion that makes the TM-based NMT model more robust across translation situations (with or without a TM). In parallel with our work, Cai et al. (2021) extend the translation memory from the bilingual setting to the monolingual setting through a cross-lingual retrieval technique, and Khandelwal et al. (2021) report significant quality improvements on general translation tasks similar to ours, but their inference is two orders of magnitude slower than Transformer because they perform contextual word retrieval, whose search space is much larger than that of sentence retrieval.

Conclusion
This paper presents a simple TM-based NMT model that employs a single bilingual sentence as its TM and is thus fast in training and inference. Although the presented model with standard training outperforms strong TM-based baselines, it suffers from a robustness issue: its performance depends highly on the similarity of a TM. To address this issue, we propose a novel training criterion inspired by multiple-task learning and data augmentation. Experiments on TM-specialized tasks demonstrate its superiority over strong baselines in terms of running time and BLEU. It is also shown that a TM-based NMT model can advance the strong Transformer on general translation tasks such as WMT.