Towards User-Driven Neural Machine Translation

A good translation should not only convey the semantics of the original content, but also reflect the personal traits of the original text. For a real-world neural machine translation (NMT) system, these user traits (e.g., topic preference, stylistic characteristics and expression habits) are preserved in user behavior (e.g., historical inputs). However, current NMT systems rarely consider user behavior due to: 1) the difficulty of modeling user portraits in zero-shot scenarios, and 2) the lack of parallel datasets annotated with user behavior. To fill this gap, we introduce a novel framework called user-driven NMT. Specifically, we propose a cache-based module and a user-driven contrastive learning method that enable NMT to capture potential user traits from historical inputs in a zero-shot learning fashion. Furthermore, we contribute UDT-Corpus, the first Chinese-English parallel corpus annotated with user behavior. Experimental results confirm that the proposed user-driven NMT can generate user-specific translations.


Introduction
In recent years, neural machine translation (NMT) models (Luong et al., 2015; Vaswani et al., 2017) have shown promising quality and have thus attracted a growing number of users. When using a translation system, every user has their own traits, including topic preference, stylistic characteristics, and expression habits, which are implicitly embodied in their behavior, e.g., their historical inputs. A good translation should implicitly mirror user traits rather than merely translate the original content, as the example in Figure 1 shows. However, current NMT models are mainly designed for the semantic transformation between source and target sentences, regardless of the subtle traits revealed by user behavior. The effect of user behavior on translation modeling thus remains largely unexploited, which, to some extent, limits the applicability of NMT models in real-world scenarios.
* Jinsong Su is the corresponding author. This work was done when Huan Lin was interning at DAMO Academy, Alibaba Group.
1 We release our source code and the associated benchmark at https://github.com/DeepLearnXMU/User-Driven-NMT.
More recently, several studies have shown that prominent signals of personal characteristics, such as personality (Mirkin et al., 2015), gender (Rabinovich et al., 2017), and politeness (Sennrich et al., 2016a), can serve as inductive biases and be reflected in translation results through domain adaptation approaches. However, the signals explored so far characterize users along a single dimension, which is insufficient to represent fine-grained user traits. Furthermore, Michel and Neubig (2018) focus on personalized TED talk translation, in which they train a speaker-specific bias to revise the prediction distribution. In contrast with these studies, our work investigates a more realistic online scenario: a real-world MT system serves a large number of users, and user-behavior annotated data covering all of them is unavailable. Previous methods (Mirkin et al., 2015; Michel and Neubig, 2018) require the users in the training set and the test set to be the same, and therefore cannot deal with this zero-shot issue.
Starting from this concern, we explore user-driven NMT, which generates personalized translations for users unseen in the training dataset according to their behavior. Specifically, we choose historical inputs to represent user behavior, since they can not only be easily obtained in real-world scenarios, but also reflect the topic preference, stylistic characteristics, and context of the user. Moreover, compared with pre-defined or user-specific labels, historical inputs can be updated with the current source sentences, which is also in line with realistic scenarios.
In this work, we propose a novel framework for this task, in which the NMT model is equipped with a cache module to store and update historical inputs. Besides, in order to further transfer the traits of seen users to unseen ones, we design a regularization framework based on contrastive learning (Bose et al., 2018), which forces our model to decrease the divergence between translations for similar users while increasing the diversity between translations for dissimilar users.
In order to train and assess the proposed framework, we construct a new User-Driven Machine Translation dataset called UDT-Corpus. This corpus consists of 6,550 users with a total of 57,639 Chinese sentences collected from a real-world online MT system. Among them, 17,099 Chinese sentences are annotated with English translations by linguistic experts according to the user-specific historical inputs. Experimental results demonstrate that the proposed framework improves translation quality and generates diverse translations for different users.
To summarize, the major contributions of our work are four-fold:
• We introduce and explore the user-driven NMT task, which leverages user behavior to enhance the translation model. We hope our study can attract more attention to techniques on this topic.
• We propose a novel framework for user-driven NMT based on a cache module and contrastive learning, which is able to model user traits in zero-shot scenarios.
• We collect UDT-Corpus and make it publicly available, which may contribute to subsequent research in the communities of NMT and user-driven models.
• Extensive analyses indicate the effectiveness of our work and verify that NMT can profit from user behavior to generate diverse translations conforming to user traits.

Related Work
This section mainly includes the related studies of personalized machine translation, cache-based NMT and contrastive learning for NMT.
Personalized Machine Translation Recently, some researchers have employed domain adaptation (Zhang et al., 2019; Gururangan et al., 2020; Yao et al., 2020) to generate personalized translations. For example, Mirkin et al. (2015) show that the translation generated by an SMT model has an adverse effect on the prediction of author personalities, demonstrating the necessity of personalized machine translation. Furthermore, Sennrich et al. (2016a) control the politeness of the translation by adding a politeness label on the source side. Rabinovich et al. (2017) explore a gender-personalized SMT system that retains the original gender traits. Each of these domain labels represents users along a single dimension, which is insufficient to distinguish a large number of users in a fine-grained way. The work most closely related to ours is Michel and Neubig (2018), which introduces a speaker-specific bias into the conventional NMT model. However, these methods are unable to deal with users unseen at training time. In contrast, user-driven NMT can generate personalized translations for these unseen users in a zero-shot manner.
Cache-Based Machine Translation Inspired by the great success of caches in language modeling (Kuhn and de Mori, 1990; Goodman, 2001; Federico et al., 2008), Nepveu et al. (2004) propose a cache-based adaptive SMT system. Tiedemann (2010) explores a cache-based translation model that fills the cache with bilingual phrase pairs extracted from previous sentence pairs in a document. Bertoldi et al. (2013) use a cache mechanism to achieve online learning in phrase-based SMT. Gong et al. (2011), Kuang et al. (2018), and Tu et al. (2018) further exploit cache-based approaches to leverage contextual information for document-level machine translation. In contrast with document-level NMT, which learns to capture contextual information, our study aims at modeling user traits, such as topic preference, stylistic characteristics, and expression habits. Moreover, the historical inputs of a user have relatively fewer dependencies than the contexts used in document-level translation.
Contrastive Learning for NMT Contrastive learning has been extensively applied in the communities of computer vision and natural language processing due to its effectiveness and generality in self-supervised learning (Vaswani et al., 2013; Mnih and Kavukcuoglu, 2013; Liu and Sun, 2015; Bose et al., 2018). To improve the ability of NMT to capture global dependencies, Wiseman and Rush (2016) first introduce contrastive learning into NMT, where the ground-truth translation and the model output are considered as the positive and contrastive samples, respectively. Other work constructs contrastive examples by deleting words from the ground-truth translation to reduce word omission errors in NMT. In contrast to these studies, we employ contrastive learning to create broader learning signals for our user-driven NMT model, where the prediction distributions of translations with respect to similar users and dissimilar users are considered as positive and contrastive samples, respectively. Thus, our model can better transfer the knowledge of seen users to unseen ones.

User-Driven Translation Dataset
In order to build a user-driven NMT system, we construct a new dataset called UDT-Corpus, which contains 57,639 inputs from 6,550 users, of which 17,099 are Chinese-to-English translation examples.

Data Collection and Preprocessing
We collect raw examples from Alibaba Translate, which contain the user inputs and the translations given by the translation system.
For data preprocessing, we first anonymize the data and deduplicate it within each user. Then, we use a pre-trained n-gram language model, KenLM, to filter out translation examples with low-quality source data. Moreover, we remove pairs whose source sentence is shorter than 2 words or longer than 100 words.
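The deduplication and length filtering steps can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the function name `preprocess_user_inputs` is hypothetical, sentences are assumed to be pre-segmented so that words can be counted by splitting on whitespace, and the anonymization and KenLM-based quality filtering are left out.

```python
from typing import Dict, List


def preprocess_user_inputs(
    inputs_by_user: Dict[str, List[str]],
    min_len: int = 2,
    max_len: int = 100,
) -> Dict[str, List[str]]:
    """Deduplicate each user's inputs and drop sentences outside [min_len, max_len] words."""
    cleaned: Dict[str, List[str]] = {}
    for user, sentences in inputs_by_user.items():
        seen = set()
        kept = []
        for sent in sentences:
            n_words = len(sent.split())  # assumption: whitespace-segmented text
            if sent in seen or n_words < min_len or n_words > max_len:
                continue
            seen.add(sent)
            kept.append(sent)  # keep the first occurrence, preserving input order
        cleaned[user] = kept
    return cleaned


raw = {"u1": ["wear resistant shoes", "wear resistant shoes", "hi"]}
filtered = preprocess_user_inputs(raw)
```

Deduplication is applied per user, so the same sentence entered by two different users is kept for both.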

Data Annotation
In the corpus, we represent each translation example as a triplet $\langle X^{(u)}, Y^{(u)}, H^{(u)} \rangle$, where $H^{(u)}$ denotes the historical inputs of user $u$, $X^{(u)}$ is the current source sentence, and $Y^{(u)}$ is the target translation annotated with reference to $H^{(u)}$. To obtain such a triplet, we first sequentially sample up to 10 source sentences as the historical inputs of each user. Then, for the given historical inputs, we collect the source input that follows them, paired with the pseudo translation given by the translation system. Afterwards, we assign these historical inputs and the current input pairs to two professional annotators and ask them to revise the pseudo translation according to the source sentence and the historical inputs. Specifically, we first ask one of them to annotate and the other to evaluate, and then resolve annotation disagreements by reviewing. During annotation, 91.8% of the original data are revised. Moreover, annotators are asked to record whether their revision is affected by the user history. The result shows that 76.25% of the sentences are affected.
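The construction of one triplet from a user's log can be sketched as below. The `Example` class, the `build_example` function, and the representation of a log as chronologically ordered (source, system translation) pairs are assumptions made for illustration; the sketch also skips users with fewer than two inputs, which is a simplification.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Example:
    history: List[str]  # H^(u): up to `max_history` earlier inputs of the user
    source: str         # X^(u): the current source sentence
    target: str         # Y^(u): pseudo translation, to be revised by annotators


def build_example(
    user_log: List[Tuple[str, str]], max_history: int = 10
) -> Optional[Example]:
    """Use the first inputs as history and the input that follows them as the current one."""
    if len(user_log) < 2:
        return None  # simplification: require one historical input plus a current input
    n_hist = min(max_history, len(user_log) - 1)
    history = [src for src, _ in user_log[:n_hist]]
    source, target = user_log[n_hist]
    return Example(history, source, target)


log = [("s1", "t1"), ("s2", "t2"), ("s3", "t3")]
example = build_example(log)
```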

User-Driven NMT Framework
In this section, we first give a brief description of the problem formulation of user-driven NMT, and then introduce our proposed framework in detail. We choose Transformer (Vaswani et al., 2017) as the basic NMT model due to its competitive performance. In fact, our framework is model-agnostic and applicable to other NMT models. Figure 2 illustrates the basic framework of the proposed user-driven NMT. Specifically, we equip the NMT model with two user-specific caches to exploit user behavior for better translation (see Section 4.2). Besides, we augment the conventional NMT training objective with contrastive learning, which allows the model to learn translation diversity across users (see Section 4.3).

Problem Formulation
Given the source sentence $X$ and the previously generated words $Y_{<i} = y_1, ..., y_{i-1}$, the conventional NMT model with parameter $\theta$ predicts the current target word $y_i$ by $P(y_i \mid X, Y_{<i}; \theta)$. As a significant extension of conventional NMT, user-driven NMT with parameter $\theta$ aims to model $P(y_i^{(u)} \mid X^{(u)}, Y^{(u)}_{<i}, u; \theta)$, that is, to generate a translation that reflects the traits of user $u$. Unlike previous studies (Mirkin et al., 2015; Michel and Neubig, 2018), which only generate translations for users seen at training time, our user-driven NMT focuses on a more realistic online MT scenario, where the users at test time are unseen in the training dataset. Moreover, conventional domain adaptation methods cannot be directly applied to this zero-shot scenario.
Figure 2: The architecture of our user-driven NMT model. We use the topic cache and context cache to capture the long-term and short-term user traits for user $u$ from the corresponding historical inputs $H^{(u)}$, respectively. Then, we combine the representations of the two caches to get a user behavior representation $r^{(u)}$, which is fed into the NMT model for personalized translation. Furthermore, we use contrastive learning involving a similar user $u^+$ and a dissimilar user $u^-$ to increase the translation diversity among different users.

Cache-based User Behavior Modeling
Due to the advantages of the cache mechanism for dynamic representations (Gong et al., 2011; Kuang et al., 2018; Tu et al., 2018), we equip the conventional Transformer-based NMT model with two user-specific caches to leverage user behavior for NMT: 1) the topic cache $c_t^{(u)}$, which aims at capturing the global and long-term traits of user $u$; and 2) the context cache $c_c^{(u)}$, which is introduced to capture the short-term traits from the recent source inputs of user $u$. During this process, we focus on the following three operations on the caches:
Cache Representation In order to facilitate the efficient computation of the user behavior encoded by our caches, we define each cache as an embedding sequence of keywords. We first calculate the TF-IDF values of input words, and then extract the words with TF-IDF weights higher than a predefined threshold to represent user behavior.
Note that the TF-IDF value of a word mainly depends on its frequency in the document and its inverse document frequency in the corpus. Since the two caches play different roles in the user-driven NMT model, we identify keywords for them based on different definitions of "document" and "corpus". Specifically, when constructing the topic cache $c_t^{(u)}$, we treat the historical inputs $H^{(u)}$ of user $u$ as the "document" and the historical inputs of all users $U$ as the "corpus", and then define $c_t^{(u)}$ as an embedding sequence of historical keywords. Unlike the topic cache, for the context cache $c_c^{(u)}$, we treat the current source sentence $X^{(u)}$ as the TF-IDF "document" and the historical inputs $H^{(u)}$ as the "corpus", defining $c_c^{(u)}$ as an embedding sequence of current keywords.
Besides, in the real-world MT scenario, there exist a large number of users without any historical input. For these users, we find the most similar user according to the cosine similarity between the TF-IDF bag-of-words representations of their topic keywords, and initialize the corresponding topic cache with that of the most similar user.
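The keyword extraction and the similar-user lookup described above can be sketched with the standard library alone. The function names, the smoothed IDF variant, and the use of raw keyword counts as the bag-of-words vector are assumptions of this sketch, not details confirmed by the text; embeddings are left out, so a cache is shown as a plain keyword list.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_keywords(
    document: List[str], corpus: List[List[str]], threshold: float
) -> List[str]:
    """Return the words of `document` whose TF-IDF weight exceeds `threshold`."""
    tf = Counter(document)
    n_docs = len(corpus)
    keywords = []
    for word, count in tf.items():
        df = sum(1 for doc in corpus if word in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0  # smoothed IDF (assumed variant)
        if (count / len(document)) * idf > threshold:
            keywords.append(word)
    return keywords


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-keywords vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar_user(query: List[str], keywords_by_user: Dict[str, List[str]]) -> str:
    """Pick the user whose topic keywords are closest to the query keywords."""
    q = Counter(query)
    return max(keywords_by_user, key=lambda u: cosine(q, Counter(keywords_by_user[u])))


histories = {
    "seller": ["dress", "shirt", "dress", "cotton"],
    "biologist": ["gene", "chip", "gene", "mutant"],
}
topic_keywords = tfidf_keywords(histories["seller"], list(histories.values()), 0.5)
```

For the topic cache, `document` is one user's history and `corpus` is the list of all users' histories; for the context cache, the same function would be called with the current sentence as `document` and that user's historical sentences as `corpus`.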
Updating Caches When using an online MT system, users often continuously input multiple sentences. Thus, our caches should be dynamically updated to ensure the accurate encoding of user behavior.
To update the topic cache, we first recalculate the TF-IDF values of all historical input words, so as to redetermine the keywords stored in this cache. As for the context cache, we consider it as a window sliding across the historical inputs, and apply a first-in-first-out rule to replace its earliest keywords with the most recently input ones.
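The two update rules can be sketched as follows. The class and function names are hypothetical, and `extract_keywords` stands for any ranked keyword extractor (for example a TF-IDF scorer) supplied by the caller; how the topic cache is truncated to its size is an assumption of this sketch.

```python
from collections import deque
from typing import Callable, Deque, List


class ContextCache:
    """Fixed-size keyword window with first-in-first-out replacement."""

    def __init__(self, size: int) -> None:
        self.keywords: Deque[str] = deque(maxlen=size)

    def update(self, new_keywords: List[str]) -> None:
        # A deque with maxlen drops the oldest items when new ones are appended.
        self.keywords.extend(new_keywords)


def update_topic_cache(
    history: List[List[str]],
    new_sentence: List[str],
    extract_keywords: Callable[[List[str]], List[str]],
    size: int,
) -> List[str]:
    """Append the new input, then re-extract keywords over the whole history."""
    history.append(new_sentence)
    all_words = [word for sentence in history for word in sentence]
    return extract_keywords(all_words)[:size]  # assumption: keep the top `size` keywords


cache = ContextCache(size=3)
cache.update(["a", "b"])
cache.update(["c", "d"])
```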
Reading from Caches During translation, we perform a gating operation on $c_t^{(u)}$ and $c_c^{(u)}$ to produce a vector $r^{(u)}$ that reflects user behavior, where both $W_t$ and $W_r$ are learnable parameter matrices of the gate. Then, we directly add $r^{(u)}$ to the embedding sequence of the current source sentence $X^{(u)}$, forming a source embedding sequence $\tilde{X}^{(u)}$ that carries user behavior. Finally, $\tilde{X}^{(u)}$ is fed into the NMT model to generate the translation for $u$. Due to space limitations, we omit the detailed description of the NMT model; please refer to Vaswani et al. (2017) for details.
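One common way to realize a gate of this kind is sketched below. It is an illustration under stated assumptions rather than the model's actual equations: mean pooling of each cache, an element-wise sigmoid gate, and the names `read_caches`, `w_topic`, and `w_context` are all choices made here, and plain Python lists stand in for tensors.

```python
import math
from typing import List

Vector = List[float]


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average a sequence of keyword embeddings into one vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def matvec(matrix: List[Vector], vector: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def read_caches(
    topic_cache: List[Vector],
    context_cache: List[Vector],
    w_topic: List[Vector],    # hypothetical gate parameters
    w_context: List[Vector],  # hypothetical gate parameters
) -> Vector:
    """Blend the pooled topic and context vectors with a sigmoid gate (assumed form)."""
    t = mean_pool(topic_cache)
    c = mean_pool(context_cache)
    logits = [a + b for a, b in zip(matvec(w_topic, t), matvec(w_context, c))]
    gate = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    return [g * ti + (1.0 - g) * ci for g, ti, ci in zip(gate, t, c)]


def add_user_vector(source_embeddings: List[Vector], r: Vector) -> List[Vector]:
    """Add the user behavior vector to every source token embedding."""
    return [[x + ri for x, ri in zip(token, r)] for token in source_embeddings]


zeros = [[0.0, 0.0], [0.0, 0.0]]
r_u = read_caches([[1.0, 0.0], [3.0, 0.0]], [[0.0, 2.0]], zeros, zeros)
```

With all-zero gate parameters the gate is 0.5 in every dimension, so the result is simply the average of the two pooled cache vectors.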

Model Training with a Contrastive Loss
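Given training instances $\langle X^{(u)}, Y^{(u)}, H^{(u)} \rangle$, we train the user-driven NMT model with an objective function that combines two losses, $\mathcal{L}_{mle}$ and $\mathcal{L}_{cl}$. Here, $\mathcal{L}_{mle}$ is the maximum likelihood translation loss extended from the conventional NMT training objective. Formally, it is defined as:
$$\mathcal{L}_{mle} = -\sum_{i=1}^{|Y^{(u)}|} \log P(y_i^{(u)} \mid X^{(u)}, Y^{(u)}_{<i}, H^{(u)}; \theta). \quad (7)$$
$\mathcal{L}_{cl}$ is a triplet-margin-based contrastive loss, which allows the NMT model to learn the translation diversity across users.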
Specifically, for an input sentence, an ideal user-driven NMT model should generate translations with consistent user traits for similar users, while producing translations with diverse user traits for dissimilar users. However, $\mathcal{L}_{mle}$ alone cannot guarantee this, since it considers each training instance separately during model training. To deal with this issue, for each training instance $\langle X^{(u)}, Y^{(u)}, H^{(u)} \rangle$, we first determine the most similar user $u^+$ according to the cosine similarity between bag-of-keyword representations, and randomly select a user who shares no keyword with $u$ as the dissimilar user $u^-$. Then, using the historical inputs of $u^+$ and $u^-$, we construct the pseudo training instances $\langle X^{(u)}, Y^{(u)}, H^{(u^+)} \rangle$ and $\langle X^{(u)}, Y^{(u)}, H^{(u^-)} \rangle$, and define $\mathcal{L}_{cl}$ over the prediction differences $d^{(u^+)}(\cdot)$ and $d^{(u^-)}(\cdot)$ between the original instance and each pseudo instance, with a predefined threshold $\eta$ that is set to 2 in our experiments. Formally, $\mathcal{L}_{cl}$ encourages the NMT model to minimize the prediction difference between the training instances $\langle X^{(u)}, Y^{(u)}, H^{(u)} \rangle$ and $\langle X^{(u)}, Y^{(u)}, H^{(u^+)} \rangle$, and to maximize the difference between the training instances $\langle X^{(u)}, Y^{(u)}, H^{(u)} \rangle$ and $\langle X^{(u)}, Y^{(u)}, H^{(u^-)} \rangle$. In this way, the NMT model can not only exploit pseudo training instances, but also produce translations that are more consistent with user traits.
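The selection of a dissimilar user and the margin term can be illustrated with the standard triplet margin form, max(d+ - d- + eta, 0). This is a sketch only: the two distances are taken as given numbers because the paper's own distance definition is not reproduced here, and the function names are hypothetical. The default margin of 2 follows the threshold stated in the text.

```python
import random
from typing import Dict, Optional, Set


def pick_dissimilar_user(
    user: str, keywords_by_user: Dict[str, Set[str]], rng: random.Random
) -> Optional[str]:
    """Randomly choose a user who shares no keyword with `user`."""
    own = keywords_by_user[user]
    candidates = [
        other
        for other, keywords in keywords_by_user.items()
        if other != user and not (keywords & own)
    ]
    return rng.choice(candidates) if candidates else None


def triplet_margin_loss(d_pos: float, d_neg: float, margin: float = 2.0) -> float:
    """Standard triplet margin: zero once d_neg exceeds d_pos by at least `margin`."""
    return max(d_pos - d_neg + margin, 0.0)


users = {"a": {"dress"}, "b": {"dress", "gene"}, "c": {"gene"}}
negative = pick_dissimilar_user("a", users, random.Random(0))
```

Here `d_pos` stands for the prediction difference with respect to the similar user and `d_neg` for the difference with respect to the dissimilar user, so the loss pushes the former down and the latter up until they are separated by the margin.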

Experiments
In this section, we carry out several groups of experiments to investigate the effectiveness of our proposed framework on UDT-Corpus.

Setup
We develop the user-driven NMT model based on the OpenNMT Transformer (Klein et al., 2017), and adopt a two-stage strategy to train this model: we first pre-train a Transformer-based NMT model on the WMT2017 Chinese-to-English dataset, and then fine-tune this model into our user-driven NMT model using UDT-Corpus.
Datasets The WMT2017 Chinese-to-English dataset is composed of the News Commentary v12, UN Parallel Corpus v1.0, and CWMT corpora, with 25M parallel sentences in total. To fine-tune our model, we split UDT-Corpus into training, validation and test sets. Table 1 provides more detailed statistics of these datasets. To improve the efficiency of model training, we train the model using only parallel sentences with no more than 100 words. Following common practice, we apply byte pair encoding (Sennrich et al., 2016b) with 32K merge operations to all sentences.
Training Details Following Vaswani et al. (2017), we use the following hyper-parameters: the word embedding dimension is set to 512, the hidden layer dimension is 2048, the numbers of encoder and decoder layers are both set to 6, and the number of attention heads is set to 8. Besides, we use 4 GPUs for training. At the pre-training stage, we employ the Adam optimizer with $\beta_2 = 0.998$. We use a batch size of 16,384 tokens and pre-train the model for 200,000 steps. In addition, we adopt dropout (Srivastava et al., 2014) with a rate of 0.1 to enhance the robustness of our model. When fine-tuning the model, we keep the other settings consistent with the pre-training stage, but reduce the batch size to 2,048 tokens and fine-tune the model with an early-stopping strategy.

Baselines
We refer to our user-driven NMT model as UD-NMT and compare it with the following baselines:
• TF. A Transformer-based NMT model pre-trained on the WMT2017 corpus. This model yields a BLEU score of 24.61 on the WMT2017 Chinese-to-English translation task, which is comparable with previously reported results and makes our subsequent experiments convincing.
• TF-FT. This model is also a Transformer-based NMT model, further fine-tuned on the parallel sentences of UDT-Corpus.
• TF-FT + PesuData. This model is a variant of TF-FT. When constructing it, we pair historical inputs with their translations produced by our online translation system, forming additional data for fine-tuning TF-FT.
• TF-FT + ConcHist (Tiedemann and Scherrer, 2017). In this model, we introduce user behavior into TF-FT by concatenating each input sentence with several historical inputs. We mark all tokens in the historical inputs with a special prefix to indicate that they are additional information.
• TF-FT + UserBias (Michel and Neubig, 2018). It introduces user-specific biases to refine the softmax-based predictions of the Transformer NMT model. Since the method of Michel and Neubig (2018) cannot be directly applied to our scenario, we change it to a zero-shot method similar to Farajian et al. (2017). In particular, we replace the user ID in the test set with that of the most similar user in the training set.
Note that the first two baselines, i.e., TF and TF-FT, are conventional NMT models that do not exploit user behavior.
Table 2: Main results on UDT-Corpus. "w/o" and "w/" denote "without" and "with", respectively.
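The input construction of the ConcHist baseline can be sketched as follows. The prefix string `<hist>`, the number of historical sentences kept, and the choice to place the history before the current sentence are all assumptions of this illustration; the text only states that historical tokens carry a special prefix.

```python
from typing import List


def concat_history(
    source: List[str],
    history: List[List[str]],
    prefix: str = "<hist>",  # hypothetical marker for historical tokens
    max_sentences: int = 2,
) -> List[str]:
    """Prepend the most recent historical sentences, marking each of their tokens."""
    recent = history[-max_sentences:] if max_sentences > 0 else []
    marked = [prefix + token for sentence in recent for token in sentence]
    return marked + source


model_input = concat_history(
    ["wear", "resistant"], [["a"], ["b", "c"], ["d"]], max_sentences=2
)
```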

Effect of Cache Sizes
Since cache size directly determines the utility of user behavior, we investigate its effect on the performance of UD-NMT. We denote the sizes of the topic cache and the context cache as $s_t$ and $s_c$ for simplicity. Figure 3 shows the performance of our model with different $s_t$ and $s_c$ on the validation set. We observe that $s_t$ larger than 25 and $s_c$ larger than 35 do not lead to significant improvements. We speculate that small cache sizes are unable to capture sufficient user behavior for NMT, whereas larger cache sizes bring only limited information gain because the number of keywords is limited. Therefore, we directly use $s_t = 25$ and $s_c = 35$ in the subsequent experiments.

Main Results
From Table 2, we observe that our UD-NMT model consistently outperforms all baselines in terms of the two metrics. Moreover, we draw several interesting conclusions:
1) All NMT models leveraging user behavior surpass the vanilla models, i.e., TF and TF-FT, showing that user behavior is useful for NMT.
2) UD-NMT performs better than TF-FT + PesuData, which uses the same training data as ours. The underlying reason is that UD-NMT can leverage user traits to generate better translations.
3) Although both TF-FT + UserBias and UD-NMT exploit user behavior for NMT, UD-NMT achieves better performance than TF-FT + UserBias without introducing extra parameters. This result demonstrates the advantage of caches over user-specific biases in the model parameters for modeling user behavior.
Table 3: Ablation study. ↑: higher is better, ↓: lower is better. Since user similarity is calculated based on the topic keywords, the model cannot find similar and dissimilar users without them. Thus "w/o topic cache" has no s-BLEU, s-Sim., d-BLEU, or d-Sim. scores. ‡/†: the drop in translation quality is statistically significant compared with "UD-NMT" (p<0.01/0.05).

Ablation Study
To explore the effectiveness of different components in our model, we further compare UD-NMT with its several variants, as shown in Table 3.
Particularly, we propose to evaluate translations using the following variant metrics: s-BLEU, s-Sim., d-BLEU and d-Sim. For s-BLEU, we replace the topic cache of the current user with that of the most similar user. Keeping the same current input, we calculate the BLEU score with the ground truth as reference and the translation for this similar user as hypothesis. For s-Sim., we adopt the same strategy as for s-BLEU, but use the translation for the original user as reference to compute the BLEU score. d-BLEU and d-Sim. are computed in the same way, using the topic cache of a dissimilar user instead. In other words, s-BLEU and d-BLEU assess the translation quality given an unsuitable user, so higher s-BLEU and d-BLEU indicate better model robustness, while s-Sim. and d-Sim. measure how much the translation changes given a different user, so lower s-Sim. and d-Sim. indicate larger translation diversity.
Our conclusions are shown as follows: 1) w/o topic cache. To build this variant, we remove topic cache from our model. The result in Line 2 indicates that removing topic cache leads to a performance drop, suggesting that topic cache is useful for modeling user behavior.
2) w/o context cache. Unlike the above variant, we only use topic cache to represent user traits in this variant. According to the results shown in Line 3, we observe that this change results in a significant performance decline of our model, demonstrating that context cache also effectively captures user behavior for NMT. However, the translation diversity among users increases since the model will not be affected by the context cache in this variant, which is the same between different users when calculating s-Sim. and d-Sim..
3) w/o similar user initialization. In this variant, we do not initialize the topic caches of users without historical inputs using those of the most similar users. From Line 4, we observe that the performance of our model degrades without similar user initialization.
4) w/o contrastive learning. In this variant, we remove contrastive learning from the whole training objective to inspect the performance change of our model. As shown in Line 5, the performance of our model drops, showing that contrastive learning is important for the training of our model.
Moreover, we can infer from Columns 6 and 7 that our model can generate diverse translations. Specifically, the translations for dissimilar users have larger diversity than those for similar ones. Furthermore, we conclude that our model is robust, since it still performs well when we replace the topic cache of the current user with those of other users (see Columns 4 and 5).

Analysis of Contrastive Margin
We argue that contrastive learning may increase the prediction diversity of our model between users compared with using only the MLE loss. To confirm this, we randomly sample 300 examples from the training dataset and compute a margin based on $d^{(u^+)}(\cdot)$, which is defined in Equation 9. The definition of $d^{(u^+)}_{mle}(\cdot)$ is the same as that of $d^{(u^+)}(\cdot)$; the only difference is that the NMT model is trained with the conventional MLE loss only. We find that $d(\cdot)$ has a larger margin than $d_{mle}(\cdot)$ on 88% of the sampled sentence pairs, with an average margin of 0.19. These results again indicate that contrastive learning increases translation diversity.

Figure 4 (a), example translations of the same source sentence:
Translation: Gene chip analysis found that in the CRF mutant , many genes regulated by b arrs are also negatively fed by CRFs toning .
Translation: Gene chip analysis found that in the CRF mutant , many genes regulated by type b arrs are also subject to negative feedback adjustment by CRFs modulating agents .

Qualitative Analysis
In order to intuitively understand how our cache module affects the translations, we feed our model with the same current source sentence but different users, and display the 1-best translations generated by our model. As shown in Figure 4 (a), our model is able to produce correct but diverse translations according to different topic caches. Moreover, it is interesting to observe that specific topic keywords such as "type b arr", "negatively regulated" and "modulators" are translated into synonymous but "out-of-domain" phrases if the topic cache does not conform to the input sentence. By contrast, the model generates an "in-domain" translation if the topic cache comes from the same topic as the input sentence.

Besides, to further reveal the effect of user behavior, we provide an example in Figure 4 (b), which lists the translations produced by the compared models for the same input. The historical inputs indicate that this user may be an apparel seller, since they contain product titles and descriptions of clothing. Thus, the keywords "Wear Resistant" in the source sentence are correlated with this user. However, two baselines translate them as "Waterproof" and "Resistant", respectively. Moreover, TF-FT + UserBias generates a subject-verb-object sentence by adding the auxiliary verb "is", which does not conform to the expression habits of product titles. By contrast, with the hint of the keywords in the historical inputs, our UD-NMT is able to produce a suitable translation consistent with the topic preference of this user.

Table 4:
Correlation Order              Proportion
UD-NMT > TF-FT + PesuData      86%
UD-NMT > TF-FT + UserBias      74%

Manual Evaluation
To further find out whether the improvements of our model come from user traits, we randomly sample 100 examples from the test dataset and ask linguistic experts to rank the systems according to the relevance between the generated translations and the historical inputs. The results in Table 4 show that our model generates translations more in line with the historical inputs than the baseline models in most cases, indicating that our method makes better use of user traits.

Conclusion
We propose the user-driven NMT task, which aims to leverage user behavior to generate personalized translations. With the help of a cache module and contrastive learning, we build an end-to-end NMT model that is able to capture potential user traits from historical inputs and generate diverse translations in a zero-shot learning fashion. Furthermore, we contribute UDT-Corpus, which is the first Chinese-English parallel corpus annotated with user behavior. We expect our study to attract more attention to this topic. A promising direction is to explore other kinds of behavior in the future, such as click-through and editing operations. Moreover, following recent advancements in domain adaptation for NMT, we plan to further improve our model via adversarial-training-based knowledge transfer (Zeng et al., 2018; Yao et al., 2020; Su et al., 2021) and dual knowledge transfer (Zeng et al., 2019).