Modeling Bilingual Conversational Characteristics for Neural Chat Translation

Neural chat translation aims to translate bilingual conversational text, which has broad applications in international exchange and cooperation. Despite the impressive performance of sentence-level and context-aware Neural Machine Translation (NMT), translating bilingual conversational text remains challenging due to its inherent characteristics, such as role preference, dialogue coherence, and translation consistency. In this paper, we aim to promote the translation quality of conversational text by modeling the above properties. Specifically, we design three latent variational modules to learn the distributions of bilingual conversational characteristics. By sampling from these learned distributions, the latent variables, tailored for role preference, dialogue coherence, and translation consistency, are incorporated into the NMT model for better translation. We evaluate our approach on the benchmark dataset BConTrasT (English⇔German) and a self-collected bilingual dialogue corpus, named BMELD (English⇔Chinese). Extensive experiments show that our approach notably boosts performance over strong baselines and significantly surpasses some state-of-the-art context-aware NMT models in terms of BLEU and TER. Additionally, we make the BMELD dataset publicly available for the research community.


Introduction
A conversation may involve participants who speak different languages (e.g., one speaking English and another Chinese). Fig. 1 shows an example, where the English role R_1 and the Chinese role R_2 are talking about a "boat". The goal of chat translation is to translate bilingual conversational text, i.e., converting one participant's language (e.g., English) to another's (e.g., Chinese) and vice versa (Farajian et al., 2020). It enables multiple speakers to communicate with each other in their native languages, and has wide application in industry-level services.

Figure 1: An ongoing bilingual conversation example (English⇔Chinese), where the Chinese utterances are presented in pinyin style. R_i: Role i. The dashed arrows mark the translation direction. The green and red arrows represent the monolingual and bilingual conversation flow, respectively. Although the translation of Y_5 produced by "S-NMT" (a context-free sentence-level NMT system) is reasonable at the sentence level, the coherence of the entire dialogue translation is poor.
For a conversation, its dialogue history contains rich role preference information such as emotion, style, and humor, which is beneficial to role-relevant utterance generation (Wu et al., 2020). As shown in Fig. 1, the utterances X_1, X_3, and X_5 from role R_1 always carry strong emotion (i.e., joy) because of his/her preference, and preserving the same preference information across languages can help raise emotional resonance and mutual understanding (Moghe et al., 2020). Meanwhile, there exists semantic coherence in the conversation, as indicated by the solid green arrow in Fig. 1, where the utterance X_5 naturally and semantically connects with the dialogue history (X_{1~4}) on the topic "boat". In addition, the bilingual conversation exhibits translation consistency, where the correct lexical choice for translating the current utterance might have appeared in preceding turns. For instance, the word "sail" in X_1 is translated into "jiàchuán", and thus the word "sailing" in X_3 should also be mapped to "jiàchuán" rather than other words (e.g., "hángxíng", which expresses a similar meaning) to maintain translation consistency. On the contrary, if we ignore these characteristics, translations may be role-irrelevant, incoherent, inconsistent, and detrimental to further communication, like the translation produced by "S-NMT" in Fig. 1: although it is acceptable at the sentence level, it is abrupt at the bilingual conversation level.
Apparently, how to effectively exploit these bilingual conversational characteristics is one of the core issues in chat translation. It is challenging to implicitly capture these properties by simply feeding the complex dialogue history into encoders, due to the lack of relevant information guidance (Farajian et al., 2020). On the other hand, the Conditional Variational Auto-Encoder (CVAE) (Sohn et al., 2015) has shown its superiority in learning distributions of data properties, and is often utilized to model diversity (Zhao et al., 2017), coherence (Wang and Wan, 2019), and users' personalities (Bak and Oh, 2019), etc. In spite of its success, adapting it to chat translation is non-trivial, especially when multiple tailored latent variables are involved. Therefore, in this paper, we propose a model, named CPCC, to capture role preference, dialogue coherence, and translation consistency with latent variables learned by the CVAE for neural chat translation. CPCC contains three specific latent variational modules to learn the distributions of role preference, dialogue coherence, and translation consistency, respectively. Specifically, we first use a role-tailored latent variable, sampled from the learned distribution conditioned only on the utterances of this role, to preserve preference. Then, we utilize another latent variable, generated by the distribution conditioned on the source-language dialogue history, to maintain coherence. Finally, we leverage a third latent variable, generated by the distribution conditioned on the paired bilingual conversational utterances, to keep translation consistency. As a result, these tailored latent variables allow our CPCC to produce role-specific, coherent, and consistent translations, and hence make the bilingual conversation go fluently.
We conduct experiments on the WMT20 Chat Translation dataset BConTrasT (En⇔De) (Farajian et al., 2020) and a self-collected dialogue corpus BMELD (En⇔Ch). Results demonstrate that our model achieves consistent improvements in all four directions in terms of BLEU (Papineni et al., 2002) and TER (Snover et al., 2006), showing its effectiveness and generalizability. Human evaluation further suggests that our model effectively alleviates the issue of role-irrelevant, incoherent, and inconsistent translations compared to other methods. Our contributions are summarized as follows:

• To the best of our knowledge, we are the first to incorporate role preference, dialogue coherence, and translation consistency into neural chat translation.

• We are the first to build a bridge between dialogue and machine translation via the conditional variational auto-encoder, which effectively models three inherent characteristics of bilingual conversation for neural chat translation.

• Our approach gains consistent and significant improvements over the standard context-aware baseline and remarkably outperforms some state-of-the-art context-aware NMT models.

• We contribute a new bilingual dialogue corpus (BMELD, En⇔Ch) with manual translations, along with our code, to the research community.


Background

Sentence-Level NMT
Given an input sentence X = {x_i}_{i=1}^{M} with M tokens, the model is asked to produce its translation Y = {y_i}_{i=1}^{N} with N tokens. The conditional distribution of the NMT model is:

p(Y | X) = ∏_{t=1}^{N} p_θ(y_t | X, y_{1:t−1}),

where θ are the model parameters and y_{1:t−1} is the partial translation.


Context-Aware NMT

Context-aware NMT (Ma et al., 2020) is formalized as:

p(Y_i | X_i, X_{<i}, Y_{<i}) = ∏_{t=1}^{N} p_θ(y_t | X_i, X_{<i}, Y_{<i}, y_{1:t−1}),

where X_{<i} and Y_{<i} are the preceding context.
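The chain-rule factorizations above can be sketched in a few lines of Python; `p_token` is a hypothetical callable standing in for the model's per-step distribution p(y_t | X, y_{1:t−1}):

```python
import math

def sequence_log_prob(p_token, X, Y):
    """Chain-rule factorization: log p(Y|X) = sum_t log p(y_t | X, y_{1:t-1})."""
    total = 0.0
    for t in range(len(Y)):
        # p_token(source, target prefix, next token) -> probability
        total += math.log(p_token(X, Y[:t], Y[t]))
    return total

# Toy check with a uniform distribution over a 4-token vocabulary:
uniform = lambda X, prefix, y: 0.25
```

With the toy `uniform` model, a 3-token target scores 3·log(0.25), i.e., each token contributes one factor of the product in log space.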

Variational NMT
The variational NMT model is a combination of the CVAE (Sohn et al., 2015) and NMT. It introduces a random latent variable z into the NMT conditional distribution:

p(Y | X) = ∫_z p_θ(Y | X, z) p_θ(z | X) dz.   (1)

Given a source sentence X, a latent variable z is first sampled by the prior network from the encoder states, and the target sentence is then generated by the decoder: Y ∼ p_θ(Y | X, z), where z ∼ p_θ(z | X).

As it is hard to marginalize Eq. 1, the CVAE training objective is instead a variational lower bound of the conditional log-likelihood:

L(θ, φ; X, Y) = −KL(q_φ(z | X, Y) || p_θ(z | X)) + E_{q_φ(z|X,Y)}[log p_θ(Y | X, z)] ≤ log p(Y | X),

where φ are the parameters of the posterior network and KL(·||·) denotes the Kullback-Leibler divergence between the distributions produced by the prior and posterior networks (Sohn et al., 2015; Kingma and Welling, 2013).
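For diagonal Gaussians, as produced by the prior and posterior networks here, the KL term has a closed form. A minimal sketch (per-dimension closed form, summed over dimensions):

```python
import math

def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, diag(sigma_q^2)) || N(mu_p, diag(sigma_p^2)) ), summed over dims."""
    kl = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        # Closed form for two univariate Gaussians.
        kl += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2 * sp ** 2) - 0.5
    return kl
```

The KL between identical distributions is zero, and it grows as the posterior mean drifts away from the prior mean, which is exactly the regularization behavior the lower bound exploits.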

Chat NMT
We aim to learn a model that can capture the inherent characteristics of the bilingual dialogue history for producing high-quality translations, i.e., using the context for better translations (Farajian et al., 2020). Following previous work, we define paired bilingual utterances (X_i, Y_i) as a turn, where we translate the current utterance X_{2k+1} at the (2k+1)-th turn, with k ∈ [0, (|T|−1)/2] and T the total number of turns (assumed to be odd here); Fig. 2 shows a dialogue example (En⇔Ch) of this setup. Here, we denote the utterance X_{2k+1} as X_u and its translation Y_{2k+1} as Y_u for simplicity, where X_u = {x_i}_{i=1}^{m} has m tokens and Y_u = {y_i}_{i=1}^{n} has n tokens. Formally, the conditional distribution for the current utterance is

p(Y_u | X_u, C) = ∏_{t=1}^{n} p(y_t | X_u, C, y_{1:t−1}),

where C is the bilingual dialogue history.
Before digging into the details of how to utilize C, we define three types of context in C (as shown in Fig. 2): (1) the set of previous role-specific source-language turns, denoted as C^role_X = {X_1, X_3, X_5, ..., X_{2k+1}}, where k ∈ [0, (|T|−3)/2] and T is the total number of turns (analogously, C^role_Y = {Y_2, Y_4, Y_6, ..., Y_{2k}} contains the role-specific utterances of the interlocutor, which is used to model the interlocutor's consistency in the reverse translation direction; here we take one translation direction, i.e., En⇒Ch, as an example); (2) the set of previous source-language turns, denoted as C_X = {X_1, X_2, X_3, ..., X_{2k}}; and (3) the set of previous target-language turns, denoted as C_Y = {Y_1, Y_2, Y_3, ..., Y_{2k}}.


Our Methodology

Fig. 3 gives an overview of our model, which consists of five components: input representation, encoder, latent variational modules, decoder, and training objectives. We aim to model the dialogue and the translation simultaneously. Therefore, for the input representation (§ 4.1), we incorporate dialogue-level embeddings, i.e., role and dialogue turn embeddings, into the encoder (§ 4.2). Then, we introduce three specific latent variational modules (§ 4.3) to learn the distributions of the inherent bilingual characteristics. Finally, we elaborate on how the three tailored latent variables sampled from these distributions are incorporated into the decoder (§ 4.4) and describe our two-stage training objectives (§ 4.5).

Figure 3: Overview of our CPCC. The latent variables z_role, z_dia, and z_tra are tailored for maintaining role preference, dialogue coherence, and translation consistency, respectively. The solid grey lines indicate the training process, where {z_role, z_dia, z_tra} are drawn from the posterior distributions predicted by the recognition networks. The dashed red lines indicate the inference process, where {z_role, z_dia, z_tra} are drawn from the prior distributions predicted by the prior networks. The first Transformer layer is shared by all inputs.

Input Representation
The CPCC takes three types of inputs: the source input X_u, the target input Y_u, and the context inputs {C^role_X, C_X, C_Y}. Apart from the conventional word embeddings WE and position embeddings PE (Vaswani et al., 2017), we also introduce role embeddings RE and dialogue turn embeddings TE to identify different utterances. Specifically, for X_u, we first project it into these embeddings. Then, we sum them into a single input representation for each token x_i:

h_0[i] = WE[x_i] + PE[i] + RE[r_i] + TE[t_i],   (2)

where 1 ≤ i ≤ m, r_i and t_i are the role and turn indices of token x_i, and WE ∈ R^{|V|×d}, RE ∈ R^{|R|×d}, and TE ∈ R^{|T|×d}. |V|, |R|, |T|, and d denote the size of the shared vocabulary, the number of roles, the maximum number of dialogue turns, and the hidden size, respectively. The resulting h_0 ∈ R^{m×d}; Y_u is handled similarly. For each of {C^role_X, C_X, C_Y}, we prepend a '[cls]' tag and use a '[sep]' tag to separate its utterances (Devlin et al., 2019), and then obtain its embeddings via Eq. 2.
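Eq. 2's four-way embedding sum can be sketched as follows; the embedding tables here are hypothetical toy lookups, not the paper's trained parameters:

```python
def input_representation(tokens, role_id, turn_id, WE, PE, RE, TE):
    """h_0[i] = WE[x_i] + PE[i] + RE[role] + TE[turn], per Eq. 2 (element-wise sum)."""
    h0 = []
    for i, tok in enumerate(tokens):
        vec = [w + p + r + t
               for w, p, r, t in zip(WE[tok], PE[i], RE[role_id], TE[turn_id])]
        h0.append(vec)
    return h0

# Toy 2-dimensional example: one token, role 0, turn 0.
WE = {"hi": [1.0, 0.0]}
PE = [[0.1, 0.1]]
RE = [[0.0, 1.0]]
TE = [[0.5, 0.5]]
```

With these toy tables, the single-token input maps to [1.0+0.1+0.0+0.5, 0.0+0.1+1.0+0.5], i.e., all four embedding types contribute to every position.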

Encoder
The Transformer encoder consists of N_e stacked layers, each containing two sub-layers: a multi-head self-attention (SelfAtt) sub-layer and a position-wise feed-forward network (FFN) sub-layer (Vaswani et al., 2017) (we omit the layer normalization for simplicity; see Vaswani et al. (2017) for details):

s_e^ℓ = SelfAtt(h_e^{ℓ−1}) + h_e^{ℓ−1},
h_e^ℓ = FFN(s_e^ℓ) + s_e^ℓ,  {h_e^ℓ, s_e^ℓ} ∈ R^{m×d},
where h_e^ℓ denotes the state of the ℓ-th encoder layer and h_e^0 is the initialized feature h_0. We prepare the representations of X_u and {C^role_X, C_X, C_Y} for training the prior and recognition networks. For X_u, we apply mean-pooling with a mask operation over the output h_e^{N_e,X} of the N_e-th encoder layer, i.e.,

h_X = Σ_i (M_X[i] · h_e^{N_e,X}[i]) / Σ_i M_X[i],

where M_X ∈ R^m denotes the mask vector, whose entries are 1 or 0, indicating whether the corresponding token is padded. For C^role_X, as shown in Fig. 3, we follow Ma et al. (2020) and share the first encoder layer to obtain the context representation; we take the hidden state of '[cls]' as its representation, denoted as h^ctx_role ∈ R^d. Similarly, we obtain the representations of C_X and C_Y, denoted as h^ctx_X ∈ R^d and h^ctx_Y ∈ R^d, respectively. For training the recognition networks, we additionally obtain the representation h_Y of the target utterance Y_u in the same mean-pooling manner.
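The masked mean-pooling step can be sketched with plain lists; the states and mask are toy inputs:

```python
def masked_mean_pool(states, mask):
    """Mean over unpadded positions only: h = sum_i M[i]*h[i] / sum_i M[i]."""
    d = len(states[0])
    total = [0.0] * d
    count = 0
    for vec, m in zip(states, mask):
        if m:  # mask entry 1 = real token, 0 = padding
            count += 1
            for j in range(d):
                total[j] += vec[j]
    return [v / count for v in total]
```

The padded position (mask 0) contributes nothing, so the pooled vector is the mean of the real tokens only.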

Latent Variational Modules
We design three tailored latent variational modules to learn the distributions of inherent bilingual conversational characteristics, i.e., role preference, dialogue coherence, and translation consistency.
Role Preference. To preserve the role preference when translating the role's current utterance, we encode only the previous utterances of this role and produce a role-tailored latent variable z_role ∈ R^{d_z}, where d_z is the latent size. Inspired by Wang and Wan (2019), we use an isotropic Gaussian as the prior distribution of z_role:

p_θ(z_role | X_u, C^role_X) ∼ N(µ_role, σ_role² I),

where I denotes the identity matrix and

µ_role = MLP_θ(h_X; h^ctx_role), σ_role = Softplus(MLP_θ(h_X; h^ctx_role)),

where MLP(·) and Softplus(·) are a multi-layer perceptron and a smooth approximation of the ReLU function, respectively, and (·;·) indicates the concatenation operation.
At training time, the posterior distribution conditions on both the role-specific utterances and the current translation, which contain rich role preference information:

q_φ(z_role | X_u, Y_u, C^role_X) ∼ N(µ'_role, σ'_role² I),
µ'_role = MLP_φ(h_X; h_Y; h^ctx_role), σ'_role = Softplus(MLP_φ(h_X; h_Y; h^ctx_role)).

The prior network can thus learn a role-tailored distribution by approaching the posterior network via the KL divergence (Sohn et al., 2015).

Dialogue Coherence. To maintain coherence in chat translation, we encode the entire source-language dialogue history and generate a latent variable z_dia ∈ R^{d_z}. Similar to z_role, we define its prior distribution as:

p_θ(z_dia | X_u, C_X) ∼ N(µ_dia, σ_dia² I),
µ_dia = MLP_θ(h_X; h^ctx_X), σ_dia = Softplus(MLP_θ(h_X; h^ctx_X)).

At training time, the posterior distribution conditions on both the entire source-language utterances and the translation, which provide a dialogue-level coherence clue, and guides the learning of the prior distribution. Specifically, we define the posterior distribution as:

q_φ(z_dia | X_u, Y_u, C_X) ∼ N(µ'_dia, σ'_dia² I),
µ'_dia = MLP_φ(h_X; h_Y; h^ctx_X), σ'_dia = Softplus(MLP_φ(h_X; h_Y; h^ctx_X)).

Translation Consistency. To keep the lexical choice of the translation consistent with those of the preceding utterances, we encode the paired source-target utterances and sample a latent variable z_tra ∈ R^{d_z}. We define its prior distribution as:

p_θ(z_tra | X_u, C_X, C_Y) ∼ N(µ_tra, σ_tra² I),
µ_tra = MLP_θ(h_X; h^ctx_X; h^ctx_Y), σ_tra = Softplus(MLP_θ(h_X; h^ctx_X; h^ctx_Y)).

At training time, the posterior distribution conditions on all paired bilingual dialogue utterances, which contain implicit alignment information, and guides the learning of the prior distribution. Specifically, we define the posterior distribution as:

q_φ(z_tra | X_u, Y_u, C_X, C_Y) ∼ N(µ'_tra, σ'_tra² I),
µ'_tra = MLP_φ(h_X; h_Y; h^ctx_X; h^ctx_Y), σ'_tra = Softplus(MLP_φ(h_X; h_Y; h^ctx_X; h^ctx_Y)).
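A minimal sketch of one such prior network, assuming a single linear layer per output for brevity (the paper's MLPs are deeper); `Softplus` keeps the standard deviation strictly positive, as required by the Gaussian parameterization:

```python
import math

def softplus(x):
    """Smooth approximation of ReLU; maps any real to a strictly positive value."""
    return math.log1p(math.exp(x))

def prior_network(h_x, h_ctx, W_mu, W_sigma):
    """Map the concatenation (h_X; h_ctx) to (mu, sigma) of a diagonal Gaussian.
    W_mu / W_sigma are hypothetical weight rows (one per latent dim, no bias)."""
    feats = h_x + h_ctx  # list concatenation = the (.;.) operation in the text
    mu = [sum(w * f for w, f in zip(row, feats)) for row in W_mu]
    sigma = [softplus(sum(w * f for w, f in zip(row, feats))) for row in W_sigma]
    return mu, sigma
```

The posterior (recognition) networks have the same shape but take the extra target representation h_Y in their concatenated input.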

Decoder
The decoder adopts a structure similar to the encoder; each of the N_d decoder layers contains an additional cross-attention sub-layer (CrossAtt):

s_d^ℓ = SelfAtt(h_d^{ℓ−1}) + h_d^{ℓ−1},
c_d^ℓ = CrossAtt(s_d^ℓ, h_e^{N_e}) + s_d^ℓ,
h_d^ℓ = FFN(c_d^ℓ) + c_d^ℓ,

where h_d^ℓ denotes the state of the ℓ-th decoder layer.
As shown in Fig. 3, we obtain the latent variables {z_role, z_dia, z_tra} either from the posterior distributions predicted by the recognition networks (training process, solid grey lines) or from the prior distributions predicted by the prior networks (inference process, dashed red lines). Finally, we incorporate {z_role, z_dia, z_tra} into the state of the top decoder layer with a projection layer:

o_t = W_p [h_{d,t}^{N_d}; z_role; z_dia; z_tra] + b_p,

where W_p ∈ R^{d×(d+3d_z)} and b_p ∈ R^d are training parameters and h_{d,t}^{N_d} is the hidden state at time-step t of the N_d-th decoder layer. Then, o_t is fed to a linear transformation and a softmax layer to predict the probability distribution of the next target token:

p(y_t | y_{1:t−1}, X_u, z_role, z_dia, z_tra) = Softmax(W_o o_t + b_o),

where W_o ∈ R^{|V|×d} and b_o ∈ R^{|V|} are training parameters.
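The latent-fusion projection (concatenating the decoder state with the three latent variables and projecting the d + 3·d_z vector back to size d) can be sketched with plain lists; the toy weights below are illustrative only:

```python
def fuse_latents(h_t, z_role, z_dia, z_tra, W_p, b_p):
    """o_t = W_p [h_t; z_role; z_dia; z_tra] + b_p, with W_p of shape d x (d + 3*d_z)."""
    x = h_t + z_role + z_dia + z_tra  # list concatenation
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(W_p, b_p)]
```

With d = d_z = 1 and an all-ones weight row, the output is simply the sum of the decoder state and the three latent samples, making the role of each concatenated slot easy to see.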

Training Objectives
We apply a two-stage training strategy (Ma et al., 2020). Firstly, we train our model on large-scale sentence-level NMT data to minimize the cross-entropy objective, i.e., to maximize:

L_sent(θ) = Σ_{t=1}^{N} log p_θ(y_t | X, y_{1:t−1}).
Secondly, we fine-tune it on the chat translation data to maximize the following objective:

L_chat(θ, φ) = E_{q_φ}[log p_θ(Y_u | X_u, C, z_role, z_dia, z_tra)]
  − KL(q_φ(z_role | X_u, Y_u, C^role_X) || p_θ(z_role | X_u, C^role_X))
  − KL(q_φ(z_dia | X_u, Y_u, C_X) || p_θ(z_dia | X_u, C_X))
  − KL(q_φ(z_tra | X_u, Y_u, C_X, C_Y) || p_θ(z_tra | X_u, C_X, C_Y)).

We use the reparameterization trick (Kingma and Welling, 2013) to estimate the gradients of the prior and recognition networks (Zhao et al., 2017).
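The reparameterization trick draws z = µ + σ ⊙ ε with ε ∼ N(0, I), so the stochastic sampling step is moved outside the learned parameters and gradients can flow through µ and σ. A minimal stdlib sketch:

```python
import random

def reparameterize(mu, sigma, rng=random):
    """z = mu + sigma * eps, with eps ~ N(0, I) drawn per dimension."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]
```

With sigma = 0 the sample collapses deterministically to the mean, which is also why a vanishing KL (sigma pinned, mean shared) signals latent-variable degeneration.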

Datasets and Metrics
Datasets. We apply a two-stage training strategy, i.e., first training on a large-scale sentence-level NMT corpus (WMT20) and then fine-tuning on chat translation corpora (BConTrasT (Farajian et al., 2020) and BMELD). The details (WMT20 data and results of the first stage) are given in Appendix A.
BConTrasT. This dataset was first provided by the WMT 2020 Chat Translation Task (Farajian et al., 2020) and is based on the monolingual Taskmaster-1 corpus (Byrne et al., 2019). The conversations (originally in English) were first automatically translated into German and then manually post-edited by Unbabel editors, who are native German speakers. Having the conversations in both languages allows us to simulate bilingual conversations in which one speaker, the customer, speaks in German and the other speaker, the agent, answers in English.

BMELD.
Similarly, based on the dialogue dataset MELD (originally in English) (Poria et al., 2019), we first crawled the corresponding Chinese translations from the web and then manually post-edited them according to the dialogue history; the editors are native Chinese speakers who are postgraduate students majoring in English. Finally, following Farajian et al. (2020), we designate 50% of the speakers as Chinese speakers to keep the data balanced for Ch⇒En translations, and build the bilingual MELD (BMELD). For Chinese, we segment the sentences using the Stanford CoreNLP toolkit.

Implementation Details
For all experiments, we follow the Transformer-Base and Transformer-Big settings illustrated in (Vaswani et al., 2017). In Transformer-Base, we use 512 as hidden size (i.e., d), 2048 as filter size and 8 heads in multi-head attention. In Transformer-Big, we use 1024 as hidden size, 4096 as filter size, and 16 heads in multi-head attention. All our Transformer models contain N e = 6 encoder layers and N d = 6 decoder layers and all models are trained using THUMT (Tan et al., 2020) framework. We conduct experiments on the validation set of En⇒De to select the hyperparameters of context length and latent dimension, which are then shared for all tasks. For the results and more details (other hyperparameters setting and average running time), please refer to Appendix B, C, and D.

Comparison Models
Baseline NMT Models. Transformer (Vaswani et al., 2017): the de-facto NMT model that does not fine-tune on chat translation data. Transformer+FT: fine-tuning on the chat translation data after being pre-trained on sentence-level NMT corpus.
Context-Aware NMT Models. Doc-Transformer+FT (Ma et al., 2020): a state-of-the-art document-level NMT model based on the Transformer, which shares the first encoder layer to incorporate the bilingual dialogue history. Dia-Transformer+FT: a dialogue-aware Transformer that additionally encodes the dialogue history with an RNN (Hochreiter and Schmidhuber, 1997). V-Transformer+FT: a variational Transformer that introduces a latent variable over the source sentence (cf. the Variational NMT background), fine-tuned on the chat translation data.

Main Results
Overall, we separate the models into two parts in Tab. 2: the Base setting and the Big setting. In each part, we show the results of our re-implemented Transformer baselines, the context-aware NMT systems, and our approach on En⇔De and En⇔Ch.
Results on En⇔De. Under the Base setting, CPCC substantially outperforms the baselines (e.g., "Transformer+FT") by a large margin, with 1.70↑ and 1.48↑ BLEU scores on En⇒De and De⇒En, respectively. On TER, our CPCC achieves a significant improvement of 1.3 points in both directions. Under the Big setting, our CPCC also consistently boosts the performance in both directions (i.e., 1.22↑ and 1.47↑ BLEU, 0.4↓ and 1.1↓ TER), showing its effectiveness. Compared against the strong context-aware NMT systems (underlined results), our CPCC significantly surpasses them (by about 1.39∼1.59↑ BLEU and 0.6∼0.9↓ TER) in both language directions under both the Base and Big settings, demonstrating the superiority of our model.

Results on En⇔Ch.
We also conduct experiments on our self-collected data to validate the generalizability across languages in Tab. 2.
Our CPCC presents remarkable BLEU improvements over the "Transformer+FT" by a large margin in two directions by 2.33↑ and 0.91↑ BLEU gains under the Base setting, respectively, and by 2.03↑ and 0.83↑ BLEU gains in both directions under the Big setting. These results suggest that CPCC consistently performs well across languages.
Compared with strong context-aware NMT systems (e.g., "V-Transformer+FT"), our approach notably surpasses them in both language directions under both Base and Big settings, which shows the generalizability and superiority of our model.

Ablation Study
We conduct ablation studies to investigate how well each tailored latent variable of our model works. When removing latent variables listed in Tab. 3, we have the following findings.
(1) All latent variables make substantial contributions to performance, proving the importance of modeling role preference, dialogue coherence, and translation consistency, which is consistent with our intuition that the properties should be beneficial to better translations (rows 1∼3 vs. row 0).
(2) Results of rows 4∼7 show the combination effect of three latent variables, suggesting that the combination among three latent variables has a cumulative effect (rows 4∼7 vs. rows 0∼3).
(3) Row 7 vs. row 0 shows that explicitly modeling the bilingual conversational characteristics significantly outperforms implicit modeling (i.e., just incorporating the dialogue history into encoders), which lacks the relevant information guidance.

Dialogue Coherence
Following Lapata and Barzilay (2005) and Xiong et al. (2019), we measure dialogue coherence as sentence similarity. Specifically, the representation of each sentence is the mean of the distributed vectors of its words, and the coherence between two sentences s_1 and s_2 is their cosine similarity:

sim(s_1, s_2) = cos(f(s_1), f(s_2)), with f(s) = (1/|s|) Σ_{w∈s} e_w,

where e_w is the distributed vector of word w. We use Word2Vec (Mikolov et al., 2013) to learn the word vectors, trained on the monolingual dialogue dataset Taskmaster-1 (Byrne et al., 2019), and set the dimensionality of the word embeddings to 100.
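The coherence metric above can be sketched directly; `word_vecs` is a hypothetical toy embedding table standing in for the trained Word2Vec vectors:

```python
import math

def sent_vec(sentence, word_vecs):
    """Sentence representation: mean of the distributed vectors of its words."""
    vecs = [word_vecs[w] for w in sentence.split() if w in word_vecs]
    d = len(next(iter(word_vecs.values())))
    return [sum(v[j] for v in vecs) / len(vecs) for j in range(d)]

def coherence(s1, s2, word_vecs):
    """Cosine similarity of the two mean word vectors."""
    v1, v2 = sent_vec(s1, word_vecs), sent_vec(s2, word_vecs)
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm
```

Identical sentences score 1.0 and orthogonal-vector sentences score 0.0, matching the intent of the metric: higher similarity to preceding utterances means a more coherent translation.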
Tab. 4 shows the cosine similarity on the test set of De⇒En. It reveals that our model encouraged by tailor-made latent variables produces better coherence in chat translation than contrast systems.

Human Evaluation
Inspired by Bao et al. (2020) and Farajian et al. (2020), we use four criteria for human evaluation: (1) Preference measures whether the translation preserves the role preference information; (2) Coherence denotes whether the translation is semantically coherent with the dialogue history; (3) Consistency measures whether the lexical choice of the translation is consistent with the preceding utterances; (4) Fluency measures whether the translation is logically reasonable and grammatically correct.

Table 4: Results of dialogue coherence in terms of sentence similarity (De⇒En, Base). The "#-th Pr." denotes the #-th preceding utterance to the current one. "††" indicates statistically significant improvement over the best result of all contrast NMT models (p < 0.01).
We firstly randomly sample 200 examples from the test set of Ch⇒En. Then, we assign each bilingual dialogue history and corresponding 6 generated translations to three human annotators without order, and ask them to evaluate whether each translation meets the criteria defined above. All annotators are postgraduate students and not involved in other parts of our experiments.
Tab. 5 shows that our CPCC effectively alleviates the problem of role-irrelevant, incoherent, and inconsistent translations compared with other models (significance test (Koehn, 2004), p < 0.05), indicating the superiority of our model. The inter-annotator agreement, calculated by Fleiss' kappa (Fleiss and Cohen, 1973), is 0.527, 0.491, 0.556, and 0.485 for preference, coherence, consistency, and fluency, respectively, indicating "Moderate Agreement" on all four criteria. We also present some case studies in Appendix F.

Related Work
Chat NMT. Chat translation has seen only a few studies due to the lack of human-annotated, publicly available data (Farajian et al., 2020). Therefore, some existing work (Wang et al., 2016; Zhang and Zhou, 2019; Rikters et al., 2020) mainly pays attention to automatically constructing subtitle corpora, which may contain noisy bilingual utterances. Recently, Farajian et al. (2020) organized the WMT20 chat translation task and first provided a human post-edited corpus, where several teams investigated the effect of dialogue history and ensembled their models for higher ranks (Berard et al., 2020; Mohammed et al., 2020; Bao et al., 2020; Moghe et al., 2020). In concurrent work, Wang et al. (2021) use multi-task learning to automatically correct translation errors such as pronoun dropping, punctuation dropping, and typos. Unlike them, we focus on explicitly modeling role preference, dialogue coherence, and translation consistency with tailored latent variables to promote the translation quality.
Chat NMT can be viewed as a special case of context-aware NMT, which has attracted many researchers (Gong et al., 2011;Jean et al., 2017;Wang et al., 2017b;Bawden et al., 2018;Miculicich et al., 2018;Kuang et al., 2018;Tu et al., 2018;Kang et al., 2020;Ma et al., 2020) to extend the encoder or decoder for exploring the context impact on translation quality. Although these models can be directly applied to chat translation, they cannot explicitly capture the bilingual conversational characteristics and thus lead to unsatisfactory translations (Moghe et al., 2020). Different from these studies, we focus on explicitly modeling these bilingual conversational characteristics via CVAE for better translations.

Conditional Variational Auto-Encoder

CVAE has verified its superiority in many fields (Sohn et al., 2015). In NMT, prior studies (e.g., Su et al., 2018) extend CVAE to capture the global/local information of the source sentence for better results. McCarthy et al. (2020) focus on addressing posterior collapse with mutual information. Besides, some studies use CVAE to model the correlations between image and text for multimodal NMT (Toyama et al., 2016; Calixto et al., 2019). Although CVAE has been widely used in NLP tasks, its adaptation to chat translation for modeling inherent bilingual conversational characteristics is non-trivial and, to the best of our knowledge, has never been investigated before.

Conclusion and Future Work
We propose to model bilingual conversational characteristics through tailored latent variables for neural chat translation. Experiments on the En⇔De and En⇔Ch directions show that our model notably improves translation quality on both the BLEU and TER metrics, demonstrating its superiority and generalizability. Human evaluation further verifies that our model yields role-specific, coherent, and consistent translations by incorporating tailored latent variables into NMT. Moreover, we contribute a new bilingual dialogue dataset (BMELD, En⇔Ch) with manual translations to the research community. In the future, we would like to explore the effect of multimodality and emotion on chat translation, which has been well studied in the dialogue field.


A Sentence-Level NMT Data

... Corpus, and WikiMatrix for the En⇔Ch. We first filter noisy sentence pairs according to duplication and length (discarding pairs whose length exceeds 80). To pre-process the raw data, we employ a series of open-source/in-house scripts, including full-/half-width conversion, Unicode conversion, punctuation normalization, and tokenization. After these filtering steps, we generate subwords via joint BPE (Sennrich et al., 2016).

B Implementation Details
For all experiments, we follow the two model settings described in Vaswani et al. (2017), namely Transformer-Base and Transformer-Big. The number of training steps is set to 200,000 for the first stage and 2,000 for the fine-tuning stage. The batch size for each GPU is set to 4096 tokens. The beam size is set to 4 and the length penalty to 0.6 in all experiments. All experiments in the first stage are conducted on 8 NVIDIA Tesla V100 GPUs, while we use 2 GPUs for the second stage, i.e., fine-tuning. This gives us about 8*4096 and 2*4096 tokens per update in the first and second stages, respectively. All models are optimized using Adam (Kingma and Ba, 2015) with β_1 = 0.9 and β_2 = 0.998, and the learning rate is set to 1.0 for all experiments. Label smoothing is set to 0.1. We use a dropout of 0.1/0.3 for the Base and Big settings, respectively. To alleviate the degeneration problem of the variational framework, we apply KL annealing: the KL multiplier λ gradually increases from 0 to 1 over 10,000 steps. |R| is set to 2 for En⇔De and 7 for En⇔Ch. |T| is set to 10. The criterion for selecting hyperparameters is the BLEU score on the validation sets of both tasks. The average running time is shown in Tab. 7.
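The linear KL annealing schedule described above (λ rising from 0 to 1 over 10,000 steps, then held at 1) can be written in one line:

```python
def kl_multiplier(step, warmup=10_000):
    """Linear KL annealing: lambda = min(1, step / warmup).
    Early in fine-tuning the KL terms are down-weighted so the latent
    variables are not pushed to the prior before they encode anything."""
    return min(1.0, step / warmup)
```

At each fine-tuning step, each of the three KL terms in the second-stage objective would be scaled by this multiplier before being subtracted from the reconstruction term.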

In the case of blind testing or online use (assume En⇒De), since the translations of target utterances (i.e., English) will not be given, an inverse De⇒En model is simultaneously trained and used to back-translate the target utterances (Bao et al., 2020); the same applies to all tasks.

C Effect of Context Length
We first investigate the effect of context length (i.e., the number of preceding utterances) on our approach under the Transformer-Base setting. As shown in the left of Fig. 4, using three preceding source sentences as dialogue history achieves the best translation performance on the validation set (En⇒De). Using more preceding sentences does not bring any improvement and increases the computational cost. This confirms the finding of Tu et al. (2018) and others that long-distance context has only limited influence. Therefore, we set the number of preceding sentences to 3 in all experiments.

D Effect of Latent Dimension
The right of Fig. 4 shows the effect of the latent dimension on translation quality under the Transformer Base setting. Obviously, using latent dimension 32 suffices to achieve superior performance. Increasing the dimension does not lead to any improvements. Therefore, we set the latent dimension to 32 in all experiments.

E KL Divergence
Generally, the KL divergence measures the amount of information encoded in a latent variable. In the extreme case where the KL divergence of a latent variable z equals zero, the model completely ignores z, i.e., it degenerates. Fig. 5 shows that the total KL divergence of our model stays around 0.2∼0.5, indicating that the degeneration problem does not occur and the latent variables can play their corresponding roles.

F Case Study
In this section, we show some cases in Fig. 6 and Fig. 7 to investigate the effect of different models.
Role Preference and Dialogue Coherence. As shown in Fig. 6, we observe that the baseline models and the context-aware models, except "V-Transformer+FT", cannot preserve the role preference information (e.g., the joy emotion), even though these "*-Transformer+FT" models incorporate the bilingual conversational history into the encoder. The "V-Transformer+FT" model produces only slightly emotional elements (e.g., "zěnme?") because its latent variable over the source sentence captures some relevant preference information. Meanwhile, we find that none of the comparison models generates a coherent translation. The reason may be that they fail to capture the conversation-level coherence clue, i.e., "boat". By contrast, we explicitly model the two characteristics through tailored latent variables and thus obtain satisfactory results.
Translation Consistency. As shown in Fig. 7, we observe that none of the comparison models maintains translation consistency, due to the lack of explicit modeling of this characteristic. Our model overcomes this issue and keeps the correct lexical choice, i.e., "jiàchuán", when translating the current utterance whose translation has appeared in preceding turns. To sum up, both cases show that our model yields role-specific, coherent, and consistent translations by incorporating tailored latent variables into the translator, demonstrating its effectiveness and superiority.