Towards Making the Most of Dialogue Characteristics for Neural Chat Translation

Neural Chat Translation (NCT) aims to translate conversational text between speakers of different languages. Despite the promising performance of sentence-level and context-aware neural machine translation models, current NCT models still have limitations because the inherent dialogue characteristics of chat, such as dialogue coherence and speaker personality, are neglected. In this paper, we propose to improve chat translation by introducing the modeling of dialogue characteristics into the NCT model. To this end, we design four auxiliary tasks: monolingual response generation, cross-lingual response generation, next utterance discrimination, and speaker identification. Together with the main chat translation task, we optimize the enhanced NCT model through the training objectives of all these tasks. By this means, the NCT model is enhanced to capture the inherent dialogue characteristics and thus generates more coherent and speaker-relevant translations. Comprehensive experiments on four language directions (English⇔German and English⇔Chinese) verify the effectiveness and superiority of the proposed approach.


Introduction
A cross-lingual conversation involves participants who speak different languages (e.g., one speaking in English and another in Chinese, as shown in Fig. 1), where a chat translator can be applied to help participants communicate in their individual native languages. The chat translator converts the language of bilingual conversational text in both directions, e.g., from English to Chinese and vice versa (Farajian et al., 2020). With growing international communication worldwide, the chat translation task becomes more important and has an ever wider range of applications.
In recent years, although sentence-level Neural Machine Translation (NMT) models (Sutskever et al., 2014; Vaswani et al., 2017; Hassan et al., 2018; Yan et al., 2020) have achieved remarkable progress and can be directly used as the chat translator, they often lead to incoherent and speaker-irrelevant translations (Mirkin et al., 2015; Wang et al., 2017a; Läubli et al., 2018; Toral et al., 2018) because they ignore the chat history that contains useful contextual information. To exploit chat history, context-aware NMT models (Tiedemann and Scherrer, 2017; Bawden et al., 2018; Miculicich et al., 2018; Tu et al., 2018; Voita et al., 2018, 2019a; Wang et al., 2019a; Maruf et al., 2019; Ma et al., 2020, etc.) can also be directly adapted to chat translation. However, their performance is usually limited because they lack modeling of the inherent dialogue characteristics (e.g., dialogue coherence and speaker personality), which matter for the chat translation task, as pointed out by Farajian et al. (2020).
In this paper, we propose a Coherence-Speaker-Aware NCT (CSA-NCT) training framework to improve the NCT model by making use of the dialogue characteristics of conversations. Concretely, from the perspectives of dialogue coherence and speaker personality, we design four auxiliary tasks alongside the main chat translation task. For dialogue coherence, there are three tasks (two generation tasks and one discrimination task), namely monolingual response generation, cross-lingual response generation, and next utterance discrimination. Specifically, as shown in Fig. 1, (1) the monolingual response generation task aims to generate the coherent corresponding utterance in the target language given the dialogue history context in the same language. Similarly, (2) the cross-lingual response generation task leverages the dialogue history context in the source language to generate the coherent corresponding utterance in the target language. Besides the above two generation tasks, (3) the next utterance discrimination task focuses on distinguishing whether the translated text is coherent as the next utterance of the given dialogue history context. Moreover, for speaker personality, (4) we design the speaker identification task, which judges whether the translated text is consistent with the personality of its original speaker. Together with the main chat translation task, the NCT model is optimized through the joint objectives of all these auxiliary tasks. In this way, the model is enhanced to capture dialogue coherence and speaker personality in conversation, and thus can generate more coherent and speaker-relevant translations. We validate our CSA-NCT framework on datasets of different language pairs: BConTrasT (Farajian et al., 2020) (En⇔De) and BMELD (Liang et al., 2021a) (En⇔Zh).
The experimental results show that our model achieves consistent improvements on four translation tasks in terms of both BLEU (Papineni et al., 2002) and TER (Snover et al., 2006), demonstrating its effectiveness and generalizability. Human evaluation further suggests that our model can generate more coherent and speaker-relevant translations compared to the existing related methods.
Our contributions are summarized as follows:
• To the best of our knowledge, we are the first to incorporate dialogue coherence and speaker personality into neural chat translation.
• We propose a multi-task learning framework with four auxiliary tasks to help the NCT model generate more coherent and speaker-relevant translations.
• Extensive experiments on datasets of different language pairs demonstrate that our model with multi-task learning achieves state-of-the-art performance on the chat translation task and significantly outperforms existing sentence-level/context-aware NMT models.

Background
Sentence-Level NMT. Given an input sequence $X=\{x_i\}_{i=1}^{|X|}$ in the source language and its translation $Y=\{y_t\}_{t=1}^{|Y|}$ in the target language, the model is optimized through the following objective:

$\mathcal{L}(\theta) = -\sum_{t=1}^{|Y|} \log p(y_t \mid y_{<t}, X; \theta). \quad (1)$

Context-Aware NMT. As in (Ma et al., 2020), given a paragraph of input sentences $D_X=\{X_j\}_{j=1}^{J}$ in the source language and its corresponding translations $D_Y=\{Y_j\}_{j=1}^{J}$ in the target language with $J$ paired sentences, the training objective of a context-aware NMT model can be formalized as

$\mathcal{L}(\theta) = -\sum_{j=1}^{J} \sum_{t=1}^{|Y_j|} \log p(y_{j,t} \mid y_{j,<t}, X_j, X_{<j}, Y_{<j}; \theta),$

where $X_{<j}$ and $Y_{<j}$ are the preceding contexts of the $j$-th input source sentence and the $j$-th target translation, respectively.
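As an illustrative sketch of the two objectives above, the following toy functions accumulate the per-token negative log-likelihoods; the function names and the list-of-log-probabilities representation are our own illustration, not the paper's implementation.

```python
import math

def nll_loss(token_log_probs):
    """Sentence-level NMT objective: -sum_t log p(y_t | y_<t, X).

    token_log_probs: one log-probability per target token, as produced
    by the decoder's softmax at each step.
    """
    return -sum(token_log_probs)

def context_aware_nll(doc_token_log_probs):
    """Context-aware objective: sum the per-sentence losses over all J pairs.

    doc_token_log_probs: one list per sentence pair, each holding
    log p(y_{j,t} | y_{j,<t}, X_j, X_{<j}, Y_{<j}) for its tokens.
    """
    return sum(nll_loss(sent) for sent in doc_token_log_probs)

# A perfectly confident model (probability 1 for every token) has zero loss.
perfect = [math.log(1.0)] * 4
assert nll_loss(perfect) == 0.0
```

The only difference between the two objectives is what each token probability is conditioned on; the loss accumulation itself is identical.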

CSA-NCT Training Framework
In this section, we introduce the proposed CSA-NCT training framework, which aims to improve the NCT model with four elaborately designed auxiliary tasks. In the following subsections, we first describe the problem formalization ( § 3.1) and the NCT model ( § 3.2). Then, we introduce each auxiliary task in detail ( § 3.3). Finally, we elaborate the process of training and inference ( § 3.4).

Problem Formalization
In the scenario of this paper, the chat involves two speakers (sx and sy) speaking in two languages. As shown in Fig. 1, we assume the two speakers have alternately given utterances in their individual languages for u turns, resulting in X_1, X_2, X_3, ..., X_{u-1}, X_u and Y_1, Y_2, Y_3, ..., Y_{u-1}, Y_u on the source and target sides, respectively. Among these utterances, X_1, X_3, X_5, ..., X_u are originally spoken by the speaker sx, and Y_1, Y_3, Y_5, ..., Y_u are the corresponding translations in the target language. Analogously, Y_2, Y_4, Y_6, ..., Y_{u-1} are originally spoken by the speaker sy, and X_2, X_4, X_6, ..., X_{u-1} are the translated utterances in the source language.
According to languages, we define the dialogue history context of X_u on the source side as C_{X_u} = {X_1, X_2, X_3, ..., X_{u-1}} and that of Y_u on the target side as C_{Y_u} = {Y_1, Y_2, Y_3, ..., Y_{u-1}}. According to original speakers, on the target side, we define the speaker sx-specific dialogue history context of Y_u as the partial set of its preceding utterances C^{sx}_{Y_u} = {Y_1, Y_3, Y_5, ..., Y_{u-2}}, and the speaker sy-specific dialogue history context of Y_u as C^{sy}_{Y_u} = {Y_2, Y_4, Y_6, ..., Y_{u-1}}. Based on the above formulations, the goal of an NCT model is to translate X_u into Y_u with certain types of dialogue history context. Next, we describe the NCT model in our CSA-NCT training framework.
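The context definitions above can be sketched as a small helper; the function name and list representation are our own illustration (Python lists are 0-indexed, so `X[0]` stands for X_1).

```python
def build_contexts(X, Y):
    """Build the dialogue history contexts for the u-th utterance.

    X, Y: utterances X_1..X_u and Y_1..Y_u. Odd turns (X_1, X_3, ...)
    are originally spoken by speaker sx, even turns by speaker sy.
    """
    u = len(X)
    C_X = X[: u - 1]                            # {X_1, ..., X_{u-1}}
    C_Y = Y[: u - 1]                            # {Y_1, ..., Y_{u-1}}
    # Speaker-specific target-side contexts preceding turn u:
    C_sx = [Y[i] for i in range(0, u - 1, 2)]   # Y_1, Y_3, ..., Y_{u-2}
    C_sy = [Y[i] for i in range(1, u - 1, 2)]   # Y_2, Y_4, ..., Y_{u-1}
    return C_X, C_Y, C_sx, C_sy
```

For u = 5, the sx-specific context contains Y_1 and Y_3, while the sy-specific context contains Y_2 and Y_4, matching the definitions in the text.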

The NCT Model
The NCT model is based on the Transformer (Vaswani et al., 2017), which is composed of an encoder and a decoder, as shown in Fig. 2.
Figure 2: Architecture of the proposed CSA-NCT framework. The right part is the general NCT model, which is enhanced by four auxiliary tasks. The four auxiliary tasks, including monolingual response generation (MRG), cross-lingual response generation (CRG), next utterance discrimination (NUD), and speaker identification (SI), are proposed to improve the coherence and speaker relevance of chat translation, and are presented in Fig. 3 in detail.

Encoder. Following (Ma et al., 2020), the encoder takes [C_{X_u}; X_u] as input, where [;] denotes concatenation. In addition to the conventional embedding layer with only word embedding WE and position embedding PE, we additionally add a speaker embedding SE and a turn embedding TE. The final embedding B(x_i) of the input word x_i can be written as

$B(x_i) = \mathrm{WE}(x_i) + \mathrm{PE}(x_i) + \mathrm{SE}(x_i) + \mathrm{TE}(x_i),$

where WE ∈ R^{|V|×d}, SE ∈ R^{2×d}, and TE ∈ R^{|T|×d}.
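A minimal sketch of this embedding composition follows; the lookup-table representation, the `embed` helper, and the turn-clipping behavior are our own illustrative assumptions (the text only states that |T| bounds the turn embedding table).

```python
def embed(token_id, position, speaker, turn, WE, PE, SE, TE):
    """B(x_i) = WE[x_i] + PE[i] + SE[speaker] + TE[turn].

    WE, PE, SE, TE are lookup tables mapping indices to d-dimensional
    vectors (here plain lists of lists). `speaker` is 0 or 1, matching
    SE's two rows; `turn` is clipped to the table size |T|.
    """
    turn = min(turn, len(TE) - 1)  # clip turns beyond |T| (assumption)
    vecs = (WE[token_id], PE[position], SE[speaker], TE[turn])
    # Element-wise sum of the four d-dimensional vectors.
    return [sum(components) for components in zip(*vecs)]
```

The four embeddings simply sum element-wise, so the encoder input carries word identity, position, speaker, and turn information in a single d-dimensional vector per token.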
Then, the embedding is fed into the NCT encoder, which has L identical layers, each composed of a self-attention (SelfAtt) sub-layer and a feed-forward network (FFN) sub-layer. Let h^l_e denote the hidden states of the l-th encoder layer; they are calculated as

$\tilde{h}^l_e = \mathrm{SelfAtt}(h^{l-1}_e) + h^{l-1}_e,$
$h^l_e = \mathrm{FFN}(\tilde{h}^l_e) + \tilde{h}^l_e,$

where h^0_e is initialized as the embedding of the input words (layer normalization is omitted for brevity). Particularly, words in C_{X_u} can only be attended to by those in X_u at the first encoder layer, while C_{X_u} is masked at the other layers, the same implementation as in (Ma et al., 2020).
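The layer-dependent context masking can be sketched as a boolean attention mask; the exact mask shape used by Ma et al. (2020) may differ in detail, so treat this as one plausible reading of "C_{X_u} is masked at the other layers".

```python
def encoder_attention_mask(ctx_len, utt_len, layer):
    """Self-attention mask for the given (0-indexed) NCT encoder layer.

    Returns an n x n boolean matrix (n = ctx_len + utt_len) where True
    means "query may attend to key". At the first layer all positions,
    including the context C_Xu, are visible; at higher layers the
    context keys are masked out, so only X_u attends to itself.
    """
    n = ctx_len + utt_len
    mask = [[True] * n for _ in range(n)]
    if layer > 0:
        for q in range(n):
            for k in range(ctx_len):
                mask[q][k] = False  # hide context keys above layer 0
    return mask
```

This realizes the idea that contextual information flows into the utterance representation only once, at the bottom of the encoder, instead of at every layer.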
Decoder. The decoder also consists of L identical layers, each of which additionally includes a cross-attention (CrossAtt) sub-layer compared to the encoder. Let h^l_d denote the hidden states of the l-th decoder layer; they are computed as

$\tilde{h}^l_d = \mathrm{SelfAtt}(h^{l-1}_d) + h^{l-1}_d,$
$\hat{h}^l_d = \mathrm{CrossAtt}(\tilde{h}^l_d, h^L_e) + \tilde{h}^l_d,$
$h^l_d = \mathrm{FFN}(\hat{h}^l_d) + \hat{h}^l_d,$

where h^L_e denotes the top-layer encoder hidden states. At each decoding time step t, h^L_{d,t} is fed into a linear transformation layer and a softmax layer to predict the probability distribution of the next target token:

$p(Y_{u,t} \mid Y_{u,<t}, X_u, C_{X_u}) = \mathrm{Softmax}(W_o h^L_{d,t} + b_o),$

where Y_{u,<t} denotes the preceding tokens before the t-th time step in the utterance Y_u, and W_o ∈ R^{|V|×d} and b_o ∈ R^{|V|} are trainable parameters. Finally, the training objective is as follows:

$\mathcal{L}_{NCT} = -\sum_{t=1}^{|Y_u|} \log p(Y_{u,t} \mid Y_{u,<t}, X_u, C_{X_u}).$

Auxiliary Tasks
We elaborately design four auxiliary tasks to incorporate the modeling of dialogue characteristics. The four auxiliary tasks are divided into two groups. The first group is for dialogue coherence modeling while the second is for speaker personality modeling. Together with the main chat translation task, the NCT model can be enhanced to generate more coherent and speaker-relevant translations through multi-task learning.

Dialogue Coherence Modeling
Many studies (Kuang et al., 2018; Wang et al., 2019b; Xiong et al., 2019; Wang and Wan, 2019) have indicated that modeling global textual coherence can lead to more coherent text generation. Inspired by this, we add two response generation tasks and an utterance discrimination task during NCT model training. All three tasks are related to the dialogue coherence of conversations, and thus introduce the modeling of dialogue coherence into the NCT model.

Monolingual Response Generation (MRG).
As illustrated in Fig. 3(a), given the dialogue history context C_{Y_u} in the target language, the MRG task forces the NCT model to generate the corresponding utterance Y_u coherent with C_{Y_u}. Particularly, we first use the encoder of the NCT model to encode C_{Y_u}, and then use the NCT decoder to predict Y_u. The training objective of this task can be formulated as:

$\mathcal{L}_{MRG} = -\sum_{t=1}^{|Y_u|} \log p(Y_{u,t} \mid Y_{u,<t}, C_{Y_u}),\quad p(Y_{u,t} \mid Y_{u,<t}, C_{Y_u}) = \mathrm{Softmax}(W_m h^L_{d,t} + b_m),$

where h^L_{d,t} is the top-layer decoder hidden state at the t-th decoding step, and W_m and b_m are trainable parameters.

Cross-lingual Response Generation (CRG).
The CRG task is similar to MRG, as shown in Fig. 3(b): the NCT model is trained to generate the corresponding utterance Y_u in the target language that is coherent with the given dialogue history context C_{X_u} in the source language. We first use the encoder of the NCT model to encode C_{X_u}, and then use the NCT decoder to predict Y_u. The training objective of this task can be formulated as:

$\mathcal{L}_{CRG} = -\sum_{t=1}^{|Y_u|} \log p(Y_{u,t} \mid Y_{u,<t}, C_{X_u}),\quad p(Y_{u,t} \mid Y_{u,<t}, C_{X_u}) = \mathrm{Softmax}(W_{crg} h^L_{d,t} + b_{crg}),$

where h^L_{d,t} denotes the top-layer decoder hidden state at the t-th decoding step, and W_{crg} and b_{crg} are trainable parameters.
Note that in the above two response generation tasks, we use the same set of NCT model parameters except for the softmax layer (i.e., W_m, b_m, W_{crg}, and b_{crg}).
Next Utterance Discrimination (NUD). As shown in Fig. 3(c), we design the NUD task to distinguish whether the translated text is coherent as the next utterance of the given dialogue history context. Concretely, we construct positive and negative samples of context-utterance pairs from the chat corpus. A positive sample (C_{Y_u}, Y_u^+) with label ℓ = 1 consists of the target utterance Y_u and its dialogue history context C_{Y_u}. A negative sample (C_{Y_u}, Y_u^-) with label ℓ = 0 consists of the identical C_{Y_u} and a randomly selected utterance Y_u^- from the training set. Formally, the training objective of NUD is defined as follows:

$\mathcal{L}_{NUD} = -\big(\log p(\ell=1 \mid C_{Y_u}, Y_u^{+}) + \log p(\ell=0 \mid C_{Y_u}, Y_u^{-})\big). \quad (4)$

For a training sample (C_{Y_u}, Y_u), to estimate the probability in Eq. 4 for discrimination, we first obtain the representation H_{Y_u} of the target utterance Y_u and H_{C_{Y_u}} of the given dialogue history context C_{Y_u} using the NCT encoder. Specifically, H_{Y_u} is calculated as $\frac{1}{|Y_u|}\sum_{t=1}^{|Y_u|} h^L_{e,t}$, while H_{C_{Y_u}} is defined as the encoder hidden state h^L_{e,0} of the prepended special token '[cls]' of C_{Y_u}. Then, the concatenation of H_{Y_u} and H_{C_{Y_u}} is fed into a binary NUD classifier, which is an extra fully-connected layer on top of the NCT encoder:

$p(\ell \mid C_{Y_u}, Y_u) = \mathrm{Softmax}(W_n [H_{Y_u}; H_{C_{Y_u}}]),$

where W_n is the trainable parameter of the NUD classifier and the bias term is omitted for simplicity.
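The positive/negative sample construction for NUD can be sketched as follows; the function name, the data layout, and the choice to emit exactly one negative per positive are our own illustrative assumptions.

```python
import random

def make_nud_samples(dialogues, rng):
    """Build (context, utterance, label) triples for the NUD task.

    dialogues: list of dialogues, each a list of target-side utterances.
    For every utterance Y_u (u > 1) we emit a positive pair
    (C_Yu, Y_u, 1) and a negative pair (C_Yu, Y_neg, 0), where Y_neg
    is drawn at random from the whole corpus.
    """
    all_utts = [utt for dialogue in dialogues for utt in dialogue]
    samples = []
    for dialogue in dialogues:
        for u in range(1, len(dialogue)):
            context, positive = dialogue[:u], dialogue[u]
            samples.append((context, positive, 1))
            negative = rng.choice(all_utts)  # random corpus utterance
            samples.append((context, negative, 0))
    return samples
```

Note that a randomly drawn negative can occasionally coincide with the true next utterance; a production implementation would typically resample in that case.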

Speaker Personality Modeling
A dialogue always involves speakers who have different personalities, which is a salient characteristic of conversations. Therefore, we design a speaker identification task that incorporates the modeling of speaker personality into the NCT model, making the translated utterance more speaker-relevant.
Speaker Identification (SI). As explored in (Bak and Oh, 2019; Liang et al., 2021b; Lin et al., 2021), the history utterances of a speaker can reflect a distinctive personality. Fig. 3(d) depicts the SI task in detail, where the NCT model is used to distinguish whether a translated utterance and given speaker-specific history utterances are spoken by the same speaker. We also construct positive and negative training samples from the chat corpus. A positive sample (C^{sx}_{Y_u}, Y_u) with label ℓ = 1 consists of the target utterance Y_u and the speaker sx-specific history context C^{sx}_{Y_u}, because Y_u is the translation of the utterance originally spoken by the speaker sx. A negative sample (C^{sy}_{Y_u}, Y_u) with label ℓ = 0 consists of the target utterance Y_u and the speaker sy-specific history context C^{sy}_{Y_u}. Formally, the training objective of SI is defined as follows:

$\mathcal{L}_{SI} = -\big(\log p(\ell=1 \mid C^{sx}_{Y_u}, Y_u) + \log p(\ell=0 \mid C^{sy}_{Y_u}, Y_u)\big). \quad (5)$

For a training sample (C^s_{Y_u}, Y_u) with s ∈ {sx, sy}, we also use the NCT encoder to obtain the representation H_{Y_u} of the target utterance Y_u and H_{C^s_{Y_u}} of the given speaker-specific history context C^s_{Y_u}. Similar to the NUD task, $H_{Y_u} = \frac{1}{|Y_u|}\sum_{t=1}^{|Y_u|} h^L_{e,t}$, and the h^L_{e,0} of C^s_{Y_u} is used as H_{C^s_{Y_u}}. Then, to estimate the probability in Eq. 5, the concatenation of H_{Y_u} and H_{C^s_{Y_u}} is fed into a binary SI classifier, which is another fully-connected layer on top of the NCT encoder:

$p(\ell \mid C^s_{Y_u}, Y_u) = \mathrm{Softmax}(W_s [H_{Y_u}; H_{C^s_{Y_u}}]),$

where W_s is the trainable parameter of the SI classifier and the bias term is also omitted.

Algorithm 1 (excerpt): compute L_MRG, L_CRG, L_NUD, L_SI, and L_NCT; update the parameters of the CSA-NCT model with respect to J using Adam.
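The SI sample construction can be sketched in the same style as the NUD one; the helper name and list layout are our own illustration, and we assume the current turn u is odd (i.e., Y_u was originally spoken by sx), as in the paper's setup.

```python
def make_si_samples(Y):
    """Build speaker identification samples for the target utterance Y_u.

    Y: target-side utterances Y_1..Y_u (Y[-1] is Y_u, originally spoken
    by speaker sx on the odd turns). The positive sample pairs Y_u with
    the sx-specific history context; the negative sample pairs it with
    the sy-specific one.
    """
    u = len(Y)
    C_sx = [Y[i] for i in range(0, u - 1, 2)]  # Y_1, Y_3, ..., Y_{u-2}
    C_sy = [Y[i] for i in range(1, u - 1, 2)]  # Y_2, Y_4, ..., Y_{u-1}
    return [(C_sx, Y[-1], 1), (C_sy, Y[-1], 0)]
```

Unlike NUD, no random sampling is needed here: the wrong-speaker context itself serves as the negative evidence.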

Training and Inference
For training, with the main chat translation task and the four auxiliary tasks, the total training objective is finally formulated as

$\mathcal{J}(\Theta) = \mathcal{L}_{NCT} + \alpha(\mathcal{L}_{MRG} + \mathcal{L}_{CRG}) + \beta(\mathcal{L}_{NUD} + \mathcal{L}_{SI}),$

where α and β are balancing hyper-parameters for the trade-off between L_NCT and the other auxiliary objectives. Algorithm 1 summarizes the training procedure of the above multi-task learning process, where θ refers to the parameters of our NCT model and Θ refers to the whole set of parameters including both θ and the parameters of the additional classifiers for the auxiliary tasks. During inference, the four auxiliary tasks are not involved, and only the NCT model (θ) is used to conduct chat translation.
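A minimal sketch of combining the five losses follows. The grouping of the generation losses under α and the discrimination losses under β is an assumption on our part; the text only states that α and β balance L_NCT against the auxiliary objectives.

```python
def total_objective(losses, alpha, beta):
    """J = L_NCT + alpha * (L_MRG + L_CRG) + beta * (L_NUD + L_SI).

    losses: dict with keys 'nct', 'mrg', 'crg', 'nud', 'si', each a
    scalar loss value. alpha and beta are the balancing factors.
    """
    return (losses["nct"]
            + alpha * (losses["mrg"] + losses["crg"])
            + beta * (losses["nud"] + losses["si"]))
```

Setting alpha = beta = 0 recovers the plain NCT objective, which is also what happens at inference time, when the auxiliary tasks are dropped entirely.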

Datasets and Metrics
Datasets. As shown in Algorithm 1, the training of our CSA-NCT framework consists of two stages: (1) pre-train the model on a large-scale sentence-level NMT corpus (WMT20); (2) fine-tune it on a chat translation corpus (BConTrasT (Farajian et al., 2020) or BMELD (Liang et al., 2021a)). The dataset details (e.g., splits of training, validation, and test sets) are described in Appendix A.
Metrics. For fair comparison, we use SacreBLEU (Post, 2018) and TER (Snover et al., 2006) with the statistical significance test (Koehn, 2004). For En⇔De, we report case-sensitive scores following the WMT20 chat task (Farajian et al., 2020). For Zh⇒En, we report case-insensitive scores. For En⇒Zh, the reported SacreBLEU is at the character level.

Implementation Details
In this paper, we adopt the settings of the standard Transformer-Base and Transformer-Big in (Vaswani et al., 2017) and follow the main setting in (Liang et al., 2021a). Specifically, Transformer-Base uses a hidden size (i.e., d) of 512, a filter size of 2048, and 8 heads in multi-head attention; Transformer-Big uses a hidden size of 1024, a filter size of 4096, and 16 heads. All our Transformer models contain L = 6 encoder layers and L = 6 decoder layers, and all models are trained using the THUMT (Tan et al., 2020) framework. The number of training steps is set to T1 = 200,000 for the first (pre-training) stage and T2 = 5,000 for the second (fine-tuning) stage. The batch size for each GPU is set to 4,096 tokens. All experiments in the first stage are conducted on 8 NVIDIA Tesla V100 GPUs, while we use 4 GPUs for the second, fine-tuning stage; this gives about 8*4096 and 4*4096 tokens per update in the first and second stages, respectively. All models are optimized using Adam (Kingma and Ba, 2014) with β1 = 0.9 and β2 = 0.998, and the learning rate is set to 1.0 for all experiments. Label smoothing is set to 0.1. We use dropout of 0.1/0.3 for the Base and Big settings, respectively. |T| is set to 10. Following (Liang et al., 2021a), we set the number of preceding sentences to 3 in all experiments. Hyper-parameters are selected by BLEU score on the validation sets of both tasks. During inference, the beam size is set to 4 and the length penalty to 0.6 in all experiments.

Effect of α and β
We also investigate the effect of the balancing factors α and β. "Fixed α and β" means we keep α = β = 1 throughout training, while "Dynamic α and β" denotes gradually decaying α and β from 1 to 0 over the 5,000 training steps of the auxiliary tasks. The results in Tab. 1 show that "Dynamic α and β" gives better performance than "Fixed α and β". Therefore, we apply this dynamic strategy in the following experiments.
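One plausible reading of "gradually decrease from 1 to 0 over 5,000 steps" is a linear schedule; the exact decay curve is not specified in the text, so the sketch below is an assumption.

```python
def decay_factor(step, total_steps=5000):
    """Linearly decay a balancing factor from 1 to 0 over total_steps.

    Used for the "Dynamic alpha and beta" strategy: at fine-tuning step
    `step`, both alpha and beta are set to this value. The factor is
    clamped at 0 once total_steps is reached.
    """
    return max(0.0, 1.0 - step / total_steps)

# alpha = beta = decay_factor(step) at each fine-tuning step.
```

With this schedule, the auxiliary tasks dominate early fine-tuning and fade out as training approaches the 5,000-step budget, leaving the pure chat translation objective at the end.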

Comparison Models
Baseline Sentence-Level NMT Models.
• Transformer (Vaswani et al., 2017): The standard sentence-level Transformer model.
• Transformer+FT: The Transformer fine-tuned on chat translation data after being pre-trained on a sentence-level NMT corpus.

Table 2: Results on the test sets of BConTrasT (En⇔De) and BMELD (En⇔Zh) in terms of BLEU (%) and TER (%). The best and second-best results are in bold and underlined, respectively. "†" and "††" indicate statistically significant improvements over the best result of all contrast NMT models with t-test p < 0.05 and p < 0.01, respectively. All "+FT" models apply the same two-stage training strategy as our CSA-NCT model for fair comparison.
Existing Context-Aware NMT Systems.
• Dia-Transformer+FT: The original model is RNN-based, with an additional encoder used to incorporate the mixed-language dialogue history. We re-implement it based on the Transformer, where an additional encoder layer is used to introduce the dialogue history into the NMT model.
• Doc-Transformer+FT (Ma et al., 2020): A state-of-the-art document-level NMT model based on the Transformer, sharing the first encoder layer to incorporate the dialogue history.
• Gate-Transformer+FT: A document-aware Transformer that uses a gate to incorporate the context information. Note that we share the Transformer encoder to obtain the context representation instead of utilizing an additional context encoder, which performs better in our experiments.

Main Results
In Tab. 2, we report the main results on En⇔De and En⇔Zh under the Base and Big settings. For comparison, as in § 4.4, "Transformer" and "Transformer+FT" are sentence-level baselines, while "Dia-Transformer+FT", "Doc-Transformer+FT" and "Gate-Transformer+FT" are existing context-aware NMT systems re-implemented by us. "CSA-NCT" represents our proposed approach.
Results on En⇔De. Under the Base setting, our model substantially outperforms the sentence-level/context-aware baselines by a large margin: compared to the previous best "Gate-Transformer+FT", 1.02↑ BLEU on En⇒De and 1.12↑ BLEU on De⇒En. In terms of TER, CSA-NCT also performs better in both directions, 0.9↓ and 0.7↓ lower than "Gate-Transformer+FT" (the lower the better), respectively. Under the Big setting, our model again consistently surpasses the baselines and other existing systems on En⇒De and De⇒En.
Results on En⇔Zh. We also conduct experiments on the BMELD dataset. Concretely, on En⇒Zh and Zh⇒En, our model also presents notable improvements over the comparison models.

Table 3: Ablation results on the validation sets of each auxiliary task group under the Big setting. "Baseline" represents the NCT model without any auxiliary task. "DCM": dialogue coherence modeling, including MRG, CRG, and NUD. "SPM": speaker personality modeling, i.e., SI. "†" and "††" indicate that the improvement over the baseline model is statistically significant with p < 0.05 and p < 0.01, respectively.

Table 4: Ablation results on the validation sets of each auxiliary task under the Big setting. "†" and "††" indicate that the improvement over the baseline model is statistically significant with p < 0.05 and p < 0.01, respectively.
We have the following findings: (1) DCM substantially improves the NCT model in terms of both BLEU and TER metrics, which demonstrates modeling coherence is beneficial for better translations.
(2) SPM makes slight contributions to the NCT model in terms of BLEU, which is less significant than DCM. However, further human evaluation in § 5.3 will show that our model can keep the personality consistent with the original speaker.
Effect of Each Auxiliary Task. We also investigate the effect of each auxiliary task by adding a single task at a time. In Tab. 4, rows 1∼4 denote adding the corresponding auxiliary task alone to the main chat translation task, and each of them shows a positive impact on model performance (rows 1∼4 vs. row 0).

Dialogue Coherence
Following (Lapata and Barzilay, 2005; Xiong et al., 2019), we measure dialogue coherence via sentence similarity, defined as the cosine similarity between two sentences s_1 and s_2:

$\mathrm{sim}(s_1, s_2) = \cos(f(s_1), f(s_2)),$

where $f(s_i) = \frac{1}{|s_i|}\sum_{w \in s_i} e(w)$ and e(w) is the vector of word w. We use Word2Vec (Mikolov et al., 2013) trained on a dialogue dataset to obtain the distributed word vectors, whose dimension is set to 100.
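The coherence measure above can be sketched in a few lines; the helper names, whitespace tokenization, and the dictionary-of-lists word-vector representation are our own illustrative simplifications of the Word2Vec setup.

```python
import math

def sentence_vector(sentence, word_vectors):
    """f(s) = average of the vectors of the words in s (unknown words skipped)."""
    vecs = [word_vectors[w] for w in sentence.split() if w in word_vectors]
    d = len(next(iter(word_vectors.values())))
    if not vecs:
        return [0.0] * d
    return [sum(v[k] for v in vecs) / len(vecs) for k in range(d)]

def coherence(s1, s2, word_vectors):
    """sim(s1, s2) = cos(f(s1), f(s2)); 0.0 if either vector is all-zero."""
    a = sentence_vector(s1, word_vectors)
    b = sentence_vector(s2, word_vectors)
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0
```

A translated utterance can then be scored against each utterance of its dialogue history, with higher cosine values indicating stronger lexical-semantic coherence.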
Tab. 5 shows the measured coherence of different models on the test set of BConTrasT in De⇒En direction. It shows that our CSA-NCT produces more coherent translations compared to baselines and other existing systems (significance test, p < 0.01).

Human Evaluation
Inspired by (Bao et al., 2020;Farajian et al., 2020), we use three criteria for human evaluation: (1) Coherence measures whether the translation is semantically coherent with the dialogue history; (2) Speaker measures whether the translation preserves the personality of the speaker; (3) Fluency measures whether the translation is fluent and gram-matically correct.
First, we randomly sample 200 conversations from the test set of BMELD in Zh⇒En direction. Then, we use the 6 models in Tab. 6 to generate the translated utterances of these sampled conversations. Finally, we assign the translated utterances and their corresponding dialogue history utterances in target language to three postgraduate human annotators, and ask them to make evaluations from the above three criteria.
The results in Tab. 6 show that our model generates more coherent, speaker-relevant, and fluent translations compared with the other models (significance test, p < 0.05), indicating the superiority of our model. The inter-annotator agreements calculated by Fleiss' kappa (Fleiss and Cohen, 1973) are 0.506, 0.548, and 0.497 for coherence, speaker, and fluency, respectively, indicating "Moderate Agreement" for all three criteria. We also present one case study in Appendix B.

Related Work
Chat NMT. Little prior work is available due to the lack of human-annotated publicly available data (Farajian et al., 2020). Therefore, some existing studies (Wang et al., 2016;Zhang and Zhou, 2019;Rikters et al., 2020) mainly pay attention to designing methods to automatically construct the subtitle corpus, which may contain noisy bilingual utterances. Recently, Farajian et al. (2020) organize the WMT20 chat translation task and first provide a chat corpus post-edited by humans. More recently, based on document-level parallel corpus, Wang et al. (2021) propose to jointly identify omissions and typos within dialogue along with translating utterances by using the context. As a concurrent work, Liang et al. (2021a) provide a clean bilingual dialogue dataset and design a variational framework for NCT. Different from them, we focus on introducing the modeling of dialogue coherence and speaker personality into the NCT model with multi-task learning to promote the translation quality.
Context-Aware NMT. In a sense, chat MT can be viewed as a special case of context-aware MT that has many related studies (Gong et al., 2011;Jean et al., 2017;Wang et al., 2017b;Kang et al., 2020;Ma et al., 2020). Typically, they resort to extending conventional NMT models for exploiting the context. Although these models can be directly applied to the chat translation scenario, they cannot explicitly capture the inherent dialogue characteristics and usually lead to incoherent and speaker-irrelevant translations.

Conclusion
In this paper, we propose to enhance the NCT model by introducing the modeling of the inherent dialogue characteristics, i.e., dialogue coherence and speaker personality. We train the NCT model with four well-designed auxiliary tasks, i.e., MRG, CRG, NUD, and SI. Experiments on En⇔De and En⇔Zh show that our model notably improves translation quality on both BLEU and TER metrics, showing its superiority and generalizability. Human evaluation further verifies that our model yields more coherent and speaker-relevant translations.

A Datasets

BConTrasT. The dataset (https://github.com/Unbabel/BConTrasT) was first provided by the WMT 2020 Chat Translation Task (Farajian et al., 2020); it is translated from English into German and is based on the monolingual Taskmaster-1 corpus (Byrne et al., 2019). The conversations (originally in English) were first automatically translated into German and then manually post-edited by Unbabel editors (www.unbabel.com) who are native German speakers. Having the conversations in both languages allows us to simulate bilingual conversations in which one speaker (the customer) speaks in German and the other speaker (the agent) responds in English.
BMELD. The dataset is a recently released English⇔Chinese bilingual dialogue dataset provided by Liang et al. (2021a). It is based on the dialogue dataset MELD (originally in English) (Poria et al., 2019), a multimodal EmotionLines dialogue dataset in which each utterance corresponds to a video, voice, and text, and is annotated with detailed emotion and sentiment. They first crawled the corresponding Chinese translations from https://www.zimutiantang.com/ and then manually post-edited them according to the dialogue history, relying on native Chinese speakers who are postgraduate students majoring in English. Finally, following (Farajian et al., 2020), they assume 50% of the speakers are Chinese speakers to keep the data balanced for Zh⇒En translations, and build the bilingual MELD (BMELD). For Chinese, we follow them to segment the sentences using Stanford

B Case Study
In this section, we deliver an illustrative case in Fig. 4 to show different outputs among the comparison models and ours.

Dialogue Coherence and Speaker Personality.
For the case in Fig. 4, we find that none of the comparison models generates coherent translated utterances. The reason may be that they fail to capture contextual clues, i.e., "boat". By contrast, we explicitly introduce the modeling of the preceding context through the auxiliary tasks and thus obtain satisfactory results. Meanwhile, we observe that neither the sentence-level models nor the context-aware models preserve the speaker personality information, e.g., the joy emotion, even though the context-aware models incorporate the bilingual conversational history into the encoder.
The case shows that our CSA-NCT model enhanced by the four auxiliary tasks yields coherent and speaker-relevant translations, demonstrating its effectiveness and superiority.