Neural Machine Translation with Heterogeneous Topic Knowledge Embeddings

Neural Machine Translation (NMT) has shown a strong ability to utilize local context to disambiguate the meaning of words. However, it remains a challenge for NMT to leverage broader context information such as topics. In this paper, we propose heterogeneous ways of embedding sentence-level topic information into an NMT model to improve translation performance. Specifically, the topic information can be incorporated as pre-encoder topic embedding, post-encoder topic embedding, and decoder topic embedding to increase the likelihood of selecting target words from the same topic as the source sentence. Experimental results show that NMT models with the proposed topic knowledge embedding outperform the baselines on the English → German and English → French translation tasks.


Introduction
Neural Machine Translation (NMT) utilizes local context captured from the mapping between bitexts to disambiguate the meaning of words. While existing NMT models can handle meaning ambiguities based on local contexts learned from explicit collocations, it remains a challenge for NMT to produce accurate results for words that appear in implicit collocations. An implicit collocation refers to the circumstance in which the meaning of two or more words cannot be learned from the available training data; in that case, broader context information such as topics may be used to recover an accurate meaning. For example, in the sentence "he likes bank fishing", the phrase "bank fishing" is likely to be mistranslated into Chinese as "银行钓鱼" (where "银行" is a financial bank) because the training data lack a collocation of "bank (河岸)" and "fishing (钓鱼)". The accurate translation "河岸钓鱼" becomes reachable when the shared topic ("recreational sport") is leveraged.
Incorporating topic information into NMT has been explored by Zhang et al. (2016) and Wei et al. (2019), with both studies adopting Latent Dirichlet Allocation (LDA) (Blei et al., 2003) to model the topics of the source and target languages. Both works used the traditional encoder-decoder architecture with gated recurrent units (GRU) (Cho et al., 2014). Although Wei et al. (2019) showed that topic knowledge incorporation is also applicable to the Transformer architecture (Vaswani et al., 2017), we argue that jointly learning topic modeling and NMT is not ideal. Topic models can be trained on a large volume of easily accessible monolingual data, and once a topic model is learned, it can be reused in different translation scenarios without retraining the NMT models. Decoupling topic modeling from NMT is therefore a more flexible and scalable option.
* Co-first authors.
1 The code is available at https://github.com/Vicky-Wil/topic-NMT
In this paper, we propose heterogeneous ways of incorporating topic information into the Transformer architecture. Specifically, the topic information can be incorporated as pre-encoder topic embedding (ENC_pre), post-encoder topic embedding (ENC_post), and decoder topic embedding (DEC). In addition, the topic distribution learned for each word (its topic embedding) is summarized at the sentence level and fed into the NMT model. The intuition is that aggregating topic distributions at the sentence level produces more accurate topic information than working at the word level, because it lets topic modeling take into account the context conveyed by the whole sentence. Each target word is generated with the guidance of the topic information of both the source and target sentences. The topic-enhanced NMT models are trained on the WMT14 English → German translation task and tested on a range of WMT datasets. Experimental results show that our approach significantly improves translation quality, with the topic embedding achieving an improvement of up to +1.57 BLEU over the Transformer baselines. The effect of the proposed method is also verified on the English → French translation task.

Related Work
Many studies have focused on using topic information as explicit prior knowledge to help models learn sentence representations in NLP tasks, such as Zhang et al. (2017), Kim (2014), and Kobus et al. (2017). Topic modeling has shown its effectiveness in statistical machine translation (SMT) models (Xiao et al., 2012; Xiong et al., 2015; Hasler et al., 2014). Incorporating topic information into NMT has recently been explored by Chen et al., as well as by Zhang et al. (2016) and Wei et al. (2019); the latter two studies adopted LDA to model the topics of the source and target languages. Dieng et al. (2020) pointed out that LDA is not an effective learner for data with an extensive vocabulary, because one has to remove the most and least frequent words to fit good topic models. This pruning practice limits the scope of LDA models. The embedding topic model (ETM) (Dieng et al., 2020) was proposed to model each term as an embedding and each topic as a point in that embedding space. The per-topic conditional probability of a term has a log-linear form that preserves a low-dimensional representation of the vocabulary, so that ETM can discover interpretable topics with large vocabularies, including rare words and stop words. In this study, we apply ETM to handle the issues associated with large vocabularies. Another approach used a variant of convolutional neural networks (CNN) to learn latent topic representations implicitly from sentence-level context. In that approach, an additional multi-head attention module learns the attention between topics and target words independently of the Transformer encoding. The same work also tried an explicit topic representation computed by TF-IDF, but it did not perform better than the latent version. In this paper, we propose multiple heterogeneous ways of explicitly integrating topic information into NMT, resulting in better performance.
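As a rough illustration of the log-linear form mentioned above, the sketch below computes a per-topic distribution over terms as a softmax of the inner products between term embeddings and a topic embedding, in the spirit of the ETM formulation of Dieng et al. (2020). The toy vocabulary, the 2-dimensional vectors, and the function names are assumptions made only for this example and are not taken from the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def topic_word_distribution(term_embeddings, topic_embedding):
    """Log-linear per-topic term distribution: softmax of inner products."""
    scores = [
        sum(r * a for r, a in zip(rho, topic_embedding))
        for rho in term_embeddings
    ]
    return softmax(scores)

# Hypothetical toy data: three terms and one topic in a 2-d embedding space.
vocab = ["bank", "river", "loan"]
term_embeddings = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]
nature_topic = [2.0, 0.0]  # a topic point lying close to "river"

probs = topic_word_distribution(term_embeddings, nature_topic)
for word, p in zip(vocab, probs):
    print(f"{word}: {p:.3f}")
```

Because every term, frequent or rare, has an embedding, each term receives a probability under every topic without any vocabulary pruning.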

Topic-enhanced Neural Machine Translation
Figure 1 illustrates the proposed topic-enhanced NMT model with the topic embeddings ENC_pre, ENC_post, and DEC, built upon the Transformer architecture. The topic knowledge in the figure is obtained from the topic embedding tables for the source and target languages produced by ETM.

Pre-encoder Topic Embedding
In the encoding phase, we convert the sequence of words into a sequence of word embeddings x_i and a sequence of topic embeddings t_i, as shown in Figure 2(a). Each word embedding is obtained by looking up the word embedding table, which is randomly initialized and updated during training. The topic embedding table is pre-calculated as an intermediate product of ETM and is kept fixed during NMT training. We then sum all the topic embeddings in the sequence to produce the topic information distribution of the whole sentence, topic_s, which is added to the word embedding of each source word. Finally, we take the resulting representation e_i as the input embedding and feed it into the encoder together with the positional encoding.
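The pre-encoder step described above can be sketched as follows. This is a minimal illustration in plain Python, assuming that word and topic embeddings share the same dimension so that they can be added directly; the function names and toy values are hypothetical, and the positional encoding and the encoder itself are omitted.

```python
def vec_add(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]

def sentence_topic(topic_embeddings):
    """Sum the per-word topic embeddings t_i into one sentence vector topic_s."""
    total = [0.0] * len(topic_embeddings[0])
    for t in topic_embeddings:
        total = vec_add(total, t)
    return total

def pre_encoder_inputs(word_embeddings, topic_embeddings):
    """Add the same sentence-level topic vector to every word embedding x_i."""
    topic_s = sentence_topic(topic_embeddings)
    return [vec_add(x, topic_s) for x in word_embeddings]

# Hypothetical 3-word sentence with 2-dimensional embeddings.
word_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]      # x_i (trainable)
topic_embeddings = [[0.5, 0.0], [0.25, 0.25], [0.0, 0.5]]   # t_i (fixed, from ETM)

inputs = pre_encoder_inputs(word_embeddings, topic_embeddings)  # e_i
print(inputs)
```

In this sketch every position receives the same sentence-level topic vector, so the topic signal reflects the whole sentence rather than a single word.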

Post-encoder Topic Embedding
The topic information distribution can also be added to each corresponding output of the encoder. In this way, the NMT decoder can implicitly attend to the topic distributions of the source words. The topic context vector is then computed from these topic-enhanced hidden states.
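The addition on the encoder output side can be sketched as below. The text does not fully specify whether a sentence-level or a per-word topic vector is added, so this sketch assumes the sentence-level vector topic_s and assumes it has the same dimension as the encoder outputs; all names and values are hypothetical.

```python
def post_encoder_states(encoder_outputs, topic_s):
    """Add the topic vector to every encoder output state (assumed equal dims)."""
    return [
        [h + t for h, t in zip(state, topic_s)]
        for state in encoder_outputs
    ]

# Hypothetical encoder outputs for a 2-word sentence, 2-dimensional states.
encoder_outputs = [[0.5, -1.0], [2.0, 0.0]]
topic_s = [0.75, 0.75]  # assumed sentence-level topic vector

enhanced = post_encoder_states(encoder_outputs, topic_s)
print(enhanced)
```

The decoder's cross-attention would then operate on `enhanced` instead of the raw encoder outputs, which is how the topic information reaches the decoder implicitly.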

Decoder Topic Embedding
The topic information can also be incorporated on the decoder side, as shown in Figure 2(b). At time step j − 1, we look up the output word y_{j−1} in the target-language topic embedding table to get its topic representation t_{j−1}, and we obtain the topic embedding topic_{j−1} by adding t_{j−1} to the previous topic embedding topic_{j−2}. The decoder topic embedding topic_{j−1} is then added to the previous output token y_{j−1} to participate in the decoding process. At time step j, the topic decoder generates the target word y_j. Accordingly, the j-th hidden state of the topic decoder, s_j, is updated by applying f(·) to the quantities shown in Figure 2(b), where c_j is the context vector obtained by the attention mechanism, e(·) is the topic embedding table on the target side, and f(·) is the non-linear function of the decoder. Consequently, the topic decoder can combine the topic knowledge of previously generated target words with the topic information of the source sentence to increase the likelihood of selecting words from the same topic.

Figure 2: (a) The ⊕ adds the sentence topic information to the word embedding to generate the input embedding. (b) s_{<j} denotes the hidden states of the decoder, y_{j−1} is the output token at step j − 1, t_{j−1} is the topic embedding for token y_{j−1}, and c_j is a context vector.

Table 1: BLEU scores on the EN → DE translation task.

MODEL                                          BLEU
Transformer (base) (Vaswani et al., 2017)      27.3
Transformer (big) (Vaswani et al., 2017)       28.4
Evolved Transformer (So et al., 2019)          28.4
DPE-NMT (Li et al., 2020)                      27.61
Transformer base + PR (Xu et al., 2020)        28.67
Fairseq (baseline) (Ott et al., 2019)          27.44
BLT-NMT (Wei et al., 2019)                     27.93
LTR-NMT                                        28.18
Topic-enhanced NMT (ours)                      29.01

et al., 2003) and Moses (Koehn et al., 2007) to tokenize all the data. Besides, we use both source and target vocabularies with the 32K most frequent words for DE and 44K words for FR.

Training Details
We preprocess the corpus for all ETM experiments. We set the number of topics to 50 and the number of epochs to 500, which are empirical values adopted from ETM. After preprocessing, we further remove one-word documents from the validation and test sets. For all NMT experiments, we train our models on one machine with 4 NVIDIA V100 GPUs and follow the base model of Vaswani et al. (2017) for the hyper-parameters and model configuration. The number of parameters is 129M. We compare our topic model against the following models: Fairseq (base), the Transformer base model implemented in the Fairseq sequence modeling toolkit (Ott et al., 2019); BLT-NMT, a topic-enhanced model that incorporates bilingual topic knowledge into NMT (Wei et al., 2019); and LTR-NMT, a topic-based NMT model that uses a CNN.

Results
The results of several existing state-of-the-art (SOTA) models on the same dataset, including Base Transformer and Big Transformer (Vaswani et al., 2017), Evolved Transformer (So et al., 2019), Dynamic Programming Encoding NMT (Li et al., 2020), and Phrase Representations Transformer (Xu et al., 2020), are quoted as a reference. For a fair comparison, we list the single best result reported in each paper. The experimental results on EN → DE are shown in Table 1. Our baseline model, based on the Transformer base architecture implemented in Fairseq, achieves a BLEU score of 27.44, comparable to the 27.3 reported by Vaswani et al. (2017). Our topic NMT model achieves a BLEU score of 29.01, significantly outperforming the Fairseq baseline by +1.57 BLEU points. Compared to BLT-NMT and LTR-NMT, our model scores +1.08 and +0.83 BLEU points higher, respectively.
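The reported gains follow directly from the BLEU scores in Table 1. The short check below recomputes them; the dictionary keys are just labels chosen for this example, and the scores are the values quoted in Table 1.

```python
# BLEU scores as quoted in Table 1 (EN -> DE).
bleu = {
    "Fairseq (baseline)": 27.44,
    "BLT-NMT": 27.93,
    "LTR-NMT": 28.18,
    "Topic-enhanced NMT (ours)": 29.01,
}

ours = bleu["Topic-enhanced NMT (ours)"]

# Gain of the topic-enhanced model over each compared system,
# rounded to two decimals to avoid floating-point noise.
gains = {
    name: round(ours - score, 2)
    for name, score in bleu.items()
    if name != "Topic-enhanced NMT (ours)"
}

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} BLEU")
```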
To further investigate the effectiveness of our topic NMT model and to study the main factors that influence the results, we also compare different topic NMT variants on the newstest 2014, 2016, 2017, and 2019 datasets for EN → DE and on newstest 2014 for EN → FR. Ablation tests are performed to investigate the effects of the three topic embedding options: ENC_pre, ENC_post, and DEC. The experimental results are shown in Table 2. NMT with ENC_pre, ENC_post, and DEC achieves BLEU improvements of +1.28, +1.52, and +1.15, respectively, over the baseline score on newstest 2014 for EN → DE. The NMT models with the four different combinations score +1.31, +1.57, +1.55, and +1.54 BLEU points higher than the baseline on newstest 2014. Almost all experiments achieve higher BLEU scores than the baselines across the different test sets. A consistent finding is confirmed in the EN → FR translation direction, indicating the effectiveness of the proposed method.
Examples of topic-enhanced NMT for EN → DE are shown in Table 3. For example, the base NMT model mistranslates "Systematic Theology" as "Systemtheorie" ("systems theory" in English), whereas the topic-enhanced NMT model correctly translates it as "Systematische Theologie".

Conclusion
In this paper, we propose heterogeneous ways of incorporating topic information as prior knowledge into the Transformer architecture to improve translation performance. The topic information can be incorporated as pre-encoder topic embedding, post-encoder topic embedding, and decoder topic embedding. Experimental results demonstrate that the proposed method can significantly improve translation quality by boosting the BLEU scores over the Transformer baselines on the English → German and English → French translation tasks.