Guiding Neural Machine Translation with Semantic Kernels



Introduction
Machine translation has been a long-standing task in natural language processing (Brown et al., 1990). Recently, neural machine translation (NMT) models (Bahdanau et al., 2015; Wu et al., 2016; Vaswani et al., 2017) have made great progress and become the mainstream machine translation framework. Most NMT models adopt the encoder-decoder framework: the encoder transforms the source sentence into source-side global representations, and the decoder generates the target sentence auto-regressively, based on the source-side representations and the translation history.
However, one limitation of such auto-regressive decoding is that the generation of word y_t only has access to the target-side partial information y_{<t}. If the translation history is mistranslated, the error propagates to all subsequent words (Bengio et al., 2015). This also makes the generation heavily dependent on the source sentence, so that minor changes in the source sentence may lead to dramatic degradation in the translation outcome (Cheng et al., 2019). Intuitively, using target-side global information to guide the translation process can alleviate this problem.
Attempts have been made to apply global information to guide the decoding process. We categorize them into two main lines. One is the draft approach (Xia et al., 2017; Wang et al., 2019; Li et al., 2018; Zhang et al., 2018; Zhou et al., 2019), which generates a coarse target sequence to guide the translation process, as depicted in Figure 1 (b). However, generating a usable coarse draft sentence requires delicate design, so these methods often require multiple decoding passes. The other is latent semantics (Shah and Barber, 2018; Zheng et al., 2020; Ai and Fang, 2021; Eikema and Aziz, 2019; Zhang et al., 2016; Su et al., 2018), which adopts generative methods (i.e., VAE (Kingma and Welling, 2014)) to model the semantics of source and target sentences in a latent semantic space. As in Figure 1 (c), such methods usually project the semantics into one fixed-length vector, which is limited in expressing the semantics of long sentences. Although the above methods successfully inject global information into the decoding process, they both incur extra computational cost, which considerably slows inference compared to the vanilla Transformer model.
Motivated by the Functional Equivalence Theory (Nida and Taber, 1982), we propose Semantic Kernels with Adaptive Mask (SKAM) for NMT. To guide translation, we extract several semantic kernels from the source sentence, each of which expresses one semantic segment of the original sentence, as shown in Figure 1 (d). Together, the semantic kernels capture the essential meaning of the source sentence, and they are then mapped from the source space to the target space with an N-gram smoothing loss to serve as target-side global information. We also improve auto-regressive decoding with an adaptive mask mechanism to guarantee the usage of semantic kernels during decoding. We evaluate the performance on several MT benchmarks covering various data scales, languages and domains. Experiments show that our approach achieves significant improvements over the baselines and is about 1.7 times faster at inference than previous works on average. In summary, our contributions are:
• Inspired by Functional Equivalence Theory, we extract several semantic kernels from a source sentence to capture source semantics, which express sentence semantics at a new granularity.
• To map semantic kernels from the source side to the target side, we propose an N-gram smoothing loss, which ensures that each semantic kernel captures one semantic segment rather than one specific word.
• We design an adaptive mask mechanism to guarantee that each decoding step can access comprehensive information: both preceding words (translation history) and subsequent words (semantic kernels).
2 Preliminaries and Related Work

Functional Equivalence Theory
The main point of Functional Equivalence Theory (Nida and Taber, 1982) is that translation should focus on the functional equivalence of information (sense-for-sense translation) rather than direct formal equivalence (word-for-word translation). To this end, Nida and Taber (1982) propose a translation framework consisting of three parts. Decompose: To get rid of the complex and ambiguous structure of the source sentence, the source sentence is split into several simple, short sentences, each of which captures one semantic segment of the original sentence. These simple sentences are called "kernel sentences", based on Transformational Generative Grammar (Chomsky, 2009).
Transfer: The kernel sentences are translated into the receptor language. Owing to their simplicity, the kernel sentences can be translated easily, and the translated kernel sentences can capture all source semantics, since languages agree far more on the level of kernel sentences than on the level of more elaborate structures (Nida and Taber, 1982).
Restructure: The transferred kernel sentences are restructured semantically and stylistically into the surface structure of the target language.
Inspired by this theory, we aim to make the translation comply more with the meaning of the source sentence than with its surface words. Hence, we propose SKAM, which first decomposes the source sentence into semantic kernels (Kernel Selection Module), then transfers the semantic kernels into the target embedding space (Kernel Projection Module), and finally restructures them into a target sentence (Decoding Module).

Neural Machine Translation
Formally, let X = {x_0, x_1, ..., x_I} and Y = {y_0, y_1, ..., y_J} denote a source and a target sequence respectively, where I and J are the sentence lengths. Given a bilingual sentence pair ⟨X, Y⟩, an NMT model learns a set of parameters Θ to maximize the posterior probability:

P(Y | X; Θ) = ∏_{t=1}^{J} P(y_t | y_{<t}, X; Θ),

where y_{<t} is the partial translation containing the target tokens before position t.
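The auto-regressive factorization above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `step_probs` stands in for the per-step probabilities P(y_t | y_{<t}, X) that any NMT decoder would produce.

```python
import math

def sentence_log_prob(step_probs):
    """Log-probability of a target sentence under the auto-regressive
    factorization P(Y|X) = prod_t P(y_t | y_<t, X).

    step_probs: list of P(y_t | y_<t, X), one value per target position
    (a toy stand-in for real decoder outputs).
    """
    # The product of probabilities becomes a sum of log-probabilities.
    return sum(math.log(p) for p in step_probs)
```

Because the probabilities multiply, a single mistranslated step with low probability drags down the whole sentence score, which mirrors the error-propagation problem discussed in the introduction.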

Transformer
The Transformer model is based solely on the attention mechanism. Given query Q, key K and value V, the output ATT(Q, K, V) is calculated as:

ATT(Q, K, V) = softmax(QK^⊤ / √d) V,

where √d is the scaling factor, with d being the embedding dimension.
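The scaled dot-product attention above can be sketched directly in numpy. This is an illustrative single-head version without masking or batching, not the paper's implementation:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # (n_queries, n_keys)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # each row sums to 1
    return weights @ V                             # weighted sum of values
```

With Q = K = V taken from the same sequence this is the self-attention used in the encoder; with K, V taken from the encoder output it becomes the decoder's cross-attention.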
The Transformer model employs a multi-layer encoder and decoder to perform the translation task, with residual connections among layers. Denoting the output of the l-th layer as H^l, the encoder computes:

H̃^l = LN(ATT(H^{l−1}, H^{l−1}, H^{l−1}) + H^{l−1}),
H^l = LN(FFN(H̃^l) + H̃^l),

where LN(·) and FFN(·) are layer normalization and a feed-forward network with ReLU activation in between. As Q, K and V all come from the same place, this attention is referred to as self-attention.
The decoder is similar in structure to the encoder, except that it includes another attention mechanism, called cross-attention, which attends to the output of the encoder stack H^{L_e}:

ATT(H^{l−1}_d, H^{L_e}, H^{L_e}),

where the top layer of the decoder, H^{L_d}, is used to generate the final output sequence.

Target Information Enhanced NMT
Some impressive works have considered adding target information for better translation quality. Most closely related to our work are the Deliberation Network (Xia et al., 2017) and Soft-prototype (Wang et al., 2019). These methods first generate a coarse draft to guide the translation process; their main idea is to deliberate the wrong parts from the previous decoding step. Some other works adopt bidirectional decoding (Li et al., 2018; Zhang et al., 2018; Zhou et al., 2019) or multi-pass decoding (Geng et al., 2018). Ma et al. (2018) apply the target bag of words as targets to train an NMT model. In comparison, our motivation is to extract semantic kernels that capture the essential meanings of the source sentence and replenish these semantic segments to form the final target sentence.
Also related are the works of Zheng et al. (2020); Ai and Fang (2021); Shah and Barber (2018); Zhang et al. (2016); Su et al. (2018), which apply generative methods (VAE (Kingma and Welling, 2014)) to sample a latent semantic embedding. Compared with these methods, we select different numbers of semantic kernels according to the source sentence and avoid the EM-like decoding process, making our method more expressive and efficient.
In work similar to SKAM, Zhao et al. (2018) and Wang et al. (2017) integrate a phrase memory from a phrase-based statistical machine translation (SMT) system to guide the NMT model. Niehues et al. (2016) first adopt a phrase-based SMT system to pre-translate and then generate the final translation with an NMT model. However, these methods cannot work without an SMT system at inference time, which limits their applicability.

NMT with Semantic Kernels
To make the NMT model comply more with the source sentence meaning than with the source sentence form, we propose SKAM, which consists of three modules: a kernel selection module, a kernel projection module, and a decoding module, as depicted in Figure 2. We explain each module in the following sections.

Semantic Kernels Selection
Semantic kernels aim to capture the essential meaning of the source sentence, and each of them should contain a semantic segment of the original sentence. Following Nida and Taber (1982), who claim that words acquire meaning through their context, we use the contextual embeddings of the content words to represent semantic kernels. Formally, semantic kernels are defined as:

K^S = { ENC(x_i | X) | s(x_i) > γ, x_i ∈ X },   (5)

where ENC denotes the Transformer encoder and s(·) is a norm-based significance score that locates the content words of the source sentence. Note that this definition of semantic kernels is simple; in future work we will try to extract semantic kernels directly from the latent semantic space.

Norm-based Significance Score
The significance score measures a word's ability to express essential meaning using the L2-norm of its word embedding. Intuitively, words with higher L2-norms play a leading role when all word embeddings are summed to form a sentence embedding. This property of the L2-norm has been demonstrated by previous works (Luhn, 1958; Chen et al., 2020a; Liu et al., 2020). We use the embedding matrix of our model to calculate the L2-norm. As the norms of the embedding matrix vary during training, we scale each word norm ||x_i|| by the current largest word norm max_{v∈V_S} ||v|| in the source embedding. Our significance score s(·) is formulated as:

s(x_i) = ||x_i|| / max_{v∈V_S} ||v||,

where we only choose words whose score s(x_i) is larger than a norm threshold γ ∈ [0, 1] as content words. To better understand which kinds of words are selected by the norm-based significance score, we sample some cases and illustrate them in Appendix A.
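The scoring and selection described above can be sketched as follows. This is a minimal illustration under the simplifying assumption that `embeddings` holds the embedding rows of the sentence's tokens and that the maximum is taken over those rows rather than the full source vocabulary, as in the actual model:

```python
import numpy as np

def significance_scores(embeddings):
    """Norm-based significance score: each word's L2 embedding norm,
    scaled by the largest norm present (the paper scales by the largest
    word norm in the source vocabulary)."""
    norms = np.linalg.norm(embeddings, axis=-1)
    return norms / norms.max()

def select_content_words(tokens, embeddings, gamma=0.5):
    """Keep tokens whose significance score exceeds the threshold gamma."""
    scores = significance_scores(embeddings)
    return [tok for tok, sc in zip(tokens, scores) if sc > gamma]
```

A token with a large embedding norm (typically a content word) scores close to 1 and is kept; function words with small norms fall below γ and are filtered out.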

Semantic Kernels Projection
We apply a projector to map the source-side semantic kernels K^S to the target-side K^T:

K^T = f_{S→T}(K^S),

where f_{S→T} is a neural projector, K^S, K^T ∈ R^{Q×d}, Q is the number of semantic kernels, and d is the embedding size.
Since words acquire meaning through their context (Nida and Taber, 1982), we train the projector to predict both the content words and their context, so as to better capture the deep meaning beneath the surface expression. We propose an N-gram smoothing loss that trains the projector to concentrate on representing meaning rather than a specific word.

N -gram Smoothing Loss
Given the encoder output ENC(x_i | X) of each source word, the projector is trained to predict the corresponding target N-gram span Span(y_i). We apply an external alignment tool to find the aligned target word ỹ_i and group every N consecutive target words as an N-gram span. Formally,

Span(y_i) = { ỹ_{i−k}, ..., ỹ_i, ..., ỹ_{i+k} },

where k = (N − 1)/2 and N is a hyper-parameter controlling how many words we select each time. The N-gram span is then used as the label to train the projector with ENC(x_i | X) as input. The N-gram smoothing loss L_g for one sample X is formulated as:

L_g = − ∑_i ∑_{y ∈ Span(y_i)} (1/N) log P(y | ENC(x_i | X)).

The output word embedding matrix in the projector shares parameters with the decoder, and the projector is removed at inference time, as shown in Figure 2.
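The span construction above can be sketched as follows. This is an illustrative helper, assuming the external alignment tool has already produced the target position aligned to the source word; boundary clipping at the sentence edges is an assumption, since the paper does not specify it:

```python
def ngram_span(target_tokens, aligned_pos, N):
    """Target N-gram span around the aligned target word: the word at
    `aligned_pos` plus k = (N-1)//2 neighbours on each side, clipped
    at the sentence boundaries."""
    k = (N - 1) // 2
    lo = max(0, aligned_pos - k)
    hi = min(len(target_tokens), aligned_pos + k + 1)
    return target_tokens[lo:hi]
```

With N = 1 the span degenerates to the single aligned word (no smoothing); with N = 3, the paper's default, each source content word is supervised by its aligned word plus one neighbour on each side.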

Decoding with Semantic Kernels
To give the decoding process comprehensive target-side information, we modify the original self-attention module in the decoder into an adaptive attention module, which can utilize both preceding words (from the translation history) and subsequent words (from the semantic kernels) for prediction. Specifically, we concatenate the semantic kernels to the K, V parts of the self-attention module in all decoder layers:

ATT(H^{l−1}_d, [K^T; H^{l−1}_d], [K^T; H^{l−1}_d]).
Similar to Zheng et al. (2019), we explicitly separate the semantic kernels into two groups: fully-accessed and not-yet-accessed. As translation progresses, we propose an adaptive mask to gradually remove the semantic kernels that have been fully accessed in the translation history.

Adaptive Mask
Assuming 0 means the unmask operation and 1 means the mask operation, the attention mask M for semantic kernels should satisfy:

M(κ^T_q, y_t) = 1 if κ^T_q is contained in y_{<t}, else 0,   (11)

where κ^T_q ∈ K^T is the q-th semantic kernel. We use the previous attention score A_{<t} to measure whether semantic kernel κ^T_q has been fully accessed in the translation history. That is, if κ^T_q receives the largest attention score at time step t, we assume κ^T_q is fully accessed at time step t and mask it in subsequent time steps, as illustrated in Figure 3. Formally, we update the attention mask M(κ^T_q, y_t) based on the previous attention mask and attention score:

M(κ^T_q, y_t) = M(κ^T_q, y_{t−1}) ∨ 1[q = argmax_{q'} A_{t−1}(q')],   (12)

where ∨ is the logical OR operator and A_{t−1} denotes the attention score at time step t − 1. To preserve parallel training in the Transformer, we mask each semantic kernel after its aligned target token (from the external alignment tool) is generated during training.
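The per-step mask update can be sketched as below. This is a simplified inference-time illustration, assuming `prev_attn` already contains the previous step's attention scores over the Q kernels (with already-masked kernels set to a very low score so they cannot win the argmax):

```python
import numpy as np

def update_mask(prev_mask, prev_attn):
    """Adaptive mask update: once a semantic kernel receives the largest
    attention score at a step, mask it for all subsequent steps.

    prev_mask: bool array of shape (Q,), True = masked (fully accessed).
    prev_attn: attention scores over the Q kernels at step t-1.
    """
    winner = int(np.argmax(prev_attn))
    new_mask = prev_mask.copy()
    new_mask[winner] = True  # logical OR with the one-hot winner
    return new_mask
```

Running this once per decoding step gradually shrinks the set of visible kernels, so the decoder's attention is steered toward the not-yet-translated semantic segments.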

Training Strategy
The overall loss function is divided into two parts: a translation loss L_D and an N-gram smoothing loss L_g for the projector:

L = L_D + λ L_g,

where λ ∈ [0, 1] is a hyper-parameter that balances the impact of the two losses. Details about the N-gram smoothing loss can be found in Sec-3.2. After integrating semantic kernels, the translation loss becomes:

L_D = − ∑_{t=1}^{J} log P(y_t | y_{<t}, K^T, X; Θ).

We set a norm threshold γ to control how strictly we choose content words, as explained in Sec-3.1. However, the norm calculation at early training stages is usually unreliable. We therefore propose norm threshold annealing, which computes the effective threshold as e · γ + (1 − e), where e is gradually annealed from 0 to 1 during the first 1/3 of the training steps.
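The annealing schedule can be sketched in a few lines. This is an illustration assuming a linear ramp for e, which the paper does not specify explicitly:

```python
def annealed_threshold(step, total_steps, gamma=0.5):
    """Norm threshold annealing: e * gamma + (1 - e), with e ramping
    linearly from 0 to 1 over the first third of training.

    The effective threshold starts at 1 (select almost no content
    words while embedding norms are unreliable) and relaxes to gamma.
    """
    warmup = total_steps / 3
    e = min(1.0, step / warmup)
    return e * gamma + (1.0 - e)
```

For example, with γ = 0.5 the threshold is 1.0 at step 0, 0.75 halfway through the warm-up, and stays at 0.5 after the first third of training.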

Experiments
We conduct experiments on the following benchmarks: NIST Chinese to English (Zh→En), WMT14 English to German (En→De), WMT14 English to French (En→Fr), IWSLT14 English to/from German (En↔De) translation tasks.

Datasets
For WMT14 En→De, the training corpus is identical to previous work (Wang et al., 2019) and consists of about 4.5M sentence pairs. The validation set is newstest2013 and the test set is newstest2014. The WMT14 En→Fr dataset contains 36M sentence pairs; the validation set is the concatenation of newstest2012 and newstest2013, and test results are reported on newstest2014 as in Wang et al. (2019). Following previous work (Yang et al., 2020), the IWSLT14 En→De dataset contains 160k sentence pairs for training and 7584 sentence pairs for validation; the concatenation of validation sets (dev2010, dev2012, tst2010, tst2011, tst2012) is used as the test set. For NIST Zh→En, we use the LDC corpus with 1.25M sentence pairs, containing 27.9M Chinese words and 34.5M English words. We use NIST 2002 as the validation set for model selection and hyper-parameter tuning. The NIST 2004 (MT04), 2005 (MT05), 2006 (MT06) and 2008 (MT08) datasets are used as test sets.
We choose the Stanford segmenter (Tseng et al., 2005) for Chinese word segmentation and apply the tokenizer.pl script of Moses (Koehn et al., 2007) for English, French, and German tokenization. All data are jointly byte-pair encoded (BPE) (Sennrich et al., 2016). For WMT/IWSLT, we create a joint vocabulary with 32k and 10k merge operations, respectively. For NIST Zh→En, BPEs are learnt separately with 60k operations.

Model Configuration
Fundamental Transformer is implemented with fairseq (Ott et al., 2019). We follow the most common model configuration for each dataset: for IWSLT/NIST/WMT, we use the small/base/big Transformer model. In detail, the encoder and decoder each include 6 layers; all layers have an embedding size of 512/512/1024, a feed-forward size of 1024/2048/4096 and 4/8/16 attention heads, respectively. To prevent overfitting, we use a dropout rate of 0.3 (except for WMT14 En→Fr, which uses 0.1) and label smoothing of 0.1. For IWSLT and NIST, we train the model on a single P100 GPU with each batch containing 4096 tokens. For WMT, we train on 6 P100 GPUs with the update frequency set to 2, which results in 2500×6×2 tokens per batch. We average the last 5/20 checkpoints for the base/big model and use the checkpoint with the best validation performance for the small model. We use case-sensitive tokenized BLEU with multi-bleu.perl (Papineni et al., 2002) to evaluate the WMT tasks and case-insensitive tokenized BLEU with mteval-v11b.pl for NIST Zh→En. We report sacreBLEU (Post, 2018) results for IWSLT. All experiments are run 4 times and we report the average BLEU.
Projector is implemented as a Transformer encoder with 3 layers. The feed-forward size and number of attention heads are the same as the fundamental Transformer for each dataset. After adding the projector, the training speed is on average about 80% of that of the vanilla Transformer. For all benchmarks, we set λ = 0.3 heuristically. The norm threshold γ is set to 0.5 and N = 3 in our main experiments unless otherwise specified. We update the adaptive mask with the attention scores from the top layer of the decoder.

Baselines
For a strictly consistent comparison, we include the following strong baselines. Transformer (Vaswani et al., 2017) is the strong baseline upon which we build our model. Deliberation Network (Xia et al., 2017) and Soft-Prototype (Wang et al., 2019) first generate a draft and then polish it for the final translation. GNMT (Shah and Barber, 2018), Mirror-GNMT (Zheng et al., 2020) and the model of Ai and Fang (2021) sample a latent semantic embedding from the semantic space and treat it as global information for decoding.

Results and Comparison
The results for WMT14 En→{De, Fr} and NIST Zh→En are presented in Table 1, and the results on IWSLT14 En↔De are in Table 2. For convenience, we refer to our model as "SKAM" in these tables. We summarize the results as follows. Semantic kernels improve model performance.
Compared with the Transformer baseline, our approach brings substantial improvements on all four benchmarks, 1.07 BLEU points on average. Our model obtains competitive performance compared with previous methods on several benchmarks, and even surpasses all previous methods with a 29.52 BLEU score on the WMT14 En→De benchmark. All results are statistically significant with p < 0.01 under paired bootstrap sampling (Koehn, 2004).
Semantic kernels are time efficient. As our semantic kernels are generated non-autoregressively, our model needs only about 17% extra time to generate them. Compared with previous work, our model is about 1.7 times faster on average, and even 2 times faster than some latent-semantic-based methods.

Ablation Study
We perform an ablation study to show the effectiveness of each module on the IWSLT14 En↔De benchmarks. The results are shown in Table 2. Specifically, "w/o s(·)" compares our model with a baseline in which the decoder extends its K, V matrices with random parameters. The results also show that the improvements mainly come from our design, not from the increase in parameters.

Parameter Analysis
Effect of Norm Threshold The norm threshold γ controls how strictly we select semantic kernels. In general, the larger γ is, the fewer words are selected as semantic kernels. To further examine the impact of the norm threshold γ, we conduct experiments on the IWSLT14 En→De benchmark. From the results, we find that when γ < 0.5, the performance increases as we filter out more and more words irrelevant to expressing the semantics. When γ > 0.5, performance gradually decreases and the model eventually degenerates to the Transformer baseline.
Effect of N-gram We also test the impact of the N-gram smoothing supervision on the projector and report the results in Table 3. Intuitively, a larger N better disambiguates each word, while a smaller N yields greater discrepancy among the representations. From Table 3, we find that the N-gram smoothing loss is critical to the projector and that N = 3 strikes a balance between discrepancy and disambiguation.

Performance w.r.t Sentence Length
Following previous work (Wang et al., 2019), we divide the source sentences into different groups according to sentence length and compute the BLEU score separately for each group on the WMT14 En→De task, as shown in Figure 4. Generally, the longer the source sentence is, the more influential the semantic kernels are. This demonstrates that semantic kernels are especially helpful for the generation of longer sentences.

Case Study
We present examples from the WMT14 En→De task to illustrate the impact of semantic kernels in Table 4, including the source sentence, the gold target sentence (Reference), the translation generated by the vanilla Transformer model (Transformer) and the translation given by our model (SKAM). From Table 4, we find that semantic kernels help the Transformer baseline in two ways. Select Words More Appropriately. In the first example, nachdenken is a more appropriate translation of think than the Transformer's Denken. Similarly, in the second example, the Transformer mistranslates lower into unten (bottom). We conjecture that the semantic kernels help our model focus on meanings rather than word forms.
Capture Source Semantics More Comprehensively. In the first example, the sentence piece So I want you is missed by the Transformer, while SKAM successfully captures this meaning. The same happens in the second example, where Bottom line is that is missing from the Transformer output. This implies that SKAM is particularly helpful for the generation of longer and harder sentences. However, SKAM still shows some limitations: in the first example, the meaning daher (so) is missing from SKAM's output. More cases can be found in Appendix A.

Conclusion
Following Functional Equivalence Theory, we propose Semantic Kernels with Adaptive Mask (SKAM), which extracts several semantic kernels and projects them into the target embedding space to guide translation. We propose an adaptive mask mechanism to enable each decoding step to access target-side global information. Empirical results reveal that SKAM is both expressive in semantics and efficient in time.
Our way of representing kernel sentences in NMT is intuitive and simple.In future work, we would like to explore better methods to capture sentence semantics.

Limitations
As we tentatively give a successful implementation of leveraging Functional Equivalence Theory in a neural machine translation framework, this paradigm deserves further and more detailed exploration. First, our representation of semantic kernels is quite intuitive and simple; how to align semantics between source and target languages remains a challenging problem that is still in its fledgling stage. In addition, while extensive experiments demonstrate that SKAM consistently improves translation quality, applying our approach to other language generation tasks would evaluate the effectiveness of our work in a more general way.

A More Analysis on Norm-based Significance Score

To give a better view of what kinds of words are selected by the Norm-based Significance Score and how these words affect the translation process, we sample more sentences from the WMT14 En→De benchmark and present them in Table 6.
A.1 Words Selected by Norm-based Significance Score

In Table 6, we show the words selected by the Norm-based Significance Score as "Keywords". As shown, our Norm-based Significance Score tends to select content words from the source sentences. Though some prepositions and conjunctions are wrongly selected, most of the selected words are content words.

A.2 Impact of Semantic Kernels
From Table 6, we can tell that before applying semantic kernels, some colored sentence pieces are not covered in the translation results, while after applying semantic kernels, the translation results are more complete. Also, in the first two cases, applying semantic kernels further helps our model translate words more accurately. From these results, it is clear that semantic kernels help the Transformer model obtain a more comprehensive view of the source sentence.

B More Results on WMT Benchmarks
We also report results on the WMT19, WMT20 and WMT21 En→De newstest benchmarks. We build the SKAM model upon the Transformer Big baseline; the model is trained on 282M bilingual sentence pairs, the combination of all parallel data released by WMT21. All words are split into subword units with 40k merge operations. The model is trained on 8 V100 (16G) GPUs with a batch size of 48k tokens in total (3000 × 8 × 2). We train for 10 epochs and average the last 5 checkpoints. The results are reported in Table 5. SKAM outperforms the Transformer baseline by 0.86 BLEU on average across these 3 benchmarks.

Figure 1 :
Figure 1: Comparison among methods with target-side global information. "Red", "blue" and "purple" indicate the source space, target space and semantic space, respectively. In (b), "D" denotes the draft generated by the first decoder. In (c), "Net" denotes the inference model in the semantic-based model and "S" is the semantic embedding. (d) shows our SKAM model, where "K S" and "K T" represent the source and target semantic kernels, respectively, and "Proj" is our projector.

Figure 3 :
Figure 3: An illustration of our adaptive mask mechanism. The white "M" indicates the maximum attention score at the current time step. After one semantic kernel receives the highest attention score, we mask it in the subsequent decoding steps.
(2019); Zheng et al. (2020); Ai and Fang (2021), respectively. Numbers marked * are from our implementation. "Params" denotes the number of model parameters for En→De. "Time Ratio" is the ratio of inference time between each model and the Transformer baseline.

Figure 4 :
Figure 4: BLEU scores according to sentence length. Results are on WMT14 En→De. The longer the sentence, the larger the margin by which SKAM outperforms the Transformer baseline.

Figure 5 :
Figure 5: Test of different norm thresholds γ on IWSLT14 En→De. γ = 0 means that all source words are treated as semantic kernels, while γ = 1 indicates that no semantic kernels are selected at all.

Table 1 :
Results on WMT14 En→De, WMT14 En→Fr and NIST Zh→En translation tasks. Results marked +, ‡, † are from Wang et al.

Table 2 :
Results on IWSLT14 En↔De translation tasks and the ablation study. Avg. ∆ is the gap between each model setting and SKAM. "w/o s(·)" means the semantic kernels are selected randomly from the source sentences.

Table 3 :
Test of our N-gram smoothing supervision. The experiments are conducted on IWSLT14 En→De. N = 0 means no supervision is applied to the Projector module.

Table 4 :
Translation examples extracted from the WMT14 En→De task. "Keywords" denotes the words selected by our Norm-based Significance Score. The same color across different sentences refers to the same aligned sentence piece.

Table 5 :
Results on the WMT19, WMT20, WMT21 En→De newstest benchmarks.