Enriching Non-Autoregressive Transformer with Syntactic and Semantic Structures for Neural Machine Translation

Non-autoregressive models have boosted the efficiency of neural machine translation through parallelized decoding, at the cost of effectiveness when compared with their autoregressive counterparts. In this paper, we claim that the syntactic and semantic structures of natural language are critical for non-autoregressive machine translation and can further improve performance. However, these structures are rarely considered in existing non-autoregressive models. Motivated by this intuition, we propose to incorporate the explicit syntactic and semantic structure of languages into a non-autoregressive Transformer for neural machine translation. Moreover, we also consider an intermediate latent alignment within target sentences to better learn long-term token dependencies. Experimental results on two real-world datasets (i.e., WMT14 En-De and WMT16 En-Ro) show that our model achieves a significantly faster speed while maintaining translation quality compared with several state-of-the-art non-autoregressive models.


Introduction
Recently, non-autoregressive models (Gu et al., 2018), which aim to enable the parallel generation of output tokens without sacrificing translation quality, have attracted much attention. Although non-autoregressive models have considerably sped up the inference process for real-time neural machine translation (NMT) (Gu et al., 2018), their performance is considerably worse than that of their autoregressive counterparts. Most previous works attribute the poor performance to the inevitable conditional independence when predicting target tokens, and many variants have been proposed to address it. For example, several techniques have been investigated to mitigate the trade-off between speedup and performance, including iterative refinement (Lee et al., 2018), insertion-based models, latent-variable-based models (Kaiser et al., 2018; Shu et al., 2020), CTC models (Libovický and Helcl, 2018; Saharia et al., 2020), alternative loss functions (Wei et al., 2019; Shao et al., 2020), and masked language models (Ghazvininejad et al., 2019, 2020). Although these works have narrowed the performance gap between autoregressive and non-autoregressive models and achieved improvements on machine translation, non-autoregressive models still suffer from syntactic and semantic limitations. That is, their translations tend to contain incoherent phrases (e.g., repetitive words), and some informative tokens on the source side are omitted. This is because in non-autoregressive models, each token in the target sentence is generated independently. Consequently, the multimodality issue arises: non-autoregressive models cannot properly model the multimodal distribution of target sequences (Gu et al., 2018).
One key observation for mitigating syntactic and semantic errors is that source and target sentences follow similar structures, which can be reflected in Part-Of-Speech (POS) tags and Named Entity Recognition (NER) labels. Briefly, POS tagging, which assigns category labels to words by considering the long-distance structure of sentences, can help the model learn syntactic structure and avoid generating repetitive words. Likewise, NER, which identifies the proper nouns and entity mentions in sentences, naturally helps the model recognize meaningful semantic tokens that may improve translation quality. This observation motivates us to leverage the syntactic as well as semantic structures of natural language to improve the performance of non-autoregressive models.

In this paper, we propose an end-to-end Syntactic and semantic structure-aware Non-Autoregressive Transformer model (SNAT) for NMT. We take the structure labels and words as inputs to the model. With the guidance of this extra structural information, the model greatly mitigates the negative impact of the multimodality issue. The core contributions of this paper are: 1) a syntax and semantic structure-aware Transformer which takes sequential text and structural labels as input and generates words conditioned on the predicted structural labels, and 2) an intermediate alignment regularization which aligns an intermediate decoder layer with the target to capture coarse target-side information. We conduct experiments on four benchmark tasks over two datasets, WMT14 En→De and WMT16 En→Ro. Experimental results indicate that our proposed method achieves competitive results compared with existing state-of-the-art non-autoregressive and autoregressive neural machine translation models, while significantly reducing decoding time.

Background
Despite its convenience and effectiveness, autoregressive decoding suffers two major drawbacks. One is that it cannot generate multiple tokens simultaneously, leading to inefficient use of parallel hardware such as GPUs. The other is that beam search has been found to output low-quality translations with large beam sizes and to deteriorate when applied to larger search spaces. The non-autoregressive Transformer (NAT) can potentially address these issues. In particular, NAT models aim to speed up decoding by removing the sequential dependencies within the target sentence and generating all target tokens in one pass:

p(y | x) = ∏_{t=1}^{T} p(y_t | x̂, x),

where x̂ = {x̂_1, . . . , x̂_m} is the copied source sentence. Since the conditional dependencies within the target sentence (y_t depends on y_{<t}) are removed from the decoder input, the decoder is unable to leverage the inherent sentence structure for prediction. Hence the decoder must infer such target-side information by itself from the source-side information during training. This is a much more challenging task than the autoregressive counterpart. From our investigation, we find that NAT models fail to handle target sentence generation well: they often generate repetitive and semantically incoherent sentences with missing words. Therefore, strong conditional signals should be introduced as the decoder input to help the model better learn the internal dependencies within a sentence.
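To make the dependency difference concrete, the two factorizations can be contrasted with a toy sketch. This is not the paper's model: `toy_model` is an invented stand-in that greedily avoids repeating words it can see in the prefix, which is enough to show why removing the prefix produces repetition.

```python
def autoregressive_decode(source, model, length):
    """Sequential decoding: token y_t conditions on the prefix y_<t."""
    target = []
    for _ in range(length):
        target.append(model(source, tuple(target)))
    return target

def non_autoregressive_decode(source, model, length):
    """One-pass parallel decoding: every y_t sees only the source."""
    return [model(source, ()) for _ in range(length)]

# A toy "model" that greedily avoids repeating words already in the prefix.
VOCAB = ["the", "cat", "sat"]
def toy_model(source, prefix):
    for word in VOCAB:
        if word not in prefix:
            return word
    return VOCAB[-1]

print(autoregressive_decode("src", toy_model, 3))      # ['the', 'cat', 'sat']
print(non_autoregressive_decode("src", toy_model, 3))  # ['the', 'the', 'the']
```

The non-autoregressive variant emits the same token three times because no position can see what the others chose, mirroring the repetition failure discussed above.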

Methodology
In this section, we present our model SNAT, which incorporates syntactic and semantic structure information into a NAT model together with an intermediate latent-space alignment within the target. Figure 1 gives an overview of the network structure of our proposed SNAT. In SNAT, the input sequence is segmented into subwords by a byte-pair encoding tokenizer (Sennrich et al., 2016). In parallel, the words in the input sequence are passed to POS and NER annotators to extract explicit syntactic and semantic structures, and the corresponding embeddings are aggregated by a linear layer to form the final syntax and semantic structure-aware embedding. The SNAT model copies the structured encoder input as the decoder input and generates the translated sentences and labels. One of the most important properties of SNAT is that it naturally introduces syntactic and semantic information by taking structure-aware information as input and generating both words and labels. More precisely, given a source sentence x and its label sequence L_x, the conditional probability of a target translation y and its label sequence L_y is:

p(y, L_y | x, L_x) = ∏_{t=1}^{T} p(y_t, L_{y_t} | x̂, L̂_x, x, L_x),

where x and L_x are first fed into the encoder of the SNAT model, and x̂ and L̂_x with length m are syntactic and semantic structure-aware copies of the words and labels from the encoder inputs, respectively. We give the details in the following sections.

Syntactic and Semantic Labeling
We use POS tagging and NER to introduce the syntactic and semantic information present in natural language, respectively. During data pre-processing, each sentence is annotated into a label sequence using an open-source pre-trained annotator.
In particular, we adopt the Treebank style (Marcus et al., 1999) for POS and the PropBank style (Palmer et al., 2005) for NER to annotate every token of the input sequence. Given a specific sentence, the annotation yields predicate-argument structures. Since the input sequence is segmented into subword units using byte-pair encoding (Sennrich et al., 2016), we assign the same label to all subwords tokenized from the same word. As shown in Figure 1, the word "Ancelotti" is tokenized as "An@@" and "Celotti".
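The label-propagation step can be sketched minimally as follows. The `toy_bpe` segmenter below is a hypothetical stand-in for the actual BPE tokenizer; only the propagation logic reflects the description above.

```python
def propagate_labels(words, labels, segment):
    """Copy each word's POS/NER label onto all of its subword pieces."""
    sub_tokens, sub_labels = [], []
    for word, label in zip(words, labels):
        pieces = segment(word)
        sub_tokens.extend(pieces)
        sub_labels.extend([label] * len(pieces))
    return sub_tokens, sub_labels

def toy_bpe(word):
    """Toy segmenter: split long words, marking continuation with '@@'."""
    if len(word) > 6:
        return [word[:2] + "@@", word[2:]]
    return [word]

tokens, tags = propagate_labels(["Ancelotti", "smiled"], ["PERSON", "O"], toy_bpe)
# tokens -> ['An@@', 'celotti', 'smiled'];  tags -> ['PERSON', 'PERSON', 'O']
```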

Encoder
Following the Transformer (Vaswani et al., 2017), we use a stack of 6 identical Transformer blocks as the encoder. In addition to the word embedding and position embedding in the traditional Transformer, we add a structure-aware label embedding. The input to the encoder block is the sum of the normalized word, label (NER and POS), and position embeddings, and it is encoded into contextual representations through the Transformer blocks. For each layer, the representation H^l = [h^l_1, . . . , h^l_n] is computed by the l-th Transformer block as H^l = Transformer_l(H^{l-1}), l ∈ {1, 2, . . . , 6}. In each Transformer block, multiple self-attention heads are used to aggregate the output vectors of the previous layer. A general attention mechanism can be formulated as the weighted sum of the value vectors V using the query vectors Q and the key vectors K:

Attention(Q, K, V) = softmax(QK^T / √d_model) V,

where d_model is the dimension of the hidden representations. For self-attention, Q, K, and V are mappings of the previous hidden representations by different linear functions. Finally, the encoder produces the contextual representation H^6 = [h^6_1, . . . , h^6_n] from the last Transformer block.
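The attention computation above can be sketched in plain Python, for a single head and without the learned linear projections; this is a pedagogical version, not the training code.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, d_model):
    """softmax(Q K^T / sqrt(d_model)) V, with Q, K, V as lists of row vectors."""
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_model)
                  for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

With Q = K (as in self-attention before projection), each position attends most strongly to itself, and each output row is a convex combination of the value rows.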

Decoder
The decoder also consists of 6 identical Transformer blocks, but with several key differences from the encoder. More concretely, we denote the contextual representations in the i-th decoder layer by Z^i; the decoder input is produced by summing the word and label (NER and POS) embeddings copied from the encoder input with the positional embedding.
For the target-side input [x̂, L̂_x], most previous works simply copy part of the source sentence at the length ratio n/m, where n is the source length and m is the target length, as the decoder input. More concretely, the decoder input at the i-th position is simply a copy of the (n/m × i)-th contextual representation from the encoder, i.e., x_{(n/m)·i}. From our investigation, in most cases the gap between source length and target length is relatively small (e.g., 2). Therefore, this scheme deletes or duplicates copies of the last few tokens of the source. If the last token is meaningful, the deletion discards important information; otherwise, if the last token is trivial, multiple copies add noise to the model. Instead, we propose a syntactic and semantic structure-aware mapping method that considers the POS and NER labels when constructing the decoder inputs. Our model first picks out the informative words, namely those with NOUN and VERB POS tags and those recognized as entities by the NER module. If the source is longer than the target, we retain all informative words and randomly delete the remaining words. Conversely, if the source is shorter than the target, we retain all words and randomly duplicate the informative words. The corresponding label of a word is deleted or preserved along with it. Moreover, by copying structurally similar words from the source, this mapping provides more information to the target input than plainly copying source tokens, which can differ greatly from the target tokens. The POS and NER labels of the structure-aware copied words are also copied as the decoder input. With this structure-aware mapping, we obtain [x̂, L̂_x] as the decoder input.
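The structure-aware mapping can be sketched as below. The tag set in `INFORMATIVE` and the use of a seeded random generator are our illustrative assumptions, not the paper's exact implementation.

```python
import random

INFORMATIVE = {"NOUN", "VERB", "PERSON", "ORG", "LOC"}  # assumed informative tags

def structure_aware_copy(tokens, labels, target_len, seed=0):
    """Copy the source to a decoder input of length `target_len`,
    always keeping informative tokens (nouns, verbs, entities):
    length is reduced by randomly dropping filler tokens and
    increased by randomly duplicating informative tokens."""
    rng = random.Random(seed)
    idx = list(range(len(tokens)))
    info = [i for i in idx if labels[i] in INFORMATIVE]
    filler = [i for i in idx if labels[i] not in INFORMATIVE]
    if len(idx) > target_len:
        n_drop = len(idx) - target_len
        dropped = set(rng.sample(filler, min(n_drop, len(filler))))
        idx = [i for i in idx if i not in dropped][:target_len]
    elif len(idx) < target_len and info:
        extra = [rng.choice(info) for _ in range(target_len - len(idx))]
        idx = sorted(idx + extra)
    return [tokens[i] for i in idx], [labels[i] for i in idx]

src = ["the", "cat", "sat", "on", "mat"]
tags = ["DET", "NOUN", "VERB", "ADP", "NOUN"]
print(structure_aware_copy(src, tags, 3))
# (['cat', 'sat', 'mat'], ['NOUN', 'VERB', 'NOUN'])
```

Shrinking to length 3 drops the filler tokens "the" and "on" while keeping the nouns and the verb, which is exactly the behavior motivated above.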
For positional attention, which aims to learn the local word order within the sentence (Gu et al., 2018), we set the positional embedding (Vaswani et al., 2017) as both Q and K, and the hidden representations of the previous layer as V.
For inter-attention, Q refers to the hidden representations of the previous layer, whereas K and V are the contextual vectors H^6 from the encoder. We modify the attention mask so that it does not mask out future tokens, and every token attends to both its preceding and succeeding tokens in every layer; hence the generation of each token uses bi-directional attention. A position-wise Feed-Forward Network (FFN) is applied after multi-head attention in both the encoder and the decoder. It consists of two fully-connected layers and layer normalization (Ba et al., 2016). The FFN takes Z^6 as input and computes the final representation Z^f, which is used to predict the whole target sentence and its labels:

p(y | x̂, L̂_x, x, L_x) = softmax(f(Z^f) W_w^T),
q(L_y | x̂, L̂_x, x, L_x) = softmax(f(Z^f) W_l^T),

where f is a GeLU activation function (Hendrycks and Gimpel, 2016), and W_w and W_l are the token embedding and structural label embedding matrices from the input representation, respectively. We use different FFNs for POS and NER labels; to avoid redundancy, we use q(L_y | x̂, L̂_x, x, L_x) to denote both predicted label likelihoods.
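The prediction head can be sketched as follows. Layer normalization and biases are omitted, and the weight shapes are illustrative rather than the paper's exact parameterization; the point is the GeLU FFN followed by a dot product with the embedding matrix rows.

```python
import math

def gelu(x):
    """Tanh approximation of GeLU (Hendrycks and Gimpel, 2016)."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))

def predict_logits(z, W_hidden, W_out, W_emb):
    """Position-wise FFN (GeLU between two linear maps, biases omitted)
    followed by a dot product with each row of the embedding matrix,
    yielding one logit per vocabulary entry."""
    hidden = [gelu(sum(zi * w for zi, w in zip(z, col))) for col in W_hidden]
    final = [sum(hi * w for hi, w in zip(hidden, col)) for col in W_out]
    return [sum(fi * e for fi, e in zip(final, emb)) for emb in W_emb]
```

Applying a softmax over the returned logits gives the word (or label) distribution at one position, as in the equations above.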

Training
We use (x, L_x, y*, L*_y) to denote a training instance. To introduce label information, our proposed SNAT contains a discrete sequential latent variable L_{y,1:m} with conditional posterior distribution p(L_{y,1:m} | x̂, L̂_x, x, L_x; φ). It can be approximated using a proposal distribution q(L_y | x̂, L̂_x, x, L_x). The approximation also provides a variational bound for the maximum log-likelihood. Note that the resulting likelihood function, consisting of the two bracketed terms in Eq. (6), allows us to train the entire model in a supervised fashion. The inferred labels can be used to train the label prediction model q and simultaneously supervise the structure-aware word model p. The label loss L_label is the cross-entropy H between L*_{y_t} and the predicted distribution of Eq. (5):

L_label = Σ_{t=1}^{m} H(L*_{y_t}, q(L_{y_t} | x̂, L̂_x, x, L_x)).

The structure-aware word likelihood is conditioned on the predicted label. Since Eq. (4) does not depend on the predicted label, we introduce a structure-aware word mask M_wl ∈ R^{|V_word| × |V_label|}, where |V_word| and |V_label| are the vocabulary sizes of words and labels, respectively. The mask M_wl is obtained at the pre-processing stage from A, the open-source pre-trained POS or NER annotator mentioned above: an entry equals 1 if the annotator allows the word to carry the label, and otherwise equals a small penalty constant in (0, 1), which is tuned in our experiments. The mask thus penalizes the case where a word y_i does not belong to a label; for example, the word "great" does not belong to VERB. The structure-aware word likelihood is reformulated with this mask, and the structure-aware word loss L_word is defined as the cross-entropy between the true p(y*_t | L*_{y_t}) and the predicted p(y_t | L_{y_t}, x̂, L̂_x, x, L_x; φ), where p(y*_t | L*_{y_t}) ∈ R^{|V_word| × |V_label|} is a matrix whose entry at index (y*_t, L*_{y_t}) equals 1 and all other entries equal 0. We reshape p(y*_t | L*_{y_t}) and p(y_t | L_{y_t}) into vectors when computing the loss.
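The mask construction and its effect on the word distribution can be sketched as below. Here `annotate` is a stand-in for the pretrained annotator A, the toy vocabularies are invented, and the penalty value is illustrative.

```python
PENALTY = 0.1  # small constant in (0, 1) for word-label mismatches (tuned)

def build_mask(word_vocab, label_vocab, annotate):
    """M[w][l] = 1 if the annotator allows label l for word w, else PENALTY."""
    return {w: {l: (1.0 if l in annotate(w) else PENALTY) for l in label_vocab}
            for w in word_vocab}

def masked_word_probs(word_probs, mask, label):
    """Down-weight words incompatible with the predicted label, renormalize."""
    scored = {w: p * mask[w][label] for w, p in word_probs.items()}
    z = sum(scored.values())
    return {w: s / z for w, s in scored.items()}

# Toy annotator: "great" is only an adjective, "run" can be a verb or a noun.
annotate = {"great": {"ADJ"}, "run": {"VERB", "NOUN"}}.get
mask = build_mask(["great", "run"], ["ADJ", "VERB", "NOUN"], annotate)
probs = masked_word_probs({"great": 0.5, "run": 0.5}, mask, "VERB")
# "run" now dominates: probs["run"] ≈ 0.91, probs["great"] ≈ 0.09
```

Given the predicted label VERB, the mask pushes probability mass away from "great" and toward "run", which is the intended penalization behavior.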
Intermediate Alignment Regularization
One main problem of NAT is that the decoder generation process does not depend on previously generated tokens. Owing to the bidirectional nature of the SNAT decoder, each token can depend on every token of the decoder input. However, since the decoder input [x̂, L̂_x] is a duplicate of the encoder input [x, L_x], generation depends on the encoder tokens rather than on the target y*.
To solve this problem, we align the output of an intermediate layer of the decoder with the target. The alignment makes the generation in subsequent layers depend on coarse target-side information rather than on the encoder input alone. This idea is inspired by Guo et al. (2019), who directly feed target-side tokens as decoder inputs by linearly mapping source token embeddings to target embeddings. However, a single FFN layer can hardly map different languages into the same space well. Thus, instead of aligning the decoder input with the target, we align an intermediate decoder layer with the target. In this way, our model avoids additional training parameters and trains the alignment together with the SNAT model in an end-to-end fashion. Formally, we define the intermediate alignment regularization L_align as the cross-entropy loss between the predicted words and the true words:

L_align = Σ_{t=1}^{m} H(y*_t, softmax(f(Z^mid_t) W_w^T)),

where Z^mid (1 < mid < 6) is the output of the chosen intermediate layer. Consequently, the final loss of SNAT, with coefficient λ, is:

L = L_word + L_label + λ L_align.

Experiment
In this section, we conduct experiments to evaluate the effectiveness and efficiency of our proposed model, with comprehensive analysis. (Notes for Table 2: RNN-based results are from Wu et al. (2016); CNN-based results are from Gehring et al. (2017). † The Transformer (Vaswani et al., 2017) results are based on our own reproduction.)

Experimental Setup
Data We evaluate SNAT on both the WMT14 En-De (around 4.5M sentence pairs) and WMT16 En-Ro (around 610k sentence pairs) parallel corpora. We use the processed data from Ghazvininejad et al. (2019) to be consistent with previous publications. The datasets are processed with the Moses script (Hoang and Koehn, 2008), and the words are segmented into subword units using byte-pair encoding (Sennrich et al., 2016).

Hyperparameters We adopt the base configuration of the Transformer (Vaswani et al., 2017) for all models: 6 layers per stack, 8 attention heads per layer, 512 model dimensions, and 2,048 hidden dimensions. The dimension of the POS and NER embeddings is set to 512, the same as the word embedding dimension. The autoregressive and non-autoregressive models share the same encoder-decoder structure, except for the decoder attention mask and the decoder input of the non-autoregressive model, as described in Sec. 3. We try values for the label mismatch penalty from {0.01, 0.1, 0.5} and find that 0.1 gives the best performance. The coefficient λ is tested with values from {0.25, 0.5, 0.75, 1}, and λ = 0.75 outperforms the other settings. We choose the initial learning rate from {8e-6, 1e-5, 2e-5, 3e-5}, with a warm-up rate of 0.1 and L2 weight decay of 0.01. Sentences are tokenized and the maximum number of tokens in each step is set to 8,000. The maximum number of iterations is set to 30,000, and we train the model with early stopping.
Baselines We choose the following models as baselines. NAT (Gu et al., 2018) is the vanilla non-autoregressive Transformer model for NMT. iNAT (Lee et al., 2018) extends the vanilla NAT model by iteratively reading and refining the translation; the number of iterations is set to 10 for decoding. Hint-NAT (Li et al., 2020) utilizes the intermediate hidden states from an autoregressive teacher to improve the NAT model. FlowSeq (Ma et al., 2019) adopts normalizing flows (Kingma and Dhariwal, 2018) as latent variables for generation. ENAT (Guo et al., 2019) proposes two ways to enhance the decoder inputs of NAT models: the first leverages a phrase table to translate source tokens into target tokens (ENAT-P), and the second transforms source-side word embeddings into target-side word embeddings (ENAT-E). DCRF-NAT designs an approximation of CRF for NAT models and further uses a dynamic transition technique to model positional context in the CRF. NAR-MT (Zhou and Keung, 2020) uses a large amount of source text from monolingual corpora to generate additional teacher outputs for NAR-MT training. AXE CMLM (Ghazvininejad et al., 2020) trains conditional masked language models using a differentiable dynamic program to assign loss based on the best possible monotonic alignment between target tokens and model predictions.

Training and Inference Details
To obtain the part-of-speech and named entity labels, we use the industrial-strength spaCy toolkit to annotate the English, German, and Romanian input.

Knowledge Distillation Following previous works on non-autoregressive translation (Gu et al., 2018; Shu et al., 2020), we use sequence-level knowledge distillation by training SNAT on translations generated by a standard left-to-right Transformer model (i.e., Transformer-large for WMT14 En→De and Transformer-base for WMT16 En→Ro). Specifically, we use the scaling NMT setup (Ott et al., 2018) as the teacher model. We report the performance of the standard autoregressive Transformer trained on distilled data for WMT14 En→De and WMT16 En→Ro. We average the last 5 checkpoints to obtain the final model. We train the model with cross-entropy loss and label smoothing (ε = 0.1).
Inference During training, we do not need to predict the target length m since the target sentence is given. During inference, we use a simple method to select the target length for SNAT (Li et al., 2020). First, we set the target length to m = n + C, where n is the length of the source sentence and C is a constant bias term estimated from the overall length statistics of the training data. Then, we create a list of candidate target lengths within the range [m − B, m + B], where B is the half-width of the interval. Finally, the model picks the best one among the 2B + 1 generated candidate sentences. In our experiments, we set the constant bias term C to 2 for WMT14 En→De, -2 for WMT14 De→En, 3 for WMT16 En→Ro, and -3 for WMT16 Ro→En according to the average lengths of the different languages in the training sets. We set B to 4 or 9, obtaining 9 or 19 candidate translations per sentence, respectively. We then employ an autoregressive teacher model to rescore these candidates.
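The length-candidate construction can be sketched as follows, with C and B taken from the settings above; the clamp to a minimum length of 1 is our own defensive assumption.

```python
def candidate_lengths(src_len, C, B):
    """Candidate target lengths: the range [m - B, m + B] around m = src_len + C."""
    m = src_len + C
    return [max(length, 1) for length in range(m - B, m + B + 1)]

# En->De with C = 2 and B = 4: a 20-token source yields 2B + 1 = 9 candidates.
print(candidate_lengths(20, C=2, B=4))  # [18, 19, 20, 21, 22, 23, 24, 25, 26]
```

With B = 9 the same source would yield 19 candidates, matching the 9-or-19 setting used for rescoring.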

Results and Analysis
Experimental results are shown in Table 2. We first compare the proposed method against its autoregressive counterparts in terms of translation quality, measured by BLEU (Papineni et al., 2002). For all our tasks, we obtain results comparable with the Transformer, the state-of-the-art autoregressive model. Our best model achieves 27.50 (+0.02 over the Transformer), 30.82 (-0.46 against the Transformer), 35.19 (+0.82), and 33.98 (+0.16) BLEU on WMT14 En↔De and WMT16 En↔Ro, respectively. More importantly, our SNAT decodes much faster than the Transformer, a substantial improvement in the speed-accuracy trade-off between AT and NAT models.
Comparing our models with other NAT models, we observe that the best SNAT model achieves a significant performance boost over NAT, iNAT, Hint-NAT, FlowSeq, ENAT, NAR-MT, and AXE CMLM by +8.33, +5.96, +6.39, +6.05, +3.22, +3.93, and +3.97 BLEU on WMT14 En→De, respectively. This indicates that incorporating syntactic and semantic structure greatly reduces the impact of the multimodality problem and thus narrows the performance gap between Autoregressive Transformer (AT) and Non-Autoregressive Transformer (NAT) models. In addition, we see gains of +0.69, +0.78, +0.68, and +0.52 BLEU over the best baselines on WMT14 En→De, WMT14 De→En, WMT16 En→Ro, and WMT16 Ro→En, respectively.
From the results of our methods in the last group of Table 2, we find that the rescoring technique substantially improves performance, and that dynamic decoding significantly reduces the time spent on rescoring while further accelerating the decoding process. On En→De, rescoring 9 candidates yields a gain of +2.23 BLEU, and rescoring 19 candidates yields a +2.86 BLEU increment.
Decoding Speed Following previous works (Gu et al., 2018; Lee et al., 2018; Guo et al., 2019), we evaluate the average per-sentence decoding latency on the WMT14 En→De test set with a batch size of 1, on an NVIDIA Titan RTX GPU, for both the Transformer and the NAT models to measure the speedup. Latencies are averaged over five runs. Specifically, we reproduce the Transformer on our machine and copy the reported runtimes of the other models; their speedup ratios are computed against their own Transformer implementations. We consider comparing speedup ratios reasonable because the ratio is largely independent of differences in implementation software or hardware. To clarify, the latency does not include the tagging pre-processing, which is very fast (around 7,000 sentences per second).
We can see from Table 2 that the best SNAT achieves a 9.3× decoding speedup over the Transformer while achieving comparable or even better performance. Compared to other NAT models, SNAT is nearly the fastest in terms of latency (only slightly behind ENAT and Hint-NAT), and is notably faster than DCRF-NAT while performing better.

Ablation Analysis
Effect of Syntactic and Semantic Structure Information We investigate the effect of using syntactic and semantic tags on model performance. Experimental results are shown in Table 3.

Effect of Sentence Length To evaluate the models on different sentence lengths, we conduct experiments on the WMT14 En→De development set and divide the sentence pairs into buckets according to the length of the reference sentences. As shown in Table 5, the column labeled 100 reports the BLEU score of sentences whose reference length is greater than 50 and at most 100. We can see that the performance of the vanilla NAT drops quickly as the sentence length increases from 10 to 50, while the AT model and the proposed SNAT model remain relatively stable across sentence lengths. This result confirms the power of the proposed model in modeling long-term token dependencies.

Conclusion
In this paper, we have proposed SNAT, a novel syntactic and semantic structure-aware non-autoregressive Transformer model for NMT. The proposed model reduces the computational cost of inference while preserving translation quality by incorporating the syntactic and semantic structures of natural language into a non-autoregressive Transformer. In addition, we have designed an intermediate latent alignment regularization within target sentences to better learn long-term token dependencies. Comprehensive experiments and analyses on two real-world datasets (i.e., WMT14 En→De and WMT16 En→Ro) verify the efficiency and effectiveness of our proposed approach.