Sequence-to-Lattice Models for Fast Translation

Non-autoregressive machine translation (NAT) approaches enable fast generation by utilizing parallelizable generative processes. The remaining bottleneck in these models is their decoder layers; unfortunately, unlike in autoregressive models (Kasai et al., 2020), removing decoder layers from NAT models significantly degrades accuracy. This work proposes a sequence-to-lattice model that replaces the decoder with a search lattice. Our approach first constructs a candidate lattice using efficient lookup operations, generates lattice scores from a deep encoder, and finally finds the best path using dynamic programming. Experiments on three machine translation datasets show that our method is faster than past non-autoregressive generation approaches, and more accurate than naively reducing the number of decoder layers.


Introduction
Non-autoregressive (NAT) machine translation (Gu et al., 2017) provides multi-fold speedups over sequential generation by parallelizing computation across positions. NAT models are often compared to autoregressive models with deep architectures. However, autoregressive models themselves also admit structural changes that give large speedups, for instance using shallow decoders (Kasai et al., 2020) or pruning the per-sentence vocabulary (Jean et al., 2014). Unfortunately, porting these structural changes to NAT models significantly hurts accuracy, reducing their benefits.
This work proposes a sequence-to-lattice formulation for translation that yields the benefits of non-autoregressive translation while reducing the practical costs of deep decoder models. The method first constructs a candidate target lattice based on the source sentence using a variant of IBM Model 2 (Brown et al., 1993). Then each lattice edge is scored by an encoder head. Finally, exact inference is performed with the Viterbi algorithm (Forney, 1973).
Experiments on machine translation benchmarks show that this simple approach achieves fast translation without requiring an ensemble of different approaches. Our code, models, and logs are available at https://github.com/harvardnlp/cascaded-generation. Gu et al. (2017) proposed the task of non-autoregressive machine translation, and since then there have been many follow-up works (Lee et al., 2018, inter alia). Among these works, a line of research using structured prediction models is particularly relevant to our approach: Sun et al. (2019) proposed to use a first-order conditional random field (CRF) (Lafferty et al., 2001) to model the dependencies among adjacent target tokens; Deng and Rush (2020) used a cascaded decoding procedure to extend to higher-order CRFs; Su et al. (2021) used BERT to produce the transition scores of a first-order CRF to leverage pretraining. Our approach builds on this line of research but differs in two aspects: first, we use an efficient count-based model to construct the candidate lattice; second, we show that the sequence-to-lattice formulation allows using many fewer lattice scorer layers.

Related Work
Many prior works have considered the problem of reducing the vocabulary for efficient machine translation (Jean et al., 2014; Mi et al., 2016; Shi and Knight, 2017; L'Hostis et al., 2016). The most common approach is based on the intuition that each source word can only be translated into a small set of target words, such that taking their union gives us a reduced vocabulary. In order to find which words each source word translates into, simple statistical models (Dyer et al., 2013) are usually used to get alignments. In this work, we also use a variant of IBM Model 2 to reduce the size of target-side vocabularies, but instead of finding a reduced target vocabulary per sentence, our method finds a reduced target vocabulary per position.
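As a concrete illustration of this shared-reduction idea, the sketch below unions the top-K candidate translations of each source word using an alignment-derived translation table; the function and variable names are illustrative rather than taken from any of the cited systems.

```python
from collections import Counter

def per_sentence_vocab(src_tokens, trans_table, K=50):
    """Union of the top-K candidate translations of each source word.

    trans_table maps a source word to a Counter of co-occurring target words,
    e.g., estimated from FastAlign alignments over the training data.
    """
    vocab = set()
    for w in src_tokens:
        top = trans_table.get(w, Counter()).most_common(K)
        vocab.update(tgt for tgt, _ in top)
    return vocab
```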

Approach
Our method formulates translation as a first-order conditional Markov model. Given a source sentence $x = x_1, \cdots, x_S$, the goal of translation is to produce the best target sentence $y = y_1, \cdots, y_T$ under
$$P_\theta(y \mid x) = \prod_{t=1}^{T} P_\theta(y_t \mid y_{t-1}, t, x).$$
The full set of translations can be compactly represented as a lattice where each edge score corresponds to $P_\theta(y_t \mid y_{t-1}, t, x)$.
Once a lattice is constructed and scored, the best translation can be computed using the Viterbi algorithm (Forney, 1973). However, the number of lattice edges is quadratic in the vocabulary size. In order to make this approach efficient, we need to specify how to: a) generate a tractable lattice, b) score the edges, and c) parallelize the process.
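Concretely, once the lattice has been pruned to $K$ candidates per position and every edge has been scored, the best path follows from a standard Viterbi recursion over the $K \times K$ edge scores at each step. The sketch below is a minimal illustration, assuming the edge scores are stored as a dense (T−1, K, K) array of log-probabilities; it is not the authors' implementation.

```python
import numpy as np

def viterbi_decode(edge_logp, start_logp):
    """Highest-scoring path through a pruned lattice.

    edge_logp:  (T-1, K, K) array; edge_logp[t, i, j] is the log-probability
                of candidate j at position t+1 given candidate i at position t.
    start_logp: (K,) array of log-probabilities for the first position.
    Returns a list of candidate indices, one per position.
    """
    T = edge_logp.shape[0] + 1
    K = start_logp.shape[0]
    score = start_logp.copy()                  # best score of any path ending in each candidate
    backptr = np.zeros((T - 1, K), dtype=int)

    for t in range(T - 1):
        total = score[:, None] + edge_logp[t]  # (K, K): extend every path by every edge
        backptr[t] = total.argmax(axis=0)      # best predecessor of each candidate j
        score = total.max(axis=0)

    path = [int(score.argmax())]               # best final candidate
    for t in reversed(range(T - 1)):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]
```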
Candidate Lattice Construction To generate a candidate lattice, we propose a simple statistical model. We introduce latent alignments $a$ from target to source ($a_t \in \{1, \cdots, S\}$) and factorize the joint distribution of alignments and target, conditioned on the source, as
$$P(a, y \mid x) = \prod_{t=1}^{T} P(y_t \mid x, a_t)\, P(a_t \mid x).$$
We make two simplifying assumptions: we use relative positions, $P(y_t \mid x, a_t) \approx P(y_t \mid x_{a_t}, a_t - t)$, and we ignore the source words in the alignment prior, $P(a_t \mid x) \approx P(a_t \mid S, t)$, yielding
$$P(a, y \mid x) \approx \prod_{t=1}^{T} P(y_t \mid x_{a_t}, a_t - t)\, P(a_t \mid S, t).$$
Marginalizing over the latent variable $a$, we have $P(y \mid x) \approx \prod_{t=1}^{T} P(y_t \mid x, t)$, which implies
$$P(y_t \mid x, t) = \sum_{a_t} P(y_t \mid x_{a_t}, a_t - t)\, P(a_t \mid S, t).$$
Using this equation we can very efficiently generate a position-aware candidate lattice where at each position we keep the top $K$ words with the highest $P(y_t \mid x, t)$ values. This produces a lattice of size $K^2$. To train $P(y_t \mid x_{a_t}, a_t - t)$ and $P(a_t \mid S, t)$, we use count-based MLE (without smoothing) with supervised alignments $a$ estimated using FastAlign (Dyer et al., 2013).
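The count-based tables and the per-position top-K lookup can be implemented with plain dictionaries; the sketch below is a minimal illustration under that assumption (the names are ours), taking word-aligned training pairs as input, e.g., alignments produced by FastAlign.

```python
from collections import Counter, defaultdict

def estimate_tables(aligned_corpus):
    """Count-based MLE for P(y_t | x_{a_t}, a_t - t) and P(a_t | S, t).

    aligned_corpus: iterable of (src_tokens, tgt_tokens, alignment), where
    alignment[t] is the source index a_t aligned to target position t.
    """
    trans = defaultdict(Counter)   # (src word, relative offset) -> target word counts
    prior = defaultdict(Counter)   # (source length S, position t) -> alignment counts
    for src, tgt, align in aligned_corpus:
        S = len(src)
        for t, a_t in enumerate(align):
            trans[(src[a_t], a_t - t)][tgt[t]] += 1
            prior[(S, t)][a_t] += 1

    def normalize(table):
        # Plain MLE, no smoothing.
        return {key: {k: c / sum(counts.values()) for k, c in counts.items()}
                for key, counts in table.items()}

    return normalize(trans), normalize(prior)

def topk_candidates(src, t, trans, prior, K=64):
    """Top-K target words at position t under
    P(y_t | x, t) = sum_{a_t} P(y_t | x_{a_t}, a_t - t) P(a_t | S, t)."""
    S = len(src)
    scores = Counter()
    for a_t, p_a in prior.get((S, t), {}).items():
        for w, p_w in trans.get((src[a_t], a_t - t), {}).items():
            scores[w] += p_w * p_a
    return [w for w, _ in scores.most_common(K)]
```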
Lattice Scoring To combine the best of probabilistic modeling and neural networks, we use a transformer (Vaswani et al., 2017) to parameterize $P_\theta(y_t \mid y_{t-1}, t, x)$. We first encode $x$ into a memory bank using a standard transformer encoder, then we use a single-layer head to produce $P_\theta(y_t \mid y_{t-1}, t, x)$ for all $K$ values of $y_{t-1}$ while attending to the memory bank and the target position. Transformers are well suited to this purpose because at training time we only need to modify the standard autoregressive decoder self-attention masks to learn $P_\theta(y_t \mid y_{t-1}, t, x)$.
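One way to realize this mask modification in PyTorch is sketched below. This is a minimal illustration assuming the usual shifted decoder inputs (the input at position t is $y_{t-1}$); the helper name is ours, and the exact masking used by the authors may differ.

```python
import torch

def markov_self_attention_mask(T, order=1):
    """Boolean mask of shape (T, T); True marks key positions a query may attend to.

    A standard autoregressive decoder uses the full lower-triangular (causal)
    mask. Restricting it to a band of width `order` means the query at
    position t only sees the `order` most recent target inputs; with shifted
    inputs and order=1, that is exactly y_{t-1}, so the head can only model
    P(y_t | y_{t-1}, t, x).
    """
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
    band = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=-(order - 1))
    return causal & band

# Additive float mask usable as tgt_mask in torch.nn.TransformerDecoder.
mask = markov_self_attention_mask(6)
attn_mask = torch.zeros(6, 6).masked_fill(~mask, float("-inf"))
```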
Parallelization A major benefit of NAT is the ability to parallelize the model. In our approach, each neural computation of $P_\theta(y_t \mid y_{t-1}, t, x)$ can be done in parallel using shared encoder representations. The only sequential part of our approach is the Viterbi algorithm, which is not a bottleneck in practice. For long $T$, this approach can be further parallelized to a time complexity of $O(\log T)$ (Särkkä and García-Fernández, 2019; Rush, 2020).
Model Settings We use the same architecture as the baselines, except that we use a single-layer lattice scorer. For candidate lattice construction we only consider the top 40 candidates from $P(y_t \mid x_{a_t}, a_t - t)$ for each $(x_{a_t}, a_t - t)$. For lattice decoding, we use linear regression to predict an approximate length $L$ from $S$. We introduce a padding symbol to allow for variable-length generation and consider lengths $T$ from $L - \Delta L$ to $L + \Delta L$, where $L$ is the predicted length and $\Delta L = 3$.

Results

Analysis
Lattice Construction Various methods can construct a candidate lattice by filtering the top $K$ tokens for each target position $t$. The shared vocab reduction approach of L'Hostis et al. (2016) ignores the position $t$ ($P(y_t \mid x, t) \approx P(y_\cdot \mid x)$). During training, FastAlign is used to estimate $P(y_\cdot \mid x_s)$. During inference, for each source word $x_s$, the top $K$ words maximizing $P(y_\cdot \mid x_s)$ are selected. Alternatively, we can use an NAT baseline model, $P(y \mid x) = \prod_t P(y_t \mid x, t)$, where $P(y_t \mid x, t)$ is parameterized with a six-layer transformer encoder-decoder. This approach has access to the position and a deep decoder, but is much slower. Figure 1 compares different approaches for constructing a candidate lattice. Our method significantly outperforms shared vocab, which produces a single reduced vocabulary on the target side. While our approach underperforms the NAT baseline, it is competitive and much more efficient. While our statistical model is very efficient, using it alone (without lattice scoring and decoding, or equivalently, with $K = 1$) yields a much lower BLEU, as shown in Table 2.
Figure 3 plots BLEU score against the number of lattice scorer layers. We can see that the sequence-to-lattice formulation significantly outperforms the baseline NAT model, and that it enables using many fewer layers, whereas the baseline accuracy quickly degrades as the number of scorer layers decreases. Being able to use fewer lattice scorer layers allows faster inference, as shown in Appendix A.1. The fact that NAT accuracy degrades indicates that structural changes, like those proposed for autoregressive models by Kasai et al. (2020), can hurt NAT models.
Figure 2 demonstrates the importance of Viterbi search with respect to the final model. While almost 15% of words are already ranked highest without lattice decoding, there is a non-negligible percentage of changes due to search. Figure 4 shows the latency breakdown as a function of length. Most time is spent on the encoder since there are six encoder layers. Given the rarity of long sentences, in most cases the other two times are dominated by the encoder.

Latency Analysis
In practice, we use serial Viterbi decoding, which does grow linearly with length but remains fast relative to the other components. Lattice scoring time is parallelizable but also grows with length. Real hardware has limited parallel capacity, creating a bottleneck when sequences become very long. Future work will need to better explore parallel approaches beyond the regime of sentence-level generation.

Conclusion
In this work, we find that using a sequence-to-lattice formulation enables using much smaller model architectures for fast machine translation. Our approach first generates a candidate lattice using a statistical model, then uses a transformer with a position-wise head layer to score the lattice, and finally uses the Viterbi algorithm to find the best hypothesis. Experiments on three machine translation benchmarks show that our simple approach is very fast yet achieves decent accuracy.

A.1 Speedup vs. Number of Lattice Scorer Layers
Table 3 shows the speedup and BLEU as we vary the number of lattice scorer layers. We can see that reducing the number of lattice scorer layers makes inference much faster without hurting the BLEU score much.

A.2 Full Results
We used K = 64 in the main paper and only reported latency/speedup on WMT14 En-De. Full results can be found in Table 4, Table 5, Table 6, Table 7, and Table 9.

A.3 Data Preprocessing
To process the data, we use Byte Pair Encoding (BPE) (Sennrich et al., 2015; Kudo and Richardson, 2018) learned on the training set with a shared vocabulary between source and target. For IWSLT14 the vocabulary size is 10k; for WMT14 the vocabulary size is 40k. For WMT16 we use the processed data provided by Lee et al. (2018).
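As one way to reproduce this preprocessing, the sketch below learns a shared BPE model with the SentencePiece library (Kudo and Richardson, 2018); the file names are placeholders, and the 10k size follows the IWSLT14 setting described above.

```python
import sentencepiece as spm

# Learn a single BPE model on the concatenation of source and target training
# text so that source and target share one vocabulary (file names are placeholders).
spm.SentencePieceTrainer.train(
    input=["train.de", "train.en"],
    model_prefix="iwslt14_shared_bpe",
    vocab_size=10000,
    model_type="bpe",
)

# Apply the learned model to a sentence.
sp = spm.SentencePieceProcessor(model_file="iwslt14_shared_bpe.model")
print(sp.encode("ein kleines Beispiel", out_type=str))
```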

A.4 Optimization Settings
We train our model as a Markov transformer (Deng and Rush, 2020), going from bigrams to trigrams. We use the Adam optimizer (Kingma and Ba, 2014) with $\beta_1 = 0.9$, $\beta_2 = 0.98$, and an inverse square root learning rate decay after linear warmup (Ott et al., 2019). We train with label smoothing of strength 0.1 (Müller et al., 2019). For model selection, we use BLEU score on the validation set, with $K = 64$ and $\Delta L = 3$. Other hyperparameters can be found in Table 8.
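For concreteness, the linear warmup plus inverse-square-root schedule can be written as below; the peak learning rate and warmup length shown here are illustrative defaults, not the values from Table 8.

```python
def inverse_sqrt_lr(step, peak_lr=5e-4, warmup_steps=4000):
    """Linear warmup to peak_lr, then decay proportional to 1/sqrt(step)."""
    step = max(step, 1)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (warmup_steps / step) ** 0.5

# Example: pair the schedule with Adam as described above, e.g.
# torch.optim.Adam(params, lr=inverse_sqrt_lr(1), betas=(0.9, 0.98)).
```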

A.5 Implementation Details
Our implementation is based on FAIRSEQ (Ott et al., 2019) and PyTorch (Paszke et al., 2019), and we use an Nvidia A100 GPU with CUDA version 11.1 to perform inference.