Neighbors Are Not Strangers: Improving Non-Autoregressive Translation under Low-Frequency Lexical Constraints

Lexically constrained translation requires pre-specified terms to appear in the translation output, but current autoregressive approaches to it suffer from high latency. In this paper, we turn to non-autoregressive translation (NAT) for this problem because of its efficiency advantage. We identify that current constrained NAT models, which are based on iterative editing, do not handle low-frequency constraints well. To this end, we propose a plug-in algorithm for this line of work, i.e., Aligned Constrained Training (ACT), which alleviates this problem by familiarizing the model with the source-side context of the constraints. Experiments on general-domain and domain-specific datasets show that our model improves over the backbone constrained NAT model in constraint preservation and translation quality, especially for rare constraints.


Introduction
Despite the success of neural machine translation (NMT) (Bahdanau et al., 2015; Vaswani et al., 2017; Barrault et al., 2020), real applications usually require the precise (if not exact) translation of specific terms. One popular solution is to incorporate dictionaries of pre-defined terminologies as lexical constraints to ensure the correct translation of terms, which has been demonstrated to be effective in many areas such as domain adaptation, interactive translation, etc.
Previous methods for lexically constrained translation are mainly built upon Autoregressive Translation (AT) models, imposing constraints at inference time (Ture et al., 2012; Hokamp and Liu, 2017; Post and Vilar, 2018) or training time (Luong et al., 2015; Ailem et al., 2021). However, such methods either are time-consuming in real-time applications or do not ensure the appearance of constraints in the output. To develop faster MT models for industrial applications, Non-Autoregressive Translation (NAT) has been put forth (Gu et al., 2018; Ghazvininejad et al., 2019; Gu et al., 2019; Qian et al., 2021), which generates tokens in parallel, boosting inference efficiency compared with left-to-right autoregressive decoding.

* Authors contributed equally. † Corresponding author. 1 Our implementation can be found at https://github.com/sted-byte/ACT4NAT.

Table 1: Translation examples of a lexically constrained non-autoregressive translation (NAT) model (Gu et al., 2019) under a low-frequency word as constraint. The underbraced word frequencies (uncased) are calculated from the WMT14 English-German (En-De) translation dataset (Vaswani et al., 2017).
Research on lexically constrained NAT is relatively under-explored. Recent studies (Susanto et al., 2020; Xu and Carpuat, 2021) impose lexical constraints at inference time upon editing-based iterative NAT models, where constraint tokens are set as the initial sequence for further editing. However, such methods are vulnerable when encountering low-frequency words as constraints. As illustrated in Table 1, when translating with a rare constraint, the model is unable to generate the correct context of the term "geschrien", as if it did not understand the constraint at all. This is dangerous since terms in specific domains are usually low-frequency words. We argue that the main reasons behind this problem are 1) the inconsistency between training and constrained inference and 2) the model's unawareness of the source-side context of the constraints.
To solve this problem, we build our algorithm on the idea that the context of a rare constraint tends not to be rare itself, i.e., "a stranger's neighbors are not necessarily strangers", as demonstrated in Table 1. We believe that, when the constraint is aligned to the source text, the context of its source-side counterpart can be leveraged to translate the context of the target-side constraint, even if the constraint itself is rare. Also, a model that is forced to preserve designated constraints at training time should cope better with constraints at inference time.
Driven by these motivations, we propose a plug-in algorithm to improve constrained NAT, namely Aligned Constrained Training (ACT). ACT extends the family of editing-based iterative NAT (Gu et al., 2019; Susanto et al., 2020; Xu and Carpuat, 2021), the current paradigm of constrained NAT. Specifically, ACT is composed of two major components: Constrained Training and Alignment Prompting. The former extends the regular training of iterative NAT by injecting pseudo constraints into the state transitions of imitation learning. The latter incorporates source-side alignment information of the constraints into training and inference, indicating the context of the potentially rare terms.
In summary, this work makes the following contributions:
• We identify and analyse the problems w.r.t. rare lexical constraints in current methods for constrained NAT;
• We propose a plug-in algorithm for current constrained NAT models, i.e., Aligned Constrained Training, to improve translation under rare constraints;
• Experiments show that our approach improves the backbone model w.r.t. constraint preservation and translation quality, especially for rare constraints.

Related Work
Lexically Constrained Translation Existing translation methods impose lexical constraints during either inference or training. At training time, constrained MT approaches include code-switching data augmentation (Dinu et al., 2019; Song et al., 2019) and training with auxiliary tasks such as token- or span-level mask-prediction (Ailem et al., 2021; Lee et al., 2021). At inference time, autoregressive constrained decoding algorithms include utilizing placeholder tags (Luong et al., 2015; Crego et al., 2016), grid beam search (Hokamp and Liu, 2017; Post and Vilar, 2018), and alignment-enhanced decoding (Alkhouli et al., 2018; Song et al., 2020; Chen et al., 2021). For efficiency, recent studies also focus on non-autoregressive constrained translation. Susanto et al. (2020) propose to modify the inference procedure of the Levenshtein Transformer (Gu et al., 2019) by disallowing the deletion of constraint words during iterative editing. Xu and Carpuat (2021) further develop this idea and introduce a reposition operation that can reorder the constraint tokens. Our work absorbs ideas from both lines of work. Building on NAT methods, we bring in alignment information from terminologies to help the model learn contextual information for lexical constraints, especially rare ones.
Non-Autoregressive Translation Although they enjoy a speed advantage, NAT models suffer from performance degradation due to the multi-modality problem, i.e., generating text when multiple translations are plausible. Gu et al. (2018) apply sequence-level knowledge distillation (KD) (Kim and Rush, 2016), which uses an AT model's output as the NAT model's new target, reducing word diversity and reordering complexity in the reference and resulting in fewer modes (Zhou et al., 2020; Xu et al., 2021). Various algorithms have also been proposed to alleviate this problem, including incorporating latent variables (Kaiser et al., 2018; Shu et al., 2020), iterative refinement (Ghazvininejad et al., 2019; Stern et al., 2019; Gu et al., 2019; Guo et al., 2020), advanced training objectives (Du et al., 2021), and gradually learning target-side word inter-dependency via curriculum learning (Qian et al., 2021). Our work extends the family of editing-based iterative NAT models for their flexibility in imposing lexical constraints (Susanto et al., 2020; Xu and Carpuat, 2021).

Non-Autoregressive Translation
Given a source sentence x and a target sentence y = (y_1, ..., y_n), an AT model generates in a left-to-right order, i.e., generating y_t conditioned on x and y_<t. An NAT model (Gu et al., 2018), however, discards the word inter-dependency among output tokens, modeling the conditionally independent probability distribution as:

P(y | x) = ∏_{t=1}^{n} P(y_t | x)

Such factorization features high efficiency at the cost of a performance drop in translation tasks due to the multi-modality problem, i.e., translating in mixed modes and producing token repetition, omission, or incoherence.
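As a minimal illustration of the two factorizations (plain Python lists stand in for model output distributions, and the helper names are ours, not from any real codebase):

```python
def nat_decode(logits):
    """Parallel NAT decoding: under the conditional-independence
    assumption, each position takes the argmax of its own distribution,
    independent of the tokens chosen at other positions."""
    return [max(range(len(p)), key=p.__getitem__) for p in logits]

def at_decode(step_fn, max_len):
    """Contrast: a greedy autoregressive decoder calls the model once
    per position, feeding back the growing prefix (step_fn is a
    stand-in for a real model)."""
    prefix = []
    for _ in range(max_len):
        p = step_fn(prefix)  # distribution over the vocabulary
        prefix.append(max(range(len(p)), key=p.__getitem__))
    return prefix
```

The NAT decoder needs one model call for all positions, while the AT decoder needs one call per position, which is the source of NAT's latency advantage.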

Editing-based Iterative NAT
Iterative refinement by editing is an NAT paradigm that suits constrained translation due to its flexibility. It alleviates the multi-modality problem by being autoregressive across editing iterations of previously generated sequences while remaining non-autoregressive within each iteration. Thus, it achieves better performance than fully non-autoregressive models while being faster than AT models.
Levenshtein Transformer To better illustrate our idea, we use Levenshtein Transformer (LevT, Gu et al., 2019) as the backbone model in this work, which is a representative model for constrained NAT based on iterative editing.
LevT is based on the Transformer architecture (Vaswani et al., 2017), but is more flexible and faster than autoregressive models. It models sentence generation as a Markov Decision Process (MDP) defined by a tuple (Y, A, E, R, y_0). At each decoding iteration, the agent E receives an input y ∈ Y, chooses an action a ∈ A, and receives a reward r. Y is a set of discrete sentences and R is the reward function. y_0 ∈ Y is the initial sentence to be edited.
Each iteration consists of two basic operations, i.e., deletion and insertion, as described in Table 2. For the k-th iteration over the sentence y^k = (<s>, y_1, ..., y_n, </s>), insertion consists of a placeholder classifier and a token classifier, and deletion is performed by a deletion classifier. LevT is trained with imitation learning to insert and delete, letting the agent imitate behaviors drawn from an expert policy:
• Learning to insert: edit towards the reference by inserting tokens into a fragmented sentence (e.g., obtained by random deletion from the reference).
• Learning to delete: delete from the insertion result of the current model towards the reference.
The key idea is to learn how to edit towards the ground truth, starting either from the reference after adding noise or from the output of an adversarial policy. The ground truth of the editing process is derived from the Levenshtein distance (Levenshtein, 1965).
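The expert supervision can be sketched as follows: a simplified, illustrative derivation of insert/delete targets from a common-subsequence alignment (the actual LevT implementation works on the full Levenshtein dynamic program over token ids; all names here are ours):

```python
def lcs_table(a, b):
    # dp[i][j] = length of the longest common subsequence of a[:i], b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a)):
        for j in range(len(b)):
            dp[i + 1][j + 1] = dp[i][j] + 1 if a[i] == b[j] \
                else max(dp[i][j + 1], dp[i + 1][j])
    return dp

def expert_actions(y, y_star):
    """Oracle edits turning y into y_star with only the two LevT
    operations: delete every token of y outside a longest common
    subsequence, then insert the missing y_star tokens into the slots
    between kept tokens (slot 0 precedes the first kept token)."""
    dp = lcs_table(y, y_star)
    i, j, keep = len(y), len(y_star), []
    while i > 0 and j > 0:
        if y[i - 1] == y_star[j - 1] and dp[i][j] == dp[i - 1][j - 1] + 1:
            keep.append((i - 1, j - 1)); i -= 1; j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    keep.reverse()
    kept_y = {i for i, _ in keep}
    delete_mask = [k not in kept_y for k in range(len(y))]
    bounds = [-1] + [j for _, j in keep] + [len(y_star)]
    insertions = [y_star[bounds[s] + 1:bounds[s + 1]]
                  for s in range(len(bounds) - 1)]
    return delete_mask, insertions
```

Here `delete_mask` supervises the deletion classifier and `insertions` supervises the placeholder/token classifiers for each slot.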
Lexically Constrained Inference Lexical constraints can be imposed upon a translation model in two ways: 1) soft constraints, which allow the constraints not to appear in the translation; and 2) hard constraints, which force the constraints to appear. In NAT, constraints are generally incorporated at inference time. Susanto et al. (2020) inject constraints as the initial sequence for iterative editing in the Levenshtein Transformer (LevT, Gu et al., 2019), achieving soft-constrained translation; hard-constrained translation can then be obtained simply by disallowing the deletion of the constraints. Xu and Carpuat (2021) replace the deletion action in LevT with a reposition operation, allowing the reordering of multiple constraints.
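A minimal sketch of this constraint injection (names are ours; real implementations operate on token ids and model state rather than strings):

```python
def init_with_constraints(constraints, bos="<s>", eos="</s>"):
    """Soft-constrained NAT inference starts iterative editing from the
    concatenated constraint tokens instead of an empty sequence.  For
    hard constraints, constraint positions are additionally marked as
    protected so the deletion classifier can never remove them."""
    y0 = [bos] + [tok for c in constraints for tok in c] + [eos]
    protected = [False] + [True] * (len(y0) - 2) + [False]
    return y0, protected
```

The iterative editor then inserts context tokens around (and, under soft constraints, may delete) these initial tokens.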

Motivating Study: Self-Constrained Translation
According to Table 1, constrained NAT models seem to suffer from the low frequency of lexical constraints, which is dangerous as most terms in practice are rare. To further explore the impact of constraint frequency on NAT, we conduct a preliminary analysis on constrained LevT (Susanto et al., 2020). We sort the words in each reference text by frequency, divide them into six buckets by frequency order (as in Figure 1), and sample one word from each bucket as the lexical constraint for translation. 2 We denote these constraints as self-constraints. In this way, we obtain six times the data, and the six samples derived from one raw sample differ only in the lexical constraint. As shown in Figure 1, translation performance generally keeps improving as the self-constraint gets rarer. This is because setting a low-frequency word of a sentence as the constraint, which is often hard to translate, actually lightens the load of an NAT model. However, there are two noticeable performance drops around the relative frequency ranges of 10%-30% (bucket 2) and 90%-100% (bucket 6), denoted as Drop#1 (-0.3 BLEU) and Drop#2 (-0.6 BLEU). Note that Drop#1 is mainly due to the fact that bucket 2 mostly contains unknown tokens (i.e., <UNK>). We leave detailed discussions of the buckets and Drop#1 to Appendix C.
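The bucketing procedure above can be sketched as follows (a simplified illustration with hypothetical names, not the paper's actual code):

```python
def self_constraints(reference, train_freq, n_buckets=6):
    """Rank the distinct words of one reference by training-corpus
    frequency, cut the ranking into n_buckets slices (slice 1 = most
    frequent), and take the first word of each slice as that bucket's
    self-constraint."""
    ranked = sorted(set(reference), key=lambda w: -train_freq.get(w, 0))
    cuts = [round(k * len(ranked) / n_buckets) for k in range(n_buckets + 1)]
    slices = [ranked[cuts[k]:cuts[k + 1]] for k in range(n_buckets)]
    return [s[0] for s in slices if s]
```

Each returned word then yields one constrained copy of the sample, giving up to six variants that differ only in the constraint.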
In this experiment, we are more interested in the reasons for Drop#2, which occurs when constraints are low-frequency words. We assume a trade-off in self-constrained NAT: the model does not have to translate rare words since they are set as the initial sequence (constraints), but it has a hard time understanding the context of a rare constraint due to 1) the rareness itself and 2) the lack of alignment information between target-side constraint tokens and source tokens. Thus, the model does not know how many tokens should be inserted to the left and right of the constraint, consistent with the findings in Table 1.

Proposed Approach
The findings and assumptions discussed above motivate us to propose a plug-in algorithm for lexically constrained NAT models, i.e., Aligned Constrained Training (ACT). ACT is designed based on two major ideas: 1) Constrained Training: bridging the discrepancy between training and constrained inference; 2) Alignment Prompting: helping the model understand the context of the constraints.

Constrained Training
As introduced in §3.2, constraints are typically imposed at inference time in NAT (Susanto et al., 2020; Xu and Carpuat, 2021). Specifically, lexical constraints are imposed by setting the initial sequence y_0 to (<s>, C_1, C_2, ..., C_k, </s>), where C_i = (c_1, c_2, ..., c_l) is the i-th lexical constraint, l is the number of tokens in the i-th constraint, and k is the number of constraints.
However, such mandatory preservation of the constraints is not carried out during training. During imitation learning, random deletion is applied to the ground truth y* to obtain an incomplete sentence y′, producing the data samples from which the expert policy demonstrates how to insert from y′ back to y*. As a result, the model never learns to preserve fixed tokens and organize the translation around them. Such a discrepancy could harm soft-constrained translation in applications.
To solve this problem, we propose a simple but effective Constrained Training (CT) algorithm. We first build pseudo terms from the target by sampling 0-3 words (more tokens after tokenization) from the reference as pre-defined constraints for training. 3 Afterwards, we disallow the deletion of pseudo-term tokens when building the data samples for imitation learning. This encourages the model to edit incomplete sentences containing lexical constraints into complete ones, bridging the gap between training and inference.
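A simplified sketch of CT's data preparation (hypothetical names; the real pipeline operates on BPE token ids inside the imitation-learning roll-in):

```python
import random

def sample_pseudo_constraints(reference, rng, max_words=3):
    """Pick 0-3 reference positions whose words serve as training-time
    pseudo constraints."""
    k = rng.randint(0, min(max_words, len(reference)))
    return set(rng.sample(range(len(reference)), k))

def constrained_noise(reference, protected, rng, drop_prob=0.5):
    """Random deletion that produces the insertion-learning input, but
    never drops a protected (pseudo-constraint) position, so the model
    must learn to rebuild the sentence around fixed tokens."""
    return [tok for i, tok in enumerate(reference)
            if i in protected or rng.random() > drop_prob]
```

The expert insertion targets are then derived from this noised sequence exactly as in unconstrained LevT training, but with the constraint tokens guaranteed to survive.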

Alignment Prompting
As stated in §3.3, we assume that the rareness of constraints hinders the model from inserting proper context tokens (i.e., the model treats a stranger's neighbors as strangers too). To make matters worse, previous research (Ding et al., 2021) has also shown that lexical choice errors on low-frequency words tend to be propagated from the teacher (an AT model) to the student (an NAT model) in knowledge distillation. However, terminologies by nature provide hard alignment information between source and target, which the model can conveniently utilize. Thus, on top of constrained training, we propose an enhanced approach named Aligned Constrained Training (ACT). As illustrated in Figure 2, we directly align the target-side constraints with source words and prompt the alignment information to the model during both training and inference.

Figure 2: An example of alignment prompting (token, positional, and alignment embeddings). The constraint tokens y* are given by users during inference, and can also be sampled from the target sentence during training. Given y*, we align them with tokens x* in the source and build alignment embeddings to be fed into the encoder.
Building Alignment for Constraints We first align source words to the target-side constraints, which are either pseudo constraints during training or actual constraints during inference. For the constraints C_tgt = (C_1, C_2, ..., C_k) of each sentence, we use an external aligner, such as GIZA++ (Brown et al., 1993; Och and Ney, 2003), to find the corresponding source words, denoted as C_src = (C_1, C_2, ..., C_k).
Prompting Alignment into LevT Besides the token embedding and position embedding, the encoder of LevT is further equipped with a learnable alignment embedding derived from C_src and C_tgt. We set the alignment value of each token in C_i to i and that of all other tokens to 0; these values are then encoded into embeddings. The prompting of alignment is not limited to training: we also add such alignment embeddings to the source tokens aligned to target-side constraints during inference.

Table 3: Statistics of the test sets with target-side lexical constraints. "Avg. Len. of Con." denotes the average number of words in a constraint. "Avg. Con. Freq." is the average frequency of lexical constraints calculated with the training vocabularies of the corresponding language.
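The construction of the alignment ids that feed this embedding can be sketched as follows (a simplified illustration assuming each constraint aligns to one contiguous source span; names are ours):

```python
def alignment_ids(src_len, aligned_spans):
    """Alignment id per source position: positions inside the span
    aligned to the i-th target constraint get id i (1-based); all other
    positions get 0.  These ids index a learned embedding that is added
    to the token and positional embeddings in the encoder."""
    ids = [0] * src_len
    for i, (start, end) in enumerate(aligned_spans, start=1):
        for p in range(start, end):  # half-open span [start, end)
            ids[p] = i
    return ids
```

In a framework like fairseq, these ids would be embedded by an extra `nn.Embedding` whose output is summed with the token and positional embeddings.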

Data and Evaluation
Parallel Data and Knowledge Distillation We consider the English→German (En→De) translation task and train all MT models on WMT14 En-De (3,961K sentence pairs), a benchmark translation dataset. All sentences are pre-processed into sub-word units via byte-pair encoding (BPE) (Sennrich et al., 2016). Following the common practice of training NAT models, we use the sentence-level knowledge distillation data generated by a Transformer (Vaswani et al., 2017), provided by Kasai et al. (2020).

Datasets with Lexical Constraints
Given models trained on the above-mentioned training sets, we evaluate them on the test sets of several lexically constrained translation datasets. These test sets are categorized into two types of standard lexically constrained translation datasets: 1) Type#1: tasks from WMT14 (Vaswani et al., 2017) and WMT17 (Bojar et al., 2017), which are of the same general domain (news) as the training sets; 2) Type#2: tasks from OPUS (Tiedemann, 2012), which are of specific domains (medical and law). In particular, real application scenarios of lexically constrained MT are usually domain-specific, and the constrained words in these domain datasets are relatively less frequent and more important. Following previous work (Dinu et al., 2019; Susanto et al., 2020; Xu and Carpuat, 2021), the lexical constraints in Type#1 tasks are extracted from existing terminology databases such as the Interactive Terminology for Europe (IATE) 4 and Wiktionary (WIKT) 5. The OPUS-EMEA (medical domain) and OPUS-JRC (legal domain) datasets in Type#2 come from OPUS; their constraints are extracted by randomly sampling 1 to 3 words from the reference (Post and Vilar, 2018). These constraints are then tokenized with BPE, yielding a larger number of tokens as constraints. Statistics are shown in Table 3, indicating that constraint frequencies in the Type#2 datasets are generally much lower than in the Type#1 ones.
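The Type#1-style extraction can be sketched as a simple dictionary match (illustrative only; the actual pipelines match terms before BPE and handle casing and inflection, and the dictionary entries here are hypothetical):

```python
def extract_term_constraints(source, reference, term_dict):
    """For each dictionary entry (source term -> target term), emit the
    target term as a constraint when the source side occurs in the
    source sentence and the target side occurs in the reference."""
    src = " " + " ".join(source) + " "
    ref = " " + " ".join(reference) + " "
    return [tgt for s, tgt in term_dict.items()
            if " " + s + " " in src and " " + tgt + " " in ref]
```

Requiring a match on both sides ensures the constraint is a plausible translation of something actually present in the source.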

Evaluation Metrics
We use BLEU (Papineni et al., 2002) for estimating the general quality of translation. We also use Term Usage Rate (Term%, Dinu et al., 2019;Susanto et al., 2020;Lee et al., 2021) to evaluate lexically constrained translation, which is the ratio of term constraints appearing in the translated text.
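Term% can be computed in a few lines (a simplified sketch assuming whitespace-tokenized text and exact surface matching; names are ours):

```python
def term_usage_rate(hypotheses, constraints):
    """Term% = percentage of constraint terms that appear verbatim
    (uncased) in the corresponding hypothesis."""
    used = total = 0
    for hyp, cons in zip(hypotheses, constraints):
        text = " " + " ".join(hyp).lower() + " "
        for c in cons:
            total += 1
            used += (" " + c.lower() + " ") in text
    return 100.0 * used / max(total, 1)
```

Under hard-constrained decoding this rate is 100% by construction; under soft constraints it measures how well the model preserves terms on its own.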

Models
We use the Levenshtein Transformer (LevT, Gu et al., 2019) as the backbone model for the ACT algorithm for constrained NAT. We compare our approach with a series of previous MT models that apply lexical constraints:
• Transformer (Vaswani et al., 2017), the AT baseline;
• Dynamic Beam Allocation (DBA) (Post and Vilar, 2018), constrained decoding with dynamic beam allocation over the Transformer;
• Train-by-sep (Dinu et al., 2019), trained on code-switched data augmented by replacing source terms with target constraints or appending target constraints to source terms;
• Constrained LevT (Susanto et al., 2020), which extends LevT (Gu et al., 2019) by setting constraints as the initial editing sequence;
• EDITOR (Xu and Carpuat, 2021), a variant of LevT that replaces the deletion action with a reposition action.

Implementation Details
We use and extend the FairSeq framework (Ott et al., 2019) for training our models. We mostly keep the default parameters of FairSeq, such as d_model = 512, d_hidden = 2,048, n_heads = 8, n_layers = 6, and p_dropout = 0.3. The learning rate is set to 0.0005 and the warmup is set to 4,000 steps. All models are trained with a batch size of 16,000 tokens for a maximum of 300,000 steps with the Adam optimizer (Kingma and Ba, 2014) on 2 NVIDIA GeForce RTX 3090 GPUs with gradient accumulation over 4 batches. The checkpoint for testing is the average of the weights of the last 5 checkpoints. For the Transformer (Vaswani et al., 2017), we use the checkpoint released by Ott et al. (2018).

Main Results

Table 4 reports the performance of LevT with ACT (as well as the CT ablation) on the Type#1 tasks (WIKT and IATE as terminologies), compared with the baselines. In general, the results indicate that the proposed CT/ACT algorithms achieve consistent gains in performance, term coverage, and speed over the backbone model, mainly in the setting of constrained translation. When translating with soft constraints, i.e., when the constraints need not appear in the output, adding ACT to LevT helps preserve the terminology constraints (+∼5 Term%) and improves translation performance (+0.31-0.88 BLEU). If we enforce hard constraints, the term usage rate naturally reaches 100%, with reasonable improvements in BLEU. When translating without constraints, however, adding ACT does not bring improvements as consistent as in the hard- and soft-constrained settings.

As for the ablation between CT and ACT, we make two observations: 1) the term usage rate increases mainly because of CT, and can be further improved by ACT; 2) translation quality (BLEU) increases due to the additional hard alignment of ACT over CT. The former can be attributed to CT's behavior of not deleting the constraints; the latter stems from the introduction of source-side information about the constraints, which familiarizes the model with the constraint context. Table 4 also shows the efficiency advantage of non-autoregressive methods over autoregressive ones, which is widely reported in the NAT literature. The proposed methods do not slow down translation relative to the backbone LevT; when translating with lexical constraints, LevT with CT or ACT is even faster than LevT. In contrast, constrained decoding methods for autoregressive models (i.e., DBA) nearly double the translation latency. Since the main purpose of non-autoregressive research is to develop efficient algorithms, these findings could facilitate industrial usage of constrained translation.
Translation Results on Domain Datasets For a generalized evaluation of our methods, we apply the models trained on the general-domain dataset (WMT14 En-De) to the medical (OPUS-EMEA) and legal (OPUS-JRC) domains. As seen in Table 5, term usage (up to ∼10 Term%) and translation performance (up to 4 BLEU points) increase substantially, more significantly than in the general domain. The reason behind this observation is that the lexical constraints in these datasets are much rarer, so the backbone LevT has a hard time recognizing them as constraints. Therefore, forcing LevT to translate with these rare constraints generates worse text, e.g., BLEU drops by 2.45 points on OPUS-JRC compared with soft constraints; and when translating with soft constraints, LevT over-deletes these rare constraints.
In contrast, the context information around constraints is effectively pinpointed by ACT, so ACT knows the context ("neighbors") of a rare constraint ("stranger") and inserts the translated context around the lexical constraints. In this way, more terms are preserved by ACT, and the translation achieves better results.

Analysis

Self-Constrained Translation Revisited

As a direct response to the motivation of this paper, we revisit the self-constrained NAT study of §3.3 with the proposed ACT algorithm. As before, we build self-constraints from each target sentence and sort them by frequency. As shown in Figure 3(a), unlike constrained LevT, which suffers from Drop#2 (§3.3), ACT handles this scenario well. Following the motivation given in §3.3, when constraints become rarer, ACT successfully breaks the trade-off with a better understanding of the provided contextual information.
What if the self-constraints are sorted by TF-IDF? We also study the importance of different words in a sentence via TF-IDF by forcing them to be constraints. As the results in Figure 3(b) show, the observations are very similar to those for frequency-based self-constraints in Figure 3(a), and the gap between LevT and LevT + ACT is even larger where the TF-IDF score is highest.

How does ACT perform under different kinds of lexical constraints?
The experiments in §6.1 create pseudo lexical constraints by traversing the target-side reference to understand the proposed ACT. In the following analyses, we study different properties of lexical constraints, e.g., frequency and number, and how they affect constrained translation.
Are the improvements by ACT robust against constraints of different frequencies? Given the terminology constraints in the samples, we sort them by (averaged) frequency and evenly divide the corresponding data samples into high-, medium-, and low-frequency categories. 6 The translation quality of each category on the En→De translation tasks is presented in Table 6. We find that LevT benefits most from ACT in the scenario of lower-frequency terms across the three datasets, although in some settings, such as HIGH in WMT14-WIKT and MED in WMT17-WIKT, introducing ACT into constrained LevT brings slight performance drops for higher-frequency terms. Since terms from IATE are rarer than those from WIKT (as in Table 3), the improvements brought by ACT there are steady.
Are the improvements by ACT robust against different numbers of constraints? In more practical settings, the number of constraints is usually more than one. To simulate this, we randomly sample 1-5 words from each reference as lexical constraints; the results are presented in Figure 4. We find that, as the number of constraints grows, translation quality generally improves for LevT both with and without ACT. Moreover, ACT consistently brings extra improvements, indicating its benefit for constrained decoding in NAT.

Limitations
Although the proposed ACT algorithm is effective at improving NAT models on constrained translation, we find that it does not bring much gain in translation quality (i.e., BLEU) over the backbone LevT for unconstrained translation. The results on the full WMT14 En→De test set, shown in Appendix A, further corroborate this finding. Another limitation of our work is that we do not propose a new paradigm for constrained NAT: the purpose of this work is to enhance existing methods for constrained NAT, i.e., editing-based iterative NAT methods, under rare lexical constraints. It would be interesting for future research to explore new ways of imposing lexical constraints on NAT models, perhaps on non-iterative NAT. Note that machine translation in real scenarios still falls behind human performance. Moreover, since we primarily focus on improving constrained NAT, real applications call for refinements in various aspects that we do not consider in this work.

Conclusion
In this work, we propose a plug-in algorithm (ACT) to improve lexically constrained non-autoregressive translation, especially under low-frequency constraints. ACT bridges the gap between training and constrained inference and prompts the context information of the constraints to the constrained NAT model. Experiments show that ACT improves translation quality and term preservation over the backbone NAT model, the Levenshtein Transformer. Further analyses show that the findings are consistent across constraints varying in frequency, TF-IDF, and length. In the future, we will explore the application of this approach to more languages. We also encourage future research to explore new paradigms for constrained NAT beyond editing-based iterative NAT.

A Results on Full Test Set of WMT14 (En→De)
We extend the experiment on the WMT14 En→De task to the full test set (3,003 samples) in Table 7. Following Susanto et al. (2020), we report results on both the filtered test set of sentence pairs that contain at least one target constraint ("Con.", 454 sentences) and the full test set ("Full", 3,003 sentences), which contains samples without lexical constraints. On the full test set, the term usage rate rises from 94.88% to 98.82% when trained with ACT under soft-constrained decoding, but the BLEU score improves only marginally. This is consistent with the experiments in the main body of the paper: LevT with ACT is not significantly better than LevT on unconstrained translation, though our main claim rests on the scenario of constrained NAT.

Table 7: Experiments on the test set of the WMT14 En→De task, which shares the domain of the training set. Following Susanto et al. (2020), "Con." is the subset of WMT14-Full shown in Table 3, where every sample has at least one lexical term as constraint.

B Case Study
The case study of LevT and LevT with ACT is presented in Table 8. In the cases of unconstrained and soft-constrained translation, LevT incorrectly translates low-frequency constraint words (e.g., Hühnerfeiern in case 1). In the case of hard-constrained translation, LevT tends to place more interfering words around the constraint words (e.g., sind in case 1). After incorporating ACT, we witness consistent improvements in the translation of the constraints for LevT, especially for soft-constrained translation, where it successfully translates the given constraints. However, when the translation is not constrained on lexical terms (i.e., unconstrained translation), LevT with ACT still struggles to translate the terms correctly (in both cases 1 and 2).

Bucket | # PUNC | # NN* | # (JJ*,RB*,VB*) | # UNK | # OTHER | # ALL
1      |  1,300 |   971 |             433 |     0 |      63 | 2,767
2      |    148 | 1,520 |             567 |   186 |     346 | 2,767
3      |     12 | 1,926 |             531 |    97 |     201 | 2,767
4      |      2 | 2,298 |             308 |     4 |     155 | 2,767
5      |      0 | 2,377 |             208 |     3 |     179 | 2,767
6      |      0 | 2,336 |             134 |     5 |     292 | 2,767

Table 9: Statistics of constraint tokens within each bucket in the self-constrained translation study, where tokens are categorized according to their Part-Of-Speech tags. PUNC denotes punctuation; NN* denotes all sets of nouns (POS tags starting with NN, including NN, NNP, NNS, NNPS, etc.); JJ*, RB* and VB* denote all kinds of adjectives, adverbs, and verbs; UNK covers constraints with the UNK token and some special symbols; the rest are denoted as OTHER.

C Unraveling the Buckets in Self-Constrained Translation
In this section, we dig further into the buckets of the self-constrained translation study (§3.3, §6.1), especially to understand why Drop#1 happens. As seen in Table 9, we categorize and count the constraints into five classes based on their Part-Of-Speech tags obtained with NLTK (Bird et al., 2009). We find that 1) punctuation (PUNC) dominates bucket 1; 2) as the constraint frequency decreases (from bucket 1 to bucket 6), the number of constraints identified as nouns (NN*) grows; and 3) bucket 2 has the most UNK constraints. The third finding arises because BPE training was done only on the training set, so <UNK> tokens appear on the target side of the test set. Thus, cases in bucket 2 have a relatively large number of UNK tokens as constraints, resulting in Drop#1.
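The categorization can be sketched as a small tag-mapping function (the tags themselves would come from nltk.pos_tag; the exact handling of special symbols in the paper may differ):

```python
PUNC_TAGS = {".", ",", ":", "``", "''", "(", ")", "$", "#"}

def pos_bucket(token, tag):
    """Assign a constraint token to one of the five Table 9 classes,
    given its Penn Treebank POS tag."""
    if token == "<UNK>":
        return "UNK"
    if tag in PUNC_TAGS:
        return "PUNC"
    if tag.startswith("NN"):
        return "NN*"
    if tag.startswith(("JJ", "RB", "VB")):
        return "JJ*/RB*/VB*"
    return "OTHER"
```

Counting `pos_bucket` outputs per frequency bucket reproduces the kind of breakdown shown in Table 9.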
To give a clearer view of how UNK causes Drop#1, we exclude samples with UNK as constraints and obtain the revised self-constrained translation results shown in Figure 5. Clearly, Drop#1 disappears in this setting, while Drop#2 remains, which still supports our claim in the paper.