A Template-based Method for Constrained Neural Machine Translation

Machine translation systems are expected to cope with various types of constraints in many practical scenarios. While neural machine translation (NMT) has achieved strong performance in unconstrained cases, it is non-trivial to impose pre-specified constraints on the translation process of NMT models. Although many approaches have been proposed to address this issue, most existing methods cannot satisfy the following three desiderata at the same time: (1) high translation quality, (2) high match accuracy, and (3) low latency. In this work, we propose a template-based method that yields results with high translation quality and match accuracy, while its inference speed is comparable with that of unconstrained NMT models. Our basic idea is to rearrange the generation of constrained and unconstrained tokens through a template. Our method requires no changes to the model architecture or the decoding algorithm. Experimental results show that the proposed template-based approach outperforms several representative baselines in both lexically and structurally constrained translation tasks.


Introduction
Constrained machine translation is of important value for a wide range of practical applications, such as interactive translation with user-specified lexical constraints (Koehn, 2009;Jon et al., 2021), domain adaptation with in-domain dictionaries (Michon et al., 2020;Niehues, 2021), and webpage translation with markup tags as structural constraints (Hashimoto et al., 2019;Hanneman and Dinu, 2020). Developing constrained neural machine translation (NMT) approaches can make NMT models applicable to more real-world scenarios (Bergmanis and Pinnis, 2021).
However, it is challenging to directly impose constraints on NMT models due to their end-to-end nature (Post and Vilar, 2018). To address this problem, one branch of studies modifies the decoding algorithm to take the constraints into account when selecting candidates (Hokamp and Liu, 2017; Hasler et al., 2018; Post and Vilar, 2018; Hu et al., 2019; Hashimoto et al., 2019). Although constrained decoding algorithms can guarantee the presence of constrained tokens, they can significantly slow down the translation process (Wang et al., 2022) and can sometimes result in poor translation quality (Zhang et al., 2021).
Another branch of works constructs synthetic data to help NMT models acquire the ability to translate with constraints (Song et al., 2019;Dinu et al., 2019;Michon et al., 2020). For instance, Hanneman and Dinu (2020) propose to inject markup tags into plain parallel texts to learn structurally constrained NMT models. The major drawback of data augmentation based methods is that they sometimes violate the constraints (Hanneman and Dinu, 2020;Chen et al., 2021), limiting their application in constraint-critical situations.
In this work, we use free tokens to denote the tokens that are not covered by the provided constraints. Our motivation is to decompose the whole constrained translation task into the arrangement of constraints and the generation of free tokens. The constraints can be of many types, ranging from phrases in lexically constrained translation to markup tags in structurally constrained translation. Intuitively, only arranging the provided constraints into the proper order is much easier than generating the whole sentence. Therefore, we build a template by abstracting free token fragments into nonterminals, which are used to record the relative position of all the involved fragments. The template can be treated as a plan of the original sentence. The arrangement of constraints can be learned through a template generation sub-task.
Once the template is generated, we need some derivation rules to convert the nonterminals mentioned above into free tokens. Each derivation rule shows the correspondence between a nonterminal and a free token fragment. These rules can be learned by the NMT model through semi-structured data. We call this sub-task template derivation. During inference, the model firstly generates the template and then extends each nonterminal in the template into natural language text. Note that the two proposed sub-tasks can be accomplished through a single decoding pass. Thus the decoding speed of our method is comparable with unconstrained NMT systems. By designing template format, our approach can cope with different types of constraints, such as lexical constraints, XML structural constraints, or Markdown constraints.
Contributions In summary, the contributions of this work can be listed as follows:
• We propose a novel template-based constrained translation framework to disentangle the generation of constraints and free tokens.
• We instantiate the proposed framework with both lexical and structural constraints, demonstrating the flexibility of this framework.
• Experiments show that our method can outperform several strong baselines, achieving high translation quality and match accuracy while maintaining the inference speed.
Related Work

Lexically Constrained Translation
Several researchers have focused on modifying the decoding algorithm to impose lexical constraints (Hasler et al., 2018). For instance, Hokamp and Liu (2017) propose grid beam search (GBS), which organizes candidates in a grid and enumerates the provided constrained tokens at each decoding step. However, the computational complexity of GBS scales linearly with the number of constrained tokens. To reduce the runtime complexity, Post and Vilar (2018) propose dynamic beam allocation (DBA), which divides a fixed-size beam into groups of candidates that have met the same number of constraints. Hu et al. (2019) further vectorize DBA. The resulting VDBA algorithm is still significantly slower than the vanilla beam search algorithm (Wang et al., 2022).
Another line of studies trains the model to copy the constraints through data augmentation. Song et al. (2019) propose to replace the corresponding source phrases with the target constraints, and Dinu et al. (2019) propose to insert target constraints as inline annotations. Some other works propose to append target constraints to the whole source sentence as side constraints (Chen et al., 2020; Niehues, 2021; Jon et al., 2021). Although these methods introduce little additional computational overhead at inference time, they cannot guarantee the appearance of the constraints (Chen et al., 2021). Xiao et al. (2022) transform constrained translation into a bilingual text-infilling task. A limitation of text-infilling is that it cannot reorder the constraints, which may negatively affect the translation quality for distant language pairs, where the word order differs between source and target.
Recently, some researchers have tried to adapt the architecture of NMT models for this task. Susanto et al. (2020) adopt non-autoregressive translation models (Gu et al., 2019) to insert target constraints. Wang et al. (2022) prepend vectorized keys and values to the attention modules (Vaswani et al., 2017) to integrate constraints. However, their model may still suffer from low match accuracy when decoding without VDBA. In this work, our method can achieve high translation quality and match accuracy without significantly increasing the inference overhead.

Structurally Constrained Translation
Structurally constrained translation is useful since text data is often wrapped with markup tags on the Web (Hashimoto et al., 2019), which is an essential source of information for humans. Compared with lexically constrained translation, structurally constrained translation is relatively unexplored. Joanis et al. (2013) examine a two-stage method for statistical machine translation systems, which firstly translates the plain text and then injects the tags based on phrase alignments and some carefully designed rules. Moving to the NMT paradigm, large-scale parallel corpora with structurally aligned markup tags are scarce. Hanneman and Dinu (2020) propose to inject tags into plain text to create synthetic data. Hashimoto et al. (2019) collect a parallel dataset consisting of structural text translated by human experts. Zhang et al. (2021) propose a constrained decoding algorithm to translate structured text. However, their method significantly slows down the translation process.
In this work, our approach can be easily extended to structural constraints, leaving the decoding algorithm unchanged. The template in our approach can be seen as an intermediate plan, an idea that has been investigated in the field of data-to-text generation (Moryossef et al., 2019). Prior work has also explored the idea of disentangling different parts of a sentence using special tokens.

Template-based Machine Translation
Given a source-language sentence $x = x_1 \cdots x_I$ and a target-language sentence $y = y_1 \cdots y_J$, an NMT model is trained to estimate the conditional probability $P(y \mid x; \theta)$, which can be given by

$$P(y \mid x; \theta) = \prod_{j=1}^{J} P(y_j \mid x, y_{<j}; \theta), \tag{1}$$

where $\theta$ is the set of parameters to optimize and $y_{<j}$ is the partial translation at the $j$-th step.
In this work, we first build a template to simplify the whole sentence. Formally, we use $s$ and $t$ to represent the source- and target-side templates, respectively. In the template, free token fragments are abstracted into nonterminals. We use $e$ and $f$ to denote the derivation rules of the nonterminals for the source and target template, respectively.
The model is trained on two sub-tasks. Firstly, the model learns to generate the target template $t$:

$$P(t \mid s, e; \theta) = \prod_{j=1}^{T} P(t_j \mid s, e, t_{<j}; \theta). \tag{2}$$

Secondly, we train the same model to estimate the conditional probability of $f$:

$$P(f \mid s, e, t; \theta) = \prod_{j=1}^{F} P(f_j \mid s, e, t, f_{<j}; \theta). \tag{3}$$

Here $T$ and $F$ denote the lengths of $t$ and $f$, respectively. The target sentence $y$ can be reconstructed by extending each nonterminal in $t$ using the corresponding derivation rule in $f$. We can jointly learn the two sub-tasks in one pass to improve both the training and inference efficiency. Formally, the model is trained to maximize the following joint probability of $t$ and $f$ in practice:

$$P(t, f \mid s, e; \theta) = P(t \mid s, e; \theta) \times P(f \mid s, e, t; \theta). \tag{4}$$

Template for Lexical Constraints
In lexically constrained translation, some source phrases in the input sentence are required to be translated into pre-specified target phrases. For a source sentence $x$, we use $\{\langle u^{(n)}, v^{(n)} \rangle\}_{n=1}^{N}$ to denote the given constraint pairs, where $u^{(n)}$ is the $n$-th source constraint and $v^{(n)}$ is the corresponding target constraint. The $N$ source constraints divide $x$ into $2N + 1$ fragments:

$$x = p^{(0)} u^{(1)} p^{(1)} \cdots u^{(N)} p^{(N)}, \tag{5}$$

where $p^{(n)}$ is the $n$-th free token fragment. We can set $p^{(0)}$ to an empty string to represent sentences that start with a constraint, and set $p^{(N)}$ to an empty string for sentences that end with a constraint. We can also set $p^{(n)}$ to an empty string for the cases where $u^{(n)}$ and $u^{(n+1)}$ are adjacent in $x$. Similarly, the target sentence can be represented by

$$y = q^{(0)} v^{(i_1)} q^{(1)} \cdots v^{(i_N)} q^{(N)}, \tag{6}$$

where $q^{(n)}$ is the $n$-th free token fragment in the target sentence $y$. We use $i_1, \cdots, i_N$ to denote the order of the constraints in $y$. The $n$-th index $i_n$ is not necessarily equal to $n$, since the order of the constraints in the target sentence $y$ is often different from that in the source sentence $x$.

We then abstract each fragment of text into a nonterminal to build the template for lexically constrained translation. Concretely, the $n$-th free token fragment in the source sentence $x$ is abstracted into $X_n$, for each $n \in \{0, \cdots, N\}$. The $n$-th free token fragment in the target sentence is abstracted into $Y_n$, for each $n \in \{0, \cdots, N\}$. In order to indicate the alignment between corresponding source and target constraints, we abstract $u^{(n)}$ and $v^{(n)}$ into the same nonterminal $C_n$. Note that $X_n$ and $Y_n$ are not linked nonterminals, since fragments of free tokens are not bilingually aligned. The resulting source- and target-side templates are given by

$$s = X_0 C_1 X_1 \cdots C_N X_N, \qquad t = Y_0 C_{i_1} Y_1 \cdots C_{i_N} Y_N. \tag{7}$$

We need to define some derivation rules to convert the template into a natural language sentence. The derivation of nonterminals can be seen as the inverse of the abstraction process. Thus the derivation of the target-side template $t$ would be

$$t \Rightarrow q^{(0)} C_{i_1} Y_1 \cdots C_{i_N} Y_N \Rightarrow q^{(0)} v^{(i_1)} Y_1 \cdots C_{i_N} Y_N \Rightarrow \cdots \Rightarrow q^{(0)} v^{(i_1)} q^{(1)} \cdots v^{(i_N)} q^{(N)} = y. \tag{8}$$

Figure 1: Example for lexically constrained translation. Source x: 歌曲 七里香 的演唱者是 周杰伦 。 Target y: Jay Chou sang the song Orange Jasmine . The constraints are ⟨周杰伦, Jay Chou⟩ and ⟨七里香, Orange Jasmine⟩. Note that $X_n$ and $Y_n$ are not linked nonterminals, since the source and target free token fragments are not necessarily aligned. The derivation rule $X_0$ → 歌曲 is learned through the concatenation of $X_0$ and 歌曲 (i.e., $X_0$ 歌曲). "ϕ" denotes an empty string. See Section 3.2 for more details.
The derivation of the source-side template $s$ can be defined similarly. Note that $C_n$ produces the $n$-th source constraint $u^{(n)}$ at the source side while producing the target constraint $v^{(n)}$ at the target side. In order to make the derivation rules learnable by NMT models, we propose to use the concatenation of the nonterminal and the corresponding sequence of terminals to denote each derivation rule. For example, we use $Y_n\, q^{(n)}$ to represent $Y_n \rightarrow q^{(n)}$. We use $d$ and $f$ to denote the derivation of constraints and free tokens at the target side, respectively:

$$d = C_1 v^{(1)} \cdots C_N v^{(N)}, \qquad f = Y_0 q^{(0)} \cdots Y_N q^{(N)}. \tag{9}$$

At the source side, we use $c$ and $e$ to denote the derivation of constraints and free tokens, respectively; they are defined analogously. Since the constraints are pre-specified by the users, the model only needs to learn the derivation of free tokens. To this end, we place the derivation of constraint-related nonterminals before the template as a conditional prefix. The model then learns the generation of the template and the derivation of free tokens, step by step.
The final format of the input and output sequences at training time can be given by

$$x' = c\ \texttt{<sep>}\ s\ \texttt{<sep>}\ e, \qquad y' = d\ \texttt{<sep>}\ t\ \texttt{<sep>}\ f, \tag{10}$$

respectively. We use the delimiter <sep> to separate the template and the derivations. Figure 1 gives an example of both $x'$ and $y'$. At inference time, we feed $x'$ to the encoder, and provide "d <sep>" to the decoder as the constrained prefix. Then the model generates the remaining part of $y'$ (i.e., "t <sep> f").
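The construction of the training sequences can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' preprocessing script: the function names (`build_template`, `constraint_derivation`, `build_example`) are hypothetical, sentences are assumed to be pre-tokenized, each constraint is assumed to occur exactly once without overlapping another, and constraints are numbered in the order they are given.

```python
SEP = "<sep>"


def build_template(tokens, phrases, free_symbol):
    """Abstract a tokenized sentence into (template, free-token derivation).

    phrases[k] is the constraint with index k + 1. Assumption of this sketch:
    every phrase occurs in `tokens` and the phrases do not overlap.
    """
    spans = []
    for idx, phrase in enumerate(phrases, start=1):
        n = len(phrase)
        # Use the first occurrence of the phrase in the sentence.
        start = next(i for i in range(len(tokens) - n + 1)
                     if tokens[i:i + n] == phrase)
        spans.append((start, start + n, idx))
    spans.sort()  # left-to-right order of the constraints in this sentence

    template, derivation = [], []
    cursor = 0
    for k, (start, end, idx) in enumerate(spans):
        template += [f"{free_symbol}{k}", f"C{idx}"]
        # Rule "X_k -> fragment" is written as the concatenation "X_k fragment".
        derivation += [f"{free_symbol}{k}"] + tokens[cursor:start]
        cursor = end
    last = len(spans)
    template.append(f"{free_symbol}{last}")
    derivation += [f"{free_symbol}{last}"] + tokens[cursor:]
    return template, derivation


def constraint_derivation(phrases):
    """Write the rules 'C_n -> phrase' as one flat token sequence."""
    out = []
    for idx, phrase in enumerate(phrases, start=1):
        out += [f"C{idx}"] + phrase
    return out


def build_example(src_tokens, tgt_tokens, pairs):
    """Return (x', y') as whitespace-joined strings."""
    src_phrases = [u for u, _ in pairs]
    tgt_phrases = [v for _, v in pairs]
    s, e = build_template(src_tokens, src_phrases, "X")
    t, f = build_template(tgt_tokens, tgt_phrases, "Y")
    c = constraint_derivation(src_phrases)
    d = constraint_derivation(tgt_phrases)
    x_prime = c + [SEP] + s + [SEP] + e
    y_prime = d + [SEP] + t + [SEP] + f
    return " ".join(x_prime), " ".join(y_prime)


if __name__ == "__main__":
    x_prime, y_prime = build_example(
        "歌曲 七里香 的演唱者是 周杰伦 。".split(),
        "Jay Chou sang the song Orange Jasmine .".split(),
        [(["周杰伦"], ["Jay", "Chou"]), (["七里香"], ["Orange", "Jasmine"])],
    )
    print(x_prime)
    print(y_prime)
```

For the running example, the target-side sequence produced by this sketch is "C1 Jay Chou C2 Orange Jasmine <sep> Y0 C1 Y1 C2 Y2 <sep> Y0 Y1 sang the song Y2 .", where the empty fragment after Y0 is simply left blank.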
Figure 2: The template can be converted into a natural language sentence by replacing the nonterminals according to the corresponding derivation rules. The last derivation steps of the running example are: "Jay Chou sang the song C2 Y2" → "Jay Chou sang the song Orange Jasmine Y2" → "Jay Chou sang the song Orange Jasmine .".

Figure 2 explains the way we convert the output sequence into a natural language sentence. The conversion from the template to the target-language sentence can be done through a simple script, and the computational cost caused by the conversion is negligible compared with the model inference.
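The conversion script mentioned above could look roughly like the following. This is a hedged sketch, not the released script: the function names are hypothetical, and it assumes that nonterminals are written as `C<n>` and `Y<n>` and never collide with ordinary vocabulary tokens. A nonterminal that has no rule is expanded into an empty string.

```python
import re

SEP = "<sep>"
# Assumption: nonterminals look like C1, Y0, ... and are reserved symbols.
NONTERMINAL = re.compile(r"^[CY]\d+$")


def parse_rules(tokens):
    """Read 'N tok tok N tok ...' into a dict {nonterminal: [tokens]}."""
    rules, current = {}, None
    for tok in tokens:
        if NONTERMINAL.match(tok):
            current = tok
            rules[current] = []
        elif current is not None:
            rules[current].append(tok)
    return rules


def reconstruct(output):
    """Turn 'd <sep> t <sep> f' into the final target sentence."""
    d, t, f = (part.split() for part in output.split(SEP))
    rules = {**parse_rules(d), **parse_rules(f)}
    words = []
    for symbol in t:
        # A nonterminal without a rule is replaced by an empty string.
        words.extend(rules.get(symbol, []))
    return " ".join(words)


if __name__ == "__main__":
    y_prime = ("C1 Jay Chou C2 Orange Jasmine <sep> "
               "Y0 C1 Y1 C2 Y2 <sep> Y0 Y1 sang the song Y2 .")
    print(reconstruct(y_prime))
```

The whole conversion is a single linear pass over the output tokens, which is consistent with the claim that its cost is negligible next to decoding.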
Note that we also abstract the constraints when building the template. In this way, the model only needs to generate the order of the constraints rather than copy all the specific tokens, which may suffer from copy failure (Chen et al., 2021). The formal representation of our lexically constrained model is therefore slightly different from that defined in Eq. (4), and becomes

$$P(t, f \mid c, s, e, d; \theta) = P(t \mid c, s, e, d; \theta) \times P(f \mid c, s, e, d, t; \theta). \tag{11}$$

Figure 3: Example for structurally constrained translation. Source x: 歌曲 <i> 七里香 </i> 的演唱者是 <b> 周杰伦 </b> 。 Target y: <b> Jay Chou </b> sang the song <i> Orange Jasmine </i> . The markup tags are reserved in the template, while free tokens are abstracted. Note that $X_n$ and $Y_n$ are not linked nonterminals. See Section 3.3 for more details.

Template for Structural Constraints
The major challenge of structured text translation is to maintain the correctness of the structure, which is often indicated by markup tags (Hashimoto et al., 2019). The proposed framework can also deal with structurally constrained translation. As before, we replace free token fragments with nonterminals to build the template, while the markup tags are reserved. Figure 3 shows an example. Formally, given a sentence pair ⟨x, y⟩ with $N$ markup tags, the source- and target-side templates are given by

$$s = X_0\ \texttt{<tag}_1\texttt{>}\ X_1 \cdots \texttt{<tag}_N\texttt{>}\ X_N, \qquad t = Y_0\ \texttt{<tag}_{i_1}\texttt{>}\ Y_1 \cdots \texttt{<tag}_{i_N}\texttt{>}\ Y_N, \tag{12}$$

respectively. The order of markup tags at the target side (i.e., $i_1 \cdots i_N$) may be different from that at the source side (i.e., $1 \cdots N$). For each $n \in \{0, \cdots, N\}$, $X_n$ can be derived into the $n$-th source-side free token fragment $p^{(n)}$, and $Y_n$ can be extended into the target-side free token fragment $q^{(n)}$. $X_n$ and $Y_n$ are not linked. The derivation sequences can be defined as

$$e = X_0 p^{(0)} \cdots X_N p^{(N)}, \qquad f = Y_0 q^{(0)} \cdots Y_N q^{(N)}. \tag{13}$$

The format of the input and output would be

$$x' = s\ \texttt{<sep>}\ e, \qquad y' = t\ \texttt{<sep>}\ f, \tag{14}$$

respectively. Figure 3 illustrates an example for both $x'$ and $y'$. The formal representation of our structurally constrained model is the same as Eq. (4). The model arranges the markup tags when generating $t$ and completes the whole sentence when generating $f$, which is consistent with our motivation to decompose the whole task into constraint arrangement and free token generation.
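To make the structural case concrete, the sketch below builds the "template <sep> derivation" sequence for one side of a tagged sentence. It is an illustration under stated assumptions: the function name `build_structural` is hypothetical, tags are recognized with a simple regular expression rather than an XML parser, and opening and closing tags are each counted as one markup tag.

```python
import re

SEP = "<sep>"
# Assumption: a markup tag is anything of the form <name ...> or </name>.
TAG = re.compile(r"</?[A-Za-z][^<>]*>")


def build_structural(sentence, free_symbol):
    """Return 'template <sep> derivation' for one tagged sentence."""
    template, derivation = [], []
    cursor, k = 0, 0
    for match in TAG.finditer(sentence):
        fragment = sentence[cursor:match.start()].split()
        # The tag itself is kept in the template; only free tokens are abstracted.
        template += [f"{free_symbol}{k}", match.group()]
        derivation += [f"{free_symbol}{k}"] + fragment
        cursor, k = match.end(), k + 1
    template.append(f"{free_symbol}{k}")
    derivation += [f"{free_symbol}{k}"] + sentence[cursor:].split()
    return " ".join(template + [SEP] + derivation)


if __name__ == "__main__":
    target = "<b> Jay Chou </b> sang the song <i> Orange Jasmine </i> ."
    print(build_structural(target, "Y"))
```

Applied to the source sentence with the symbol "X", the same function would produce the encoder input; with "Y" it produces the decoder output for training.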

Setup
Parallel Data We conduct experiments on two language pairs: English-Chinese and English-German. For English-Chinese, we use the WMT17 dataset as the training corpus, consisting of 20.6M sentence pairs. For English-German, the training data is from WMT20, containing 41.0M sentence pairs. We provide more details of data preprocessing in the Appendix. Following recent studies on lexically constrained translation (Chen et al., 2021; Wang et al., 2022), we evaluate our method on human-annotated alignment test sets. For English-Chinese, both the validation and test sets are from Liu et al. (2005). For English-German, the test set is from Zenkel et al. (2020). We use newstest2013 as the validation set, whose word alignment is annotated by fast-align. The training sets are filtered to exclude test and validation sentences.
Lexical Constraints Following some recent works (Song et al., 2019; Chen et al., 2020, 2021; Wang et al., 2022), we simulate real-world lexically constrained translation scenarios by sampling constraints from a phrase table that is extracted from parallel sentence pairs based on word alignment. The script used to create the constraints is publicly available. Specifically, the number of constraints for each sentence pair ranges between 0 and 3, and the length of each constraint ranges between 1 and 3 tokens. We use fast-align to build the alignment of the training data.
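The sampling step can be illustrated as follows. This is only a sketch of the stated limits (0 to 3 constraints, 1 to 3 tokens each) and not the publicly available script: the candidate format (half-open token spans under the keys `src` and `tgt`), the function names, the uniform choice of the number of constraints, and the non-overlap requirement are all assumptions made for this example.

```python
import random


def disjoint(a, b):
    """True if the half-open spans a and b do not overlap."""
    return a[1] <= b[0] or b[1] <= a[0]


def sample_constraints(candidates, rng, max_constraints=3, max_len=3):
    """Pick up to `max_constraints` non-overlapping aligned phrase pairs.

    candidates: list of dicts with half-open token spans under 'src' and 'tgt',
    e.g. extracted from a word alignment (hypothetical format).
    """
    usable = [
        c for c in candidates
        if 1 <= c["src"][1] - c["src"][0] <= max_len
        and 1 <= c["tgt"][1] - c["tgt"][0] <= max_len
    ]
    rng.shuffle(usable)
    budget = rng.randint(0, max_constraints)  # 0 keeps unconstrained examples
    chosen = []
    for cand in usable:
        if len(chosen) == budget:
            break
        if all(disjoint(cand["src"], o["src"]) and disjoint(cand["tgt"], o["tgt"])
               for o in chosen):
            chosen.append(cand)
    return chosen


if __name__ == "__main__":
    candidates = [
        {"src": (0, 1), "tgt": (3, 4)},
        {"src": (0, 2), "tgt": (3, 5)},
        {"src": (2, 6), "tgt": (0, 2)},  # too long on the source side
        {"src": (4, 5), "tgt": (6, 7)},
    ]
    print(sample_constraints(candidates, random.Random(0)))
```

Allowing a budget of zero means that some training examples carry no constraints at all, which matches the stated range and keeps the model usable for unconstrained input.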

Model Configuration
We adopt Transformer (Vaswani et al., 2017) as our NMT model, which is optimized by Adam (Kingma and Ba, 2015) with β 1 = 0.9, β 2 = 0.98 and ϵ = 10 −9 . Please refer to Appendix for more details on the model configuration and the training process.

Evaluation Metrics
We follow Alam et al. (2021a) to use the following four metrics to make a thorough comparison of the involved methods:
• BLEU (Papineni et al., 2001): measuring the translation quality of the whole sentence;
• Exact Match: indicating the accuracy with which the source constraints in the input sentences are translated into the provided target constraints;
• Window Overlap: quantifying the overlap ratio between the hypothesis and the reference windows for each matched target constraint, indicating whether this constraint is placed in a suitable context. The window size is set to 2;
• 1-TERm: modifying TER (Snover et al., 2006) by setting the edit cost of constrained tokens to 2 and the cost of free tokens to 1.
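Two of the metrics above are simple enough to sketch directly. The code below is a simplified reading of Exact Match and Window Overlap and is not the official evaluation script of Alam et al. (2021a): the function names are hypothetical, matching is done on whole tokens, and the overlap is normalized by the size of the reference window, which is an assumption of this sketch.

```python
from collections import Counter


def find(tokens, phrase):
    """Index of the first occurrence of `phrase` in `tokens`, or -1."""
    n = len(phrase)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == phrase:
            return i
    return -1


def exact_match(hyps, constraints):
    """Fraction of target constraints that appear verbatim in the hypotheses."""
    total = matched = 0
    for hyp, phrases in zip(hyps, constraints):
        for phrase in phrases:
            total += 1
            matched += find(hyp, phrase) >= 0
    return matched / total if total else 1.0


def context(tokens, start, length, window):
    """Up to `window` tokens on each side of the matched phrase."""
    left = tokens[max(0, start - window):start]
    right = tokens[start + length:start + length + window]
    return left + right


def window_overlap(hyp, ref, phrase, window=2):
    """Overlap between hypothesis and reference windows around `phrase`.

    Returns None when the phrase is missing on either side or the
    reference window is empty.
    """
    i, j = find(hyp, phrase), find(ref, phrase)
    if i < 0 or j < 0:
        return None
    hyp_ctx = context(hyp, i, len(phrase), window)
    ref_ctx = context(ref, j, len(phrase), window)
    if not ref_ctx:
        return None
    common = sum((Counter(hyp_ctx) & Counter(ref_ctx)).values())
    return common / len(ref_ctx)


if __name__ == "__main__":
    hyp = "Jay Chou sang the song Orange Jasmine .".split()
    ref = "The song Orange Jasmine was sung by Jay Chou .".split()
    print(exact_match([hyp], [[["Jay", "Chou"], ["Orange", "Jasmine"]]]))
    print(window_overlap(hyp, ref, ["Orange", "Jasmine"]))
```

A constraint can be matched exactly and still receive a low window overlap when it is placed among the wrong neighbours, which is why the two metrics are reported side by side.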

Main Results
Template Accuracy We first examine the performance of the model in the template generation sub-task before investigating the translation performance. We compare the target-side template extracted from the reference sentence with the one generated by the model to calculate the accuracy of template generation. Formally, a generated template is counted as correct if it contains every nonterminal of the reference template $t$. In other words, the model must generate all the nonterminals to guarantee the presence of the provided constraints. However, the order of constraint-related nonterminals can be flexible, since there often exist various suitable orders for the provided constraints. In both English-Chinese and English-German, the template accuracy of our model is 100%. An interesting finding is that our model learns to reorder the constraints according to the style of the target language. We provide an example of constraint reordering in Table 1.
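One plausible reading of this criterion is sketched below. The exact formal definition is not reproduced here, so the check is an assumption: a template for N constraints is accepted if the free-token nonterminals Y0 ... YN appear in order at the even positions and the constraint nonterminals at the odd positions form a permutation of C1 ... CN. The function name is hypothetical.

```python
def template_is_correct(generated, num_constraints):
    """Check a generated target-side template for N = num_constraints.

    Assumed criterion: free-token nonterminals are complete and in order,
    constraint nonterminals are complete but may appear in any order.
    """
    tokens = generated.split()
    n = num_constraints
    if len(tokens) != 2 * n + 1:
        return False
    free = tokens[0::2]         # Y0, Y1, ..., YN
    constraints = tokens[1::2]  # some ordering of C1, ..., CN
    expected_free = [f"Y{k}" for k in range(n + 1)]
    expected_constraints = sorted(f"C{k}" for k in range(1, n + 1))
    return free == expected_free and sorted(constraints) == expected_constraints


if __name__ == "__main__":
    print(template_is_correct("Y0 C2 Y1 C1 Y2", 2))  # reordered constraints
    print(template_is_correct("Y0 C1 Y1 C1 Y2", 2))  # C2 is missing
```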
When generating the free token derivation $f$, the model can recall all the nonterminals (i.e., $Y_n$) presented in the template $t$ in English-Chinese. In English-German, however, the model omits one free token nonterminal, with a frequency of 0.2%. We use empty strings for the omitted nonterminals when reconstructing the output sentence.

Translation Performance
Table 2 shows the results of lexically constrained translation, demonstrating that all the investigated methods can recall more provided constraints than the unconstrained Transformer model. Our approach can improve the BLEU score over the involved baselines. This improvement potentially comes from two aspects: (1) our system outputs can match more pre-specified constraints compared to some baselines, such as AttnVector (Wang et al., 2022) (100% vs. 93.8%); (2) our method can place more constraints in appropriate context, which can be measured by window overlap. The exact match accuracy of VDBA (Hu et al., 2019) is lower than 100% due to the out-of-vocabulary problem in English-Chinese.
TextInfill (Xiao et al., 2022) and our approach can achieve 100% exact match accuracy in both language pairs. However, TextInfill can only place the constraints in the pre-specified order, while our approach can automatically reorder the constraints. As a result, the window overlap score of our approach is higher than that of TextInfill. Please refer to Table 8 in the Appendix for more translation examples of both our method and some baselines.

Constraints: ⟨slowing down, 减弱⟩; ⟨price hike, 价格上涨⟩
Source: Analysts are concerned that since there is no sign yet of any slowing down of this price hike , the prospect of the British real estate market as where it is heading now is far from optimistic.
Reference: 分析家担心, 由于目前还看不见 价格上涨 趋势有 减弱 的迹象, 照此发展下去, 英国房地产市场前景堪忧。
Input (enc): C1 slowing down C2 price hike <sep> X0 C1 X1 C2 X2 <sep> X0 Analysts are concerned that since there is no sign yet of any X1 of this X2 , the prospect of the British real estate market as where it is heading now is far from optimistic.
Table 1: An example of our method. We replace the nonterminals in the template using the derivation rules to reconstruct the final result (i.e., "Result"). Surprisingly, we find that our model can automatically sort the provided constraints when generating the template. In this example, C1 is before C2 in the source-side template. But in the target-side template generated by our model, C2 is before C1, which is more suitable for the target language.

Table 2: Results of the lexically constrained translation task for both English-Chinese and English-German. For clarity, we highlight the highest score in bold and the second-highest score with underlines.

Unconstrained Translation
A concern for lexically constrained translation methods is that they may cause poor translation quality in unconstrained translation scenarios. We thus evaluate our approach in the standard translation task, where the model is only provided with the source sentence $x$. With no constraints ($N = 0$), the constraint derivations are empty and each template reduces to a single nonterminal, so the input and output can be given by

$$x' = \texttt{<sep>}\ X_0\ \texttt{<sep>}\ X_0\, x, \qquad y' = \texttt{<sep>}\ Y_0\ \texttt{<sep>}\ Y_0\, y,$$

respectively. The BLEU scores of our method are 42.6 and 25.0 for English-Chinese and English-German, respectively. The performance of our method is comparable with the vanilla model, which dispels the concern that our approach may worsen the unconstrained translation quality.

Table 3 shows the decoding speed. Since we did not change the model architecture or the decoding algorithm, the speed of our method is close to that of the vanilla Transformer model (Vaswani et al., 2017). Our inference time is nonetheless slightly longer, because the output sequence $y'$ is longer than the original target-language sentence $y$.

We vary the amount of training data to investigate the effect of data scale on our approach. Figure 4 shows the results. The BLEU score increases with the data size, while the window overlap score reaches its highest value when using 10.0M training examples. When using all the training data, the 1-TERm metric achieves its best value. We find that the exact match accuracy of our method is maintained at 100%, even with only 0.6M training examples. This trend implies that our method can be applied in some low-resource scenarios.

Table 4: Results of the structurally constrained translation task. We highlight the highest score in bold and the second-highest score with underlines.

More Analysis
Due to space limitation, we place a more detailed analysis of our approach in Appendix, including the effect of the alignment model, the performance on more language pairs, and the domain robustness of our model, which is evaluated on the WMT21 terminology translation task (Alam et al., 2021b) that lies in the COVID-19 domain.

Setup
Data We conduct our experiments on the dataset released by Hashimoto et al. (2019), which supports translation from English to seven other languages. We select four languages: French, Russian, Chinese, and German. For each language pair, the training set contains roughly 100K sentence pairs. We report the results on the validation sets since the test sets are not open-sourced. We follow Hashimoto et al. (2019) to use SentencePiece to preprocess the data, which supports user-defined special symbols. The model type of SentencePiece is set to unigram, and the vocabulary size is set to 9000. For English-Chinese, we over-sample the English sentences when learning the joint tokenizer, since Chinese has more unique characters than English (Hashimoto et al., 2019). We did not perform over-sampling for other language pairs. We register the XML tags and URL placeholders as user-defined special symbols. In addition, we also register &amp;, &lt;, and &gt; as special tokens, following Hashimoto et al. (2019).

Model Configuration Since the data scale for structurally constrained translation is much smaller than lexically constrained translation, we follow Hashimoto et al. (2019) to set the width of the model to 256 and the depth of the model to 6. See Section B.1 in Appendix for more details.
Baselines We compare our approach with the following three baselines:
• Remove: removing the markup tags and only translating the plain text;
• Split-Inject (Al-Anzi et al., 1997): splitting the input sentence based on the markup tags, translating each text fragment independently, and finally injecting the tags;
• XML (Hashimoto et al., 2019): directly learning the NMT model end-to-end using parallel sentences with XML tags.
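The Split-Inject baseline can be sketched as follows to show why it preserves the structure but loses context. This is an illustrative reading of the description above, not the baseline's actual implementation: `split_inject` is a hypothetical name, tags are found with a regular expression, and `translate` stands for any function that maps one plain-text fragment to its translation.

```python
import re

# Assumption: a markup tag is anything of the form <name ...> or </name>.
TAG = re.compile(r"(</?[A-Za-z][^<>]*>)")


def split_inject(sentence, translate):
    """Translate each tag-free fragment on its own and keep tags in place."""
    pieces = TAG.split(sentence)  # the capturing group keeps the tags
    output = []
    for piece in pieces:
        if TAG.fullmatch(piece):
            output.append(piece)  # tags are injected back unchanged
        elif piece.strip():
            # Each fragment is translated without seeing its neighbours.
            output.append(translate(piece.strip()))
    return " ".join(output)


if __name__ == "__main__":
    # str.upper stands in for a real translation system in this demo.
    print(split_inject("Click <b> Save </b> to finish .", str.upper))
```

Because every fragment is translated in isolation and the tags stay in their source positions, this scheme can neither use cross-fragment context nor reorder the tags for the target language.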

Evaluation Metrics
We follow Hashimoto et al. (2019) to use the following metrics:
• BLEU: considering the structure when estimating the BLEU score (Papineni et al., 2001);
• Structure Accuracy: utilizing the etree package to check whether the system output is a valid XML structure (i.e., Correct), and whether the output structure exactly matches the structure of the given reference (i.e., Match).
All the metrics are calculated using the evaluation script released by Hashimoto et al. (2019).
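The two structure checks can be approximated with the Python standard library. The sketch below is not the released evaluation script: it assumes `xml.etree.ElementTree` as the etree package, wraps each sentence in a hypothetical `<root>` element so that a flat sequence of tags becomes a single document, and compares only the nesting and order of tag names.

```python
import xml.etree.ElementTree as ET


def parse(text):
    """Parse a tagged sentence; return None if it is not well-formed XML."""
    try:
        return ET.fromstring(f"<root>{text}</root>")
    except ET.ParseError:
        return None


def skeleton(node):
    """Tag structure of a tree, ignoring all text content."""
    return (node.tag, tuple(skeleton(child) for child in node))


def structure_accuracy(hyp, ref):
    """Return (correct, match) for one hypothesis/reference pair."""
    hyp_tree, ref_tree = parse(hyp), parse(ref)
    correct = hyp_tree is not None
    match = (correct and ref_tree is not None
             and skeleton(hyp_tree) == skeleton(ref_tree))
    return correct, match


if __name__ == "__main__":
    ref = "<b> Jay Chou </b> sang the song <i> Orange Jasmine </i> ."
    print(structure_accuracy("<i> Orange Jasmine </i> by <b> Jay Chou </b>", ref))
    print(structure_accuracy("<b> Jay Chou </i> sang", ref))
```

The first call illustrates an output that is valid XML but orders the tags differently from the reference, so it counts as Correct without counting as Match.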

Main Results
Template Accuracy We first examine the accuracy of the generated templates. A generated template is correct if
• the template is a valid XML structure;
• the template recalls all the markup tags of the input sentence.
The template accuracy of our method is 100% in all four language pairs. As in lexically constrained translation, the model may omit some free token nonterminals (i.e., $Y_n$) when generating the derivation $f$; the omission ratios are 0.4%, 0.6%, 0.1%, and 0.9% in English-French, English-Russian, English-Chinese, and English-German, respectively. We use empty strings for the omitted nonterminals when reconstructing the output sentence.
Table 4 shows the results of all the involved methods. Our approach can improve the BLEU score over the three baselines, and the structure correctness is 100%. Although Split-Inject can also guarantee the correctness of the output, its BLEU score is much lower, possibly because some fragments are translated without essential context. The structure match accuracy with respect to the given reference is not necessarily 100%, since the order of markup tags can be diverse due to the variety of natural language. See Table 9 in the Appendix for some translation examples.

Conclusion
In this work, we propose a template-based framework for constrained translation and apply the framework to two specific tasks, which are lexically and structurally constrained translation. Our motivation is to decompose the generation of the whole sequence into the arrangement of constraints and the generation of free tokens, which can be learned through a sequence-to-sequence framework. Experiments demonstrate that the proposed method can achieve high translation quality and match accuracy simultaneously and our inference speed is comparable with unconstrained NMT baselines.

Limitations
A limitation of this work is that our method cannot cope with one-to-many constraints (e.g., ⟨bank, 河岸|银行⟩). Moreover, we only validate the proposed template-based framework in machine translation tasks. However, constrained sequence generation is vital in many other NLP tasks, such as table-to-text generation (Parikh et al., 2020), text summarization (Liu et al., 2018), and text generation (Dathathri et al., 2020). In the future, we will apply the proposed method to more constrained sequence generation tasks.

A.1 More Details on Data
For the lexically constrained translation task, Chinese sentences are segmented by Jieba, while English and German sentences are tokenized using Moses (Koehn et al., 2007). The tokenized sentences are then processed by BPE (Sennrich et al., 2016) with 32K merge operations for both language pairs. We detokenize the model outputs before calculating BLEU with sacreBLEU.

A.2 More Details on Model
We adopt Transformer (Vaswani et al., 2017) as our NMT model. For English-Chinese, we use the base model, whose depth is 6, and the width is 512. For English-German, we use the big model, whose depth is 6, and the width is 1024. The base and big models are optimized using the corresponding learning schedules introduced in Vaswani et al. (2017). We train base models for 200K iterations using 4 NVIDIA V100 GPUs and train big models for 300K iterations using 8 NVIDIA V100 GPUs. Each mini-batch contains approximately 32K tokens in total. All the models are optimized using Adam (Kingma and Ba, 2015), with β 1 = 0.9, β 2 = 0.98 and ϵ = 10 −9 . In all experiments, both the dropout rate and the label smoothing penalty are set to 0.1. The beam size is set to 4.

A.3 Effect of Alignment Model
In this work, we use an alignment model to produce word alignments for the training set, which is then used for phrase table extraction. By default, we use all the parallel data in the training set to train the alignment model, using the fast-align toolkit. To better understand the effect of the alignment model, we replace the default alignment model with a weaker one that is trained using only 0.1M sentence pairs. Table 5 shows the result, from which we find that using the weaker word alignment can negatively affect the BLEU score. However, the exact match accuracy is still 100%, and changes in the other two metrics are modest.

A.4 Domain Robustness
Domain robustness is about the generalization of machine learning models to unseen test domains (Müller et al., 2020). All the involved models are trained in the news domain. We evaluate the domain robustness of these methods on the WMT21 terminology translation task (Alam et al., 2021b), which lies in the COVID-19 domain. Since this task does not support English-German translation, we only conduct this experiment on English-Chinese. In this test set, the maximum number of constraints is 12. We thus modify the phrase extraction script to increase the maximum number of constraints from 3 to 12, and then re-train both the baselines and our models. Note that we only change the number of constraints, while the training domain is still news. Since the open-sourced implementation of AttnVector (Wang et al., 2022) does not support more than 3 constraints, we omit this baseline in this experiment. The test set of the WMT21 terminology translation task also contains some constraints that consist of more than one target term (i.e., one-to-many constraints). We only select the one that appears in the reference as our constraint. We leave it to future work to extend the current framework to one-to-many constraints.
Some baselines achieve much lower exact match accuracy due to the domain shift. The BLEU score of VDBA is lower than that of the other constrained translation approaches, while our method achieves the best BLEU score. The exact match accuracy of TextInfill (Xiao et al., 2022) is lower than 100% because sometimes the model cannot generate all the slots within the length limit. The results indicate that our approach can better cope with constraints coming from unseen domains.

A.5 X-English Translation
We also conduct experiments on X-English translation directions (i.e., Chinese-English and German-English). Due to the limitation of computational resources, we only train the two most recent baselines: AttnVector (Wang et al., 2022) and TextInfill (Xiao et al., 2022). Moreover, AttnVector and TextInfill achieve the best BLEU score and exact match accuracy, excluding our approach, respectively. As shown in Table 7, we find that our approach performs well in both Chinese-English and German-English, achieving 100% exact match accuracy and a better BLEU score.

A.6 Case Study
As mentioned in Section 4.2, our approach outperforms the baselines in the lexically constrained translation task. To better understand the difference between our approach and some representative baselines, we list some examples in Table 8.

B.1 More Details on Model
All the models are trained for 40K iterations in all the four translation directions. We adopt the cosine learning rate schedule presented in Wu et al. (2019), but we set the maximum learning rate to 7 × 10 −4 and the warmup step to 8K. The period of the cosine function is set to 32K, which means that the learning rate decays into the minimum value at the end of the training. Both the dropout rate and the label smoothing penalty are set to 0.2. Each mini-batch consists of approximately 32k tokens in total. We use Adam (Kingma and Ba, 2015) for model optimization, with β 1 = 0.9, β 2 = 0.98 and ϵ = 10 −9 . We also set the weight decay coefficient to 10 −3 . Both the baseline models and our models are trained using the same hyperparameters.
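The schedule described above can be written down as a small function. This is a sketch under stated assumptions rather than the exact schedule of Wu et al. (2019): the warmup is assumed to be linear starting from zero, and the minimum learning rate `min_lr` is a hypothetical parameter set to 0 here because its value is not given in the text.

```python
import math


def learning_rate(step, max_lr=7e-4, warmup=8000, period=32000, min_lr=0.0):
    """Linear warmup followed by a single cosine decay.

    max_lr, warmup and period follow the values stated in the text;
    min_lr and the linear shape of the warmup are assumptions.
    """
    if step < warmup:
        return max_lr * step / warmup
    # Fraction of the cosine period that has elapsed, clipped to [0, 1].
    progress = min(step - warmup, period) / period
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, 4000, 8000, 24000, 40000):
        print(step, learning_rate(step))
```

With 8K warmup steps and a 32K-step cosine period, the decay finishes exactly at step 40K, which agrees with the statement that the learning rate reaches its minimum at the end of training.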

B.2 Case Study
We list some translation examples in Table 9 to provide a detailed understanding of our work. The examples demonstrate that our approach can effectively cope with structured inputs.
Constraints: ⟨guests, 来宾⟩; ⟨culinary culture, 食品文化⟩; ⟨Chinese-style, 中式⟩
Source: Wang Kaiwen , Chinese ambassador to Latvia , introduced to the guests a few major styles of cooking in Chinese gourmet foods and expressed his hope that through tasting Chinese-style gourmet foods more will be learned about China and Chinese culinary culture.
Table 8: Translation examples for the lexically constrained translation task. We compare our approach with AttnVector (Wang et al., 2022) and TextInfill (Xiao et al., 2022) since they achieve the best BLEU score and the highest exact match accuracy, respectively, excluding our approach. In the first example, AttnVector omits the target constraint 食品文化 in its output, while both TextInfill and our approach can generate all three constraints. In the second example, TextInfill places the constraint 吉曾柯 in the wrong context, while our approach outputs a better result.