Disambiguated Lexically Constrained Neural Machine Translation

Lexically constrained neural machine translation (LCNMT), which controls translation generation with pre-specified constraints, is important in many practical applications. Current approaches to LCNMT typically assume that the pre-specified lexical constraints are contextually appropriate. This assumption limits their application to real-world scenarios, where a source lexicon may have multiple target constraints and disambiguation is needed to select the most suitable one. In this paper, we propose disambiguated LCNMT (D-LCNMT) to solve this problem. D-LCNMT is a robust and effective two-stage framework that first disambiguates the constraints based on context, then integrates the disambiguated constraints into LCNMT. Experimental results show that our approach outperforms strong baselines, including existing data augmentation based approaches, on benchmark datasets, and comprehensive experiments in scenarios where a source lexicon corresponds to multiple target constraints demonstrate the superiority of our approach in constraint disambiguation.


Introduction
Lexically constrained neural machine translation (LCNMT) is a task that guarantees the inclusion of specific lexicons in the translation, which is of great importance in many applications such as interactive translation with user-given lexicon constraints (Koehn, 2009) and domain adaptation with pre-specified terminology constraints (Hasler et al., 2018). Accurate lexicon translation plays a key role in improving translation quality. However, in real-world applications, a source lexicon often has multiple translation constraints, which are provided by a specific database and represent different but core concepts. It is essential for a translation model to select the most contextually appropriate constraint and force it to appear in the translation, but such a constraint disambiguation process is largely ignored in previous LCNMT research. Previous works simply use the aligned target lexicons that appear in the translation reference of a given source sentence as the constraints, bypassing the constraint ambiguity problem (Dinu et al., 2019; Song et al., 2019; Wang et al., 2022a,b). In this paper, we propose disambiguated LCNMT (D-LCNMT) to solve the constraint ambiguity problem when facing a source sentence, and investigate how to integrate the disambiguated constraints into NMT. Figure 1 presents an example of the constraint ambiguity problem. Table 1 presents the frequency of the problem in the validation sets, showing that ambiguous constraints account for more than half of the total constraints. Despite the severity of the problem, it is overlooked by most LCNMT research, which only uses gold constraints. The problem was brought into the spotlight only at the recent WMT2021 shared task on machine translation using terminologies, where a source terminology has on average 2.22 possible translation constraints.

Figure 1: An example of the constraint ambiguity problem in English-to-Chinese translation. Given a lexical constraint inventory, the lexicon airway has three possible translations as the ambiguous constraints: respiratory tract, airline, and ventiduct, among which respiratory tract is the contextually appropriate one for the input sentence.

Table 1: The frequency of the constraint ambiguity problem in the validation sets of German-to-English (De-En) and English-to-Chinese (En-Zh) translation tasks.
Major works on this task apply a data augmentation approach, which builds synthetic corpora containing ambiguous constraints via code-switching, and trains NMT models to select the most contextually appropriate constraint implicitly (Wang et al., 2021; Ailem et al., 2021).
Instead, our D-LCNMT adopts an explicit two-stage framework that performs constraint disambiguation and integration into NMT sequentially, and outperforms the above data augmentation approaches on benchmark datasets. In particular, at the first stage, we build a constraint disambiguation network based on contrastive learning, so that the correct constraint is selected given the source lexicon and its context in the given source sentence. At the second stage, we integrate the most appropriate constraint obtained in the first stage into NMT with the help of current lexically constrained approaches (Wang et al., 2022a,b). Experiments on disambiguated lexically constrained translation tasks in German-to-English and English-to-Chinese show that our approach significantly outperforms strong baselines, including the data augmentation approach. For lexicons that have multiple possible constraints, our approach achieves state-of-the-art accuracy of constraint disambiguation, and in particular ranks first on the leaderboard of the WMT2021 shared task on machine translation using terminologies. Overall, our contributions are three-fold: 1. We propose D-LCNMT, a robust and effective two-stage framework that first disambiguates the constraints, then integrates them into LCNMT.
2. We propose a continuous encoding space trained with contrastive learning for constraint disambiguation, a problem overlooked by most LCNMT research, which uses gold constraints.

3. Through extensive evaluation and comparison to other approaches, we achieve the best constraint disambiguation accuracy, and maintain or achieve higher sentence-level translation quality.

Related Work
We first introduce LCNMT, then review related research on constraint disambiguation.

LCNMT
LCNMT controls the translation output of an NMT model to satisfy some pre-specified lexical constraints. The lexical constraints are usually provided by users or drawn from dictionaries covering a wide range of topics and domains, and show great value in practical applications. One line of LCNMT studies focuses on designing constrained decoding algorithms (Hasler et al., 2018). For example, Hokamp and Liu (2017) first proposed grid beam search (GBS), which adds an additional dimension for the number of constrained lexicons at each decoding step. Post and Vilar (2018) proposed a dynamic beam allocation (DBA) strategy for constrained decoding, which fixes the beam size and makes it unaffected by the number of constrained lexicons. Hu et al. (2019) then extended it into vectorized dynamic beam allocation (VDBA), which supports batched decoding. Although these constrained beam search methods have high control over the target constraints, they significantly slow down the decoding speed and tend to reduce the fluency of the translation (Hasler et al., 2018).
Another line of studies addresses the problem by augmenting the training data with placeholders or additional translation constraints. Crego et al. (2016) proposed to replace entities with placeholders, which remain in the system output and are placed back through post-processing. Song et al. (2019) replaced the source lexicons with the corresponding target constraints, and Dinu et al. (2019) appended the target constraints right after the corresponding source lexicons. During inference, the target constraints are imposed on the source sentences similarly. The main disadvantage of these methods is that they do not guarantee the appearance of the target constraints in some cases (Chen et al., 2021).
Different from the above decoding and synthetic data approaches, constrained neural network models have also been explored. Song et al. (2020) trained an alignment-enhanced NMT model and conducted alignment-based constrained decoding, but they required alignment labels from external aligners with noisy alignments. Susanto et al. (2020) proposed to invoke constraints using a non-autoregressive approach, although the constraints must appear in the same order as in the reference. Wang et al. (2022b) vectorized source and target constraints into continuous keys and values and integrated them into the NMT model.

Constraint Disambiguation
The above studies on LCNMT assume that the pre-specified lexical constraints are gold ones. For a source sentence, the constraints are simulated by being directly extracted from the target sentence. Such simulation is not practical when a source lexicon has multiple possible translations as constraints, since the target sentence is not known when translating an input source sentence. This ambiguous constraint problem for LCNMT was noticed by researchers at the WMT2021 shared task on machine translation using terminologies, where certain terminologies have multiple possible translations as ambiguous constraints. Ailem et al. (2021) address the problem by selecting terminology translations at random and inserting them as constraints in the source sentence. Wang et al. (2021) propose to augment the source sentence with all possible terminology translations, in contrast to Ailem et al. (2021), who kept only one. These data augmentation methods do not explicitly disambiguate the constraints; they simply train an NMT model to generate correct sentence-level translations given the augmented source sentence. Unlike previous works, we propose an explicit constraint disambiguation module to select the most contextually appropriate constraint.

D-LCNMT
We propose D-LCNMT to solve the ambiguous constraint problem for LCNMT through a two-stage framework. At Stage 1, we introduce a contrastive learning based constraint disambiguation neural network. At Stage 2, we integrate the disambiguated constraints into current competitive LCNMT models (Wang et al., 2022a,b).

Stage 1: Constraint Disambiguation
In a lexical constraint inventory, which is provided by either users or dictionaries, a source lexicon may have multiple possible translations. Let s denote a source lexicon, and let its ambiguous translations be m^(1), ..., m^(K). Constraint disambiguation is needed to select one appropriate translation given s and its source context C_s in the input source sentence.
The constraint disambiguation neural network is shown in Fig. 2. The main goal is to encode the source lexicons with their contexts and the corresponding target-side candidate constraints into the same representation space, so that the source lexicons and their correct constraints are closest neighbors in the space. Briefly, the network consists of a context encoder and a constraint encoder. On the source side, the context encoder captures the semantic information of the lexicons and their contexts at the same time. On the target side, the constraint encoder considers all possible candidate constraints for a source lexicon and encodes each variable-length candidate into a single representation.

Context Encoder and Constraint Encoder
The two encoders are independent of each other. Each consists of two transformer encoder layers stacked with one adaptation layer. For either the source lexicon or its translation constraint, we add a special token [CLS] in front of it. The hidden state of [CLS] output by the encoder is used as its representation.
For a considered source lexicon, we concatenate it with the source sentence using a special token [SEP] and feed the concatenation to the context encoder to obtain the representation of the source lexicon. The structure is shown in Fig. 3. Notably, the source lexicon is masked in the source sentence to let the encoder better encode the context of the lexicon. The positions of the lexicon and the sentence are counted independently. For each translation constraint, we directly feed it to the constraint encoder and take the hidden state of [CLS] as its representation.
In each encoder, the adaptation layer is stacked over the transformer layers to further optimize the hidden state of [CLS]. The adaptation layer consists of two linear transformations with a tanh activation in between (Wang et al., 2022b). Let h_s ∈ R^{d×1} and h_{m^(k)} ∈ R^{d×1} be the two hidden states of [CLS] output by the transformer layers on the source and target side, respectively. The final outputs of the context encoder and the constraint encoder are defined as:

e_s = W_2 tanh(W_1 h_s),  e_{m^(k)} = W_4 tanh(W_3 h_{m^(k)}),  (1)

where each W_• ∈ R^{d×d} is a trainable linear transformation.
Contrastive Learning Contrastive learning can learn effective representations by pulling semantically close neighbors together and pushing apart non-neighbors (Gao et al., 2021; Pan et al., 2021).
We adopt the contrastive objective to train the disambiguation network. For a given parallel sentence pair, we treat the source lexicon s and its translation t in the target sentence as a positive constraint sample, and treat s and its other candidate translations as negative constraint samples. Let e_s and e_t be the representations of s and t, respectively. The training loss for each sentence pair is:

L = - Σ_{n=1}^{N} log [ exp(sim(e_{s_n}, e_{t_n})) / ( exp(sim(e_{s_n}, e_{t_n})) + Σ_{k=1}^{K} exp(sim(e_{s_n}, e_{m_n^(k)})) ) ],  (2)

where sim(·) denotes the cosine function and N is the number of constraints contained in the training parallel sentences. In practice, some source lexicons have too many or too few candidate translations, which may affect the performance of contrastive learning. To address this issue, for each such source lexicon, we randomly select K candidate translations from the predefined inventory as negative samples. If a source lexicon has fewer than K candidate translations, we randomly select other translations from the training batch to complete the K negative samples. During inference, we calculate the cosine similarity between the source representation and each candidate constraint representation, and select the one with the highest cosine similarity as the disambiguated constraint.
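As a minimal sketch of the Stage 1 selection mechanism (toy vectors standing in for the encoder outputs; the actual system uses Transformer-based context and constraint encoders), the contrastive scoring and the inference rule can be illustrated as:

```python
import math

def cosine(u, v):
    # sim(.) from Eq. (2): cosine similarity between two vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def contrastive_loss(e_s, e_pos, e_negs):
    # InfoNCE-style objective: pull the gold constraint representation
    # e_pos toward the source representation e_s, push negatives away.
    scores = [cosine(e_s, e_pos)] + [cosine(e_s, e_n) for e_n in e_negs]
    exp_scores = [math.exp(s) for s in scores]
    return -math.log(exp_scores[0] / sum(exp_scores))

def disambiguate(e_s, candidates):
    # Inference: pick the candidate constraint whose representation is
    # closest to the lexicon-in-context representation.
    return max(candidates, key=lambda c: cosine(e_s, candidates[c]))

# Toy representations (stand-ins for encoder outputs).
e_s = [0.9, 0.1, 0.0]
candidates = {
    "respiratory tract": [0.8, 0.2, 0.1],
    "airline":           [0.1, 0.9, 0.0],
    "ventiduct":         [0.0, 0.2, 0.9],
}
print(disambiguate(e_s, candidates))  # → respiratory tract
```

The loss drives training so that, at inference time, the simple nearest-neighbor rule above selects the contextually appropriate constraint.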

Stage 2: Integrating Disambiguated Constraints into LCNMT
At Stage 2, we choose two recent competitive LCNMT systems, originally developed for integrating gold constraints, to integrate our disambiguated constraints. One is VecConstNMT (Wang et al., 2022b), which is based on constraint vectorization and outperforms several strong baselines. However, we found that VecConstNMT fails to copy long constraints integrally due to its word-by-word generation nature. To address this issue, we propose an integrity loss and a decoding strategy to ensure the appearance of long constraints in the translation. The other is template-based LCNMT (Wang et al., 2022a), which achieves high translation quality with a 100% success rate of generating the constraints, so we simply feed the disambiguated constraints directly into the template-based LCNMT.
Integration into VecConstNMT VecConstNMT splits the translation probability into two parts: P_model and P_plug, where P_model is the conventional Transformer probability and P_plug is a probability tailored for the lexical constraints. Suppose a sentence pair ⟨x, y⟩ has N lexical constraints (s_1^N, t_1^N):

P_model(y_i | y_<i, x) = softmax(W^⊤ h_i),  P_plug(y_i = y | y_<i, x) ∝ exp(sim(h_i, w_y)),  (3)

where h_i ∈ R^{d×1} is the hidden state of the i-th step from the last decoder layer, W ∈ R^{d×|V|} is the embedding matrix, and w_y ∈ R^{d×1} is the word embedding of token y. P_plug encourages the similarity between h_i and w_y for tokens inside the constraints. This formulation has a problem with keeping the integrity of long constraints: the cosine similarity between w_y and a hidden state at a wrong position may be too high, causing the token to appear at that position. For long constraints, however, we must ensure that all constraint tokens appear at the correct positions. To address this issue, we propose the integrity loss:

L_int = - Σ_{(i, y)} log [ exp(sim(h_i, w_y)) / Σ_{j=i-C}^{i+C} exp(sim(h_j, w_y)) ],  (4)

where C is the window size and (i, y) ranges over the constraint tokens y and their positions i. For each target token y in the constraints, we use C hidden states from the history and C hidden states from the future as negative examples; the purpose is to prevent y from appearing earlier or later in the translation. Finally, the training objective for VecConstNMT is:

L = L_VecConst + λ L_int,  (5)

where the hyperparameter λ balances the original VecConstNMT loss L_VecConst and the integrity loss.
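Under our reading of the integrity loss, each constraint token is contrasted against the decoder states in a window of size C around its own position, so the token cannot surface earlier or later. A pure-Python sketch over toy vectors (the function names and the averaging are illustrative, not the paper's exact implementation, which operates on real decoder hidden states):

```python
import math

def cos(u, v):
    # cosine similarity between two vectors
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def integrity_loss(hidden, constraint_positions, embed, C=2):
    # hidden: decoder hidden states h_0..h_{T-1}
    # constraint_positions: {position i: constraint token y expected there}
    # embed: {token: word embedding w_y}
    loss = 0.0
    for i, y in constraint_positions.items():
        w_y = embed[y]
        pos = math.exp(cos(hidden[i], w_y))
        # Negatives: up to C states before and C states after position i,
        # so w_y scores highly only at its own position.
        neg = sum(math.exp(cos(hidden[j], w_y))
                  for j in range(max(0, i - C), min(len(hidden), i + C + 1))
                  if j != i)
        loss += -math.log(pos / (pos + neg))
    return loss / max(1, len(constraint_positions))

# h_1 matches w_y (good) vs. the neighbors matching w_y instead (bad)
aligned    = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
misaligned = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
w = {"y": [0.0, 1.0]}
print(integrity_loss(aligned, {1: "y"}, w) < integrity_loss(misaligned, {1: "y"}, w))  # → True
```

The loss is lower when the token's embedding aligns with its own position and not with the neighboring steps, which is exactly the integrity property the paper targets.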
To further ensure the integrity of long constraints, we also propose a gated decoding algorithm (GDA) for inference that does not sacrifice decoding speed. GDA tracks the decoding progress of each constraint and adjusts the translation probability through a gating mechanism. The algorithm is presented in Appendix A.1 due to the space limit.

Integration into The Template-based LCNMT
The template-based LCNMT (Wang et al., 2022a) uses templates to simplify a sentence by disentangling its different parts with different special tags. Formally, given a sentence pair and its N lexical constraints, the template format is:

⟨ X_0 C_1 X_1 ... C_N X_N ,  Y_0 C_{i_1} Y_1 ... C_{i_N} Y_N ⟩,  (6)

where C_1, ..., C_N denote the slots for the source-side constraints in order, and similarly C_{i_1}, ..., C_{i_N} for the target side. C_n and C_{i_n} do not necessarily constitute a phrase pair. The alignment between C_1, ..., C_N and C_{i_1}, ..., C_{i_N} manifests the positional relations between the constraints in the sentence pair. The N lexical constraints divide the sentence pair into N + 1 textual fragments on each side, denoted by the nonterminals X_0, ..., X_N on the source side and Y_0, ..., Y_N on the target side.
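As a toy illustration of slotting constraints into a template (hypothetical tokens and tag format; the actual system derives templates from word alignments over both sides), a sentence with one constraint can be templatized as:

```python
def templatize(tokens, spans, tag="C"):
    # Replace each constraint span (start, end) with a slot tag C_n,
    # leaving the remaining fragments X_0..X_N in place.
    out, prev = [], 0
    for n, (start, end) in enumerate(sorted(spans), 1):
        out.extend(tokens[prev:start])
        out.append(f"[{tag}{n}]")
        prev = end
    out.extend(tokens[prev:])
    return " ".join(out)

src = "the airway of the patient is clear".split()
print(templatize(src, [(1, 2)]))  # → the [C1] of the patient is clear
```

Because generation targets the slot tags rather than the constraint tokens themselves, filling the slots afterwards trivially guarantees that each constraint appears intact.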
The template provides a clear configuration of the sentence pair. Since it reserves slots for the constraints, the template-based LCNMT guarantees the generation of integral long constraints in the translation result. At Stage 2, we directly fill these slots with the disambiguated constraints output by Stage 1.

Experiments
We conduct experiments on German-to-English (De-En) and English-to-Chinese (En-Zh) lexically constrained translation tasks. Different from most works on LCNMT that only use gold constraints, our experiments focus on the more practical scenario where a source lexicon may have multiple candidate constraints.

Lexical Constraints
There are usually two ways to build the lexical constraints. One is the simulation method adopted in most LCNMT research (Chen et al., 2021; Wang et al., 2022a,b): the lexical constraints are simulated by extracting parallel phrases from the parallel sentences in both the training and test sets and randomly selecting some parallel phrases as the lexical constraints. Such simulation is not practical, since parallel sentences are unavailable during testing. In practice, it is common for a source phrase to have multiple possible translations, which constitute the ambiguous constraints. We therefore simulate this practical scenario by collecting all possible translations of a considered source phrase as the ambiguous constraints. We study such simulated constraints in De-En. The other way is the human labeling method. The WMT 2021 shared task on machine translation using terminologies provides manual translations of the source terminologies as the constraints. Compared to the simulation method, which is based on automatic word alignment and phrase extraction, the human labeling method builds lexical constraints of higher quality. We study such human-labeled constraints in En-Zh. Since the human-labeled terminology translation dictionary is too small for En-Zh training, we use the same strategy as the simulation method to extract the constraints in the training set. Following Wang et al. (2022b), the number of constraints for each sentence in the training set is up to 3.
Both the simulated constraints (in the De-En experiment) and the human-labeled constraints (in the En-Zh experiment) exhibit the ambiguity phenomenon, as shown in Table 2. The sentence pairs containing ambiguous constraints account for the majority of the sentence pairs that have constraints, indicating how widespread the ambiguous constraint phenomenon is. We are the first to conduct comprehensive studies on the constraints built in both ways.

Baselines
We compare the proposed framework with the following baseline methods:

• Vanilla: We directly train a Transformer model (Vaswani et al., 2017) to translate, which is an unconstrained baseline.
• Random + Stage2 Vec.: At Stage 1, we randomly select one constraint from the ambiguous constraints for each considered source lexicon. At Stage 2, we inject the constraints of Stage 1 into VecConstNMT (Wang et al., 2022b).
• Most-Fre. + Stage2 Vec.: At Stage 1, for each considered source lexicon, we select its most frequent constraint in the training set as the constraint for VecConstNMT at Stage 2.

• Ambiguous Vec.: We directly feed all constraints for each considered source lexicon into VecConstNMT. This baseline does not explicitly disambiguate the constraints.

• Ambiguous Code-Switch: Similar to Song et al. (2019), we use a synthetic code-switching corpus to train the LCNMT model; the difference is that we use all constraints, separated by [SEP], to replace the corresponding source lexicon.
• TermMind: We use the data augmentation approach of TermMind, the winning system of the WMT2021 machine translation using terminologies task (Wang et al., 2021). It fuses ambiguous constraints into the source sentences with special tags and masks the source lexicons to strengthen the learning of constraints.

Evaluation metrics
The evaluation includes constraint-level and sentence-level metrics. At the constraint level, we use metrics such as exact-match accuracy, which measures the appearance rate of whole constraints in the translation results. At the sentence level, we use case-sensitive SacreBLEU (Post, 2018). Details of the other metrics, including window overlap accuracy, terminology-biased translation edit rate (TERm), and CSR, can be found in Appendix A.2.

Results
Table 3 presents the performance on the De-En and En-Zh test sets. For each language pair, the top part lists the baseline performance, and the bottom part lists the performance of our two-stage approach, Stage1 + Stage2 Vec./Tem. Our approach consistently outperforms the baselines in both language pairs, and leads by a wide margin in the constraint-level evaluations. At the same time, our approach maintains or achieves higher sentence-level SacreBLEU. Regarding the two important constraint-level metrics of exact match and CSR, which reflect the hard and soft accuracy of the constraints appearing in the translation result, our approach generally outperforms the strong baselines, including the strong data augmentation approach TermMind. The improvements are on average nine points in exact match and seven points in CSR. This indicates that our constraint disambiguation is effective, as more accurate constraints are generated in the translation compared to the baselines, leading to a significantly better user experience since the constraints usually carry key information.
The effect of the constraint disambiguation at Stage 1 is shown by comparing our approach to Random + Stage2 Vec./Tem. and Most-Fre. + Stage2 Vec./Tem., which randomly select a constraint or select the most frequent constraint at Stage 1, respectively. No matter whether VecConstNMT or the template-based LCNMT is used at Stage 2, our constraint disambiguation at Stage 1 is consistently better than the two baselines. Furthermore, our two-stage approach with explicit constraint disambiguation at Stage 1 also performs significantly better than the baselines that conduct implicit disambiguation, i.e., Ambiguous Vec., Ambiguous Code-Switch, and TermMind, which simply train a sequence-to-sequence model to implicitly select appropriate constraints from all possible constraints. Comparing VecConstNMT and the template-based LCNMT at Stage 2, the template-based one performs significantly better given the same Stage 1. Besides the constraint-level evaluation, our two-stage approach achieves better SacreBLEU on De-En and En-Zh than all data augmentation based approaches, including Ambiguous Code-Switch and TermMind.
On Ambiguous Constraint Test Sets As shown in Table 2, not all constraints are ambiguous. To strictly investigate the effectiveness of our constraint disambiguation approach, we delete the sentence pairs that do not contain ambiguous constraints from the test sets.

Comparison to WMT 2021 Shared Task Participants We also compare our approach with the systems submitted to the WMT 2021 shared task on machine translation using terminologies in En-Zh.
The systems are ranked according to exact-match accuracy. Table 5 shows that our Stage1 + Stage2 Tem. approach outperforms all participants. It is worth noting that TermMind-Sys2 uses techniques such as back-translation, fine-tuning on pseudo in-domain data, and ensembling to enhance the performance of TermMind, while our approach does not add those techniques and uses only a subset of the training set, indicating the superiority of our approach in constraint disambiguation.

Conclusion
In this paper, we propose an effective two-stage framework for disambiguated lexically constrained neural machine translation (D-LCNMT). Our basic idea is to build a continuous representation space for constraint disambiguation at Stage 1, then inject the disambiguated constraints into the vectorized or template-based LCNMT models at Stage 2. Experiments show that our approach significantly outperforms various representative systems on De-En and En-Zh translation, with clear superiority in constraint disambiguation, a problem that is widespread and important in lexically constrained machine translation.

Limitations
In this paper, we do not specifically discuss morphological or polysemy problems, and we do not develop special strategies for them such as those of Pham et al. (2021) and Emelin et al. (2020). Besides, the simulated lexical constraint dictionary, which is extracted from the parallel sentences of the training set based on automatic word alignment, may differ from a real lexical constraint dictionary provided by users.

Ethics Statement
D-LCNMT is designed as a machine translation system that better serves user pre-specified translation constraints. It can handle ambiguous constraints, which are widespread but neglected in most LCNMT research. We believe that D-LCNMT will enhance user experience in machine translation services. In addition, the datasets used in our experiments are freely released data from WMT shared tasks.

A Appendix
A.1 GDA Our algorithm uses next-tokens to record the translation progress of all constraints; each entry points to the first ungenerated token of a constraint. At step i of decoding, we check whether the token generated at step i − 1 is in next-tokens; if it is and the corresponding constraint is not yet fully generated, we update next-tokens and set the probabilities corresponding to the updated next-tokens in P_plug to 1. Unlike GBS, our method is not fully enforced: we use a gating mechanism to balance P_model and P_plug. More importantly, our method does not hurt the decoding speed.
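A schematic of the next-tokens bookkeeping described above (a simplified stand-alone sketch; the real GDA runs inside batched beam search over subword ids and gates P_plug against P_model rather than forcing the output):

```python
def gda_step(constraints, next_tokens, prev_token, p_plug):
    # constraints: list of constraints, each a list of target tokens
    # next_tokens: per-constraint index of the first ungenerated token
    # prev_token: the token emitted at decoding step i-1
    # p_plug: token -> plug probability for step i, updated in place
    for c, tokens in enumerate(constraints):
        k = next_tokens[c]
        # The previous token matched this constraint's next expected token
        # and the constraint is not fully generated yet.
        if k < len(tokens) and tokens[k] == prev_token:
            next_tokens[c] = k + 1
            if next_tokens[c] < len(tokens):
                # Boost the following constraint token at step i so the
                # constraint is copied integrally and in order.
                p_plug[tokens[next_tokens[c]]] = 1.0
    return next_tokens, p_plug

nt, pp = gda_step([["respiratory", "tract"]], [0], "respiratory", {})
print(nt, pp)  # → [1] {'tract': 1.0}
```

Because the update is a constant-time scan over the active constraints per step, this tracking adds no measurable decoding overhead.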

A.2 Metrics
At the constraint level, we adopt the metrics used in the WMT2021 machine translation using terminologies task, including exact-match accuracy, window overlap accuracy, and terminology-biased translation edit rate (TERm). The exact-match accuracy measures the appearance rate of whole constraints in the translation results. The window overlap accuracy measures the positional accuracy of the constraints within a context window. TERm is an edit distance based metric for measuring translation quality, especially tailored for the constraints. Details of these metrics can be found in Anastasopoulos et al. (2021). In addition, following previous works (Chen et al., 2021; Wang et al., 2022b), we use the percentage of constraints that are successfully generated in the translation as CSR, which differs from exact-match in that it does not require matching of the whole constraint. At the sentence level, we report case-sensitive SacreBLEU (Post, 2018) to evaluate translation quality.
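As a rough sketch of the difference between the two constraint-level accuracies (whitespace tokenization only; the official evaluation uses the WMT task scripts), exact-match requires the whole phrase to appear contiguously, while CSR credits individual constraint tokens:

```python
def exact_match(constraints, hypothesis):
    # Fraction of constraints that appear as whole contiguous phrases.
    hyp = " " + " ".join(hypothesis.split()) + " "
    hit = sum(1 for c in constraints if f" {c} " in hyp)
    return hit / len(constraints)

def csr(constraints, hypothesis):
    # Copy success rate: fraction of constraint *tokens* present in the
    # hypothesis, without requiring the whole phrase to match.
    hyp_tokens = set(hypothesis.split())
    total = hit = 0
    for c in constraints:
        for tok in c.split():
            total += 1
            hit += tok in hyp_tokens
    return hit / total

hyp = "the patient s respiratory tract was clear"
print(exact_match(["respiratory tract", "airline"], hyp))  # → 0.5
print(csr(["respiratory tract", "airline"], hyp))          # → 2/3
```

This is why CSR can stay high on long constraints whose exact-match collapses: every token may be generated, just not consecutively.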

A.3 Model Configuration
Our models are implemented with the Fairseq library. At Stage 1, the hidden vector dimension of the disambiguation network is set to 512. During training, we use Adam to optimize our models with β_1 = 0.9, β_2 = 0.98, and ϵ = 10^{-9}. The maximum learning rate is 0.0007 and the warmup step is 4000. For each source lexicon, we randomly select 5 candidate translations as negative samples. At Stage 2, following previous works (Wang et al., 2022a,b), both VecConstNMT and the template-based LCNMT consist of 6 encoder layers and 6 decoder layers, and the hidden size is 512. Each multi-head attention module has 8 individual attention heads.
During training, the learning strategy is the same as at Stage 1. We set all dropout rates to 0.1. The hyperparameter λ, the weight of L_int in VecConstNMT, is set to 1, and the window size C is fixed to 5. During inference, the beam size is set to 4.

A.4 The Effects of the Integrity Loss and GDA for VecConstNMT
Although VecConstNMT outperforms several strong baselines (Wang et al., 2022b), we found that the exact-match accuracy of the original VecConstNMT model decreases dramatically as the constraint length increases. Table 7 shows the translation accuracy of constraints of different lengths. We report CSR, which reflects the performance on constraints at the word level, and exact-match, which reflects the performance on constraints at the whole-phrase level. Exact-match decreases sharply as the constraint length increases, while CSR remains stable. This indicates that the original VecConstNMT can translate long constraints at the word level, but cannot make the constrained tokens appear consecutively in the correct order.
To address this problem, we propose the integrity loss during training and GDA during inference. To verify their effect on the assurance of long constraints, we experiment using only the gold constraints in De-En translation. As shown in Table 7, the longer the constraint, the more significant the improvement in exact-match. Compared to the original VecConstNMT, our model achieves 10-point improvements on constraints of length 2, and nearly 20-point improvements on constraints longer than 2. When we do not use the integrity loss, exact-match on lengths < 4 decreases more significantly than when not using GDA, indicating the effectiveness of the integrity loss.
A.5 The effect of Adding VDBA to Stage2 Vec.
Following Wang et al. (2022b), we add VDBA to the decoding of our Stage2 Vec. VDBA dynamically devotes part of the beam to constraint-related hypotheses at inference time, achieving high exact-match of the constraints. Table 8 shows the results of adding VDBA in De-En translation. Exact-match accuracy is significantly improved when VDBA is added. Our Stage1 + Stage2 Vec. + VDBA achieves the best exact-match accuracy, but the window overlap metric and 1-TERm drop by 2.1 and 5.3 compared to Stage1 + Stage2 Vec., respectively. In addition, the introduction of VDBA seriously harms SacreBLEU and slows down decoding. Although adding VDBA improves exact-match, it is thus harmful to sentence-level performance and constraint-level context performance.

A.6 Disambiguation Accuracy
We test the disambiguation accuracy on the De-En and En-Zh language pairs. The results are shown in Table 9, where "All" denotes all constraints in the original test sets and "Ambiguous" denotes the constraints whose source terminology has multiple translation candidates. In Table 9, "Random" and "Most-Fre." denote selecting target translations at random and selecting the translations with the highest frequency, respectively. We do not report the disambiguation accuracy of the data augmentation based methods, which cannot yield explicit disambiguation results from which to compute accuracy. The experimental results demonstrate that our method outperforms all baselines by a significant margin, especially on the ambiguous constraint test sets.

Figure 2 :
Figure 2: The constraint disambiguation neural network. Given the source lexicon airway and its context shown on the right, the framework selects the correct constraint respiratory tract from all three candidate constraints by building a common representation space for the source and target sides, as shown in the middle.

Figure 3 :
Figure 3: The structure of the context encoder and the constraint encoder.

Table 2 :
Number of sentence pairs in each dataset. 'Constrained' denotes the sentence pairs that contain constraints, and 'Amb. Constrained' denotes the sentence pairs that contain ambiguous constraints.

Table 3 :
Main Results on De-En and En-Zh test sets.

Table 4 :
Results on Ambiguous Constraint Test Sets of De-En and En-Zh.

Table 5 :
Comparison between our approach and the WMT2021 Machine Translation Using Terminologies Shared Task participants.

Table 4 shows SacreBLEU and Exact Match on these new test sets. Full scores are presented in Table 6 in the appendix. The results exhibit the same trend as Table 3, with a clear advantage of our approach over various baselines, especially in constraint-level Exact Match. Our two-stage approach is effective in producing correct constraints, performing much better than the implicit disambiguation approaches of Ambiguous Vec./Code-Switch and TermMind.

Table 6 :
Results on Ambiguous Constraint Test Sets of De-En and En-Zh.

Table 7 :
Ablation study on VecConstNMT with different length constraints.

Table 8 :
The results of adding VDBA to Stage2 Vec.

Table 9 :
The disambiguation accuracy of different methods.