Improving Lexically Constrained Neural Machine Translation with Source-Conditioned Masked Span Prediction

Accurate terminology translation is crucial for ensuring the practicality and reliability of neural machine translation (NMT) systems. To address this, lexically constrained NMT explores various methods to ensure that pre-specified words and phrases appear in the translation output. However, in many cases, those methods are studied on general-domain corpora, where the terms are mostly uni- and bi-grams (>98%). In this paper, we instead tackle a more challenging setup consisting of domain-specific corpora with much longer n-grams and highly specialized terms. Inspired by the recent success of masked span prediction models, we propose a simple and effective training strategy that achieves consistent improvements on both terminology and sentence-level translation for three domain-specific corpora in two language pairs.


Introduction
Despite the recent success of neural machine translation (NMT) (Wu et al., 2016; Johnson et al., 2017; Barrault et al., 2020), producing correct terms in the translation output remains a challenge, and it is a vital component of high-quality translation. This concern becomes more salient in domain-specific scenarios, such as legal documents, where generating correct and consistent terminology is key to ensuring the practicality and reliability of machine translation (MT) systems (Chu and Wang, 2018; Exel et al., 2020).
To address this, lexically constrained NMT works have proposed various methods to preserve terminology in translations as lexical constraints, with or without the help of a term dictionary at test time. In most lexically constrained NMT setups, the datasets and terms used for training and evaluating the methods are extracted from WMT news corpora (Dinu et al., 2019; Susanto et al., 2020; Chen et al., 2020). Since the terms, regardless of their source, can only be utilized as long as they exist in the corpus, the term coverage depends solely on the choice of the corpus. By analyzing the previous setups carefully, we discover that the terms found in WMT are mostly uni- or bi-grams (see Figure 1) and highly colloquial (see Table 1 for the top 10 most frequent terms). This raises the question of whether the previous methods are effective in domain-specific scenarios where accurate terminology translation is truly vital.
In this paper, inspired by the recent masked span prediction models, which have demonstrated an improved capability for learning representations of contiguous words (Song et al., 2019b; Joshi et al., 2019; Lewis et al., 2020; Raffel et al., 2020), we propose a simple yet effective training scheme to improve terminology translation in highly specialized domains. We specifically select two highly specialized domains (i.e., law and medicine) that contain domain-specific terminology to address more challenging and realistic setups, and we apply the scheme to both typologically similar and dissimilar language pairs (German-English (De→En) and Korean-English (Ko→En)). Thanks to its simplicity, the proposed method is compatible with any autoregressive Transformer-based model, including ones capable of utilizing term dictionaries at training or test time. In domain-specific setups where longer n-gram terms are pervasive, our method demonstrates improved performance over the standard maximum likelihood estimation (MLE) approach in terms of terminology and sentence-level translation quality. Our code and datasets are available at https://github.com/wns823/NMT_SSP.

Background
Lexically constrained NMT. We can group lexically constrained NMT methods into two streams: hard and soft. The hard approaches aim to force all terminology constraints to appear in the generated output. These methods include replacing constraints (Crego et al., 2016), constrained decoding (Hokamp and Liu, 2017; Chatterjee et al., 2017; Post and Vilar, 2018; Hasler et al., 2018), and additional attention heads for external supervision (Song et al., 2020). Although those approaches are reliable and widely used in practice, they typically require a pre-specified term dictionary and an extra candidate selection module if there are multiple matching candidates for a single term (see caption in Table 2).
Several soft methods address this problem without the help of a term dictionary, one of which is training on both constraint pseudo-labeled (with statistical MT) and unlabeled data (Song et al., 2019a). More recently, Susanto et al. (2020) and Chen et al. (2020) proposed methods that do not assume any word alignment or dictionary supervision at training time to handle unseen terms at test time. Because of their flexibility, we choose them as our baselines. As discussed in Section 1, most previous methods are trained and evaluated on general-domain corpora. In this work, we instead tackle highly specialized domain-specific corpora such as law and medicine, where the terms are much longer and often rare.
Domain-specific NMT. Another line of research related to our problem is domain-specific NMT, where difficulties arise from both a limited amount of parallel data and specialized lexicons. Similar to the hard approaches in lexically constrained NMT, several works rely on domain-specific dictionaries (Zhang and Zong, 2016a; Hu et al., 2019; Thompson et al., 2019; Peng et al., 2020) when generating translations, but they are also prone to the same issues. Other domain-specific NMT methods include unsupervised lexicon adaptation (Hu et al., 2019), synthetic parallel data generation with monolingual data (Sennrich et al., 2016a), and multi-task learning that combines language modeling and translation objectives (Gulcehre et al., 2015; Zhang and Zong, 2016b; Domhan and Hieber, 2017). Our method is a form of multi-task learning that uses both the source and target language text for an additional task, whereas the previous works mostly use only the target language text.
Span-based Masking. Span-based masking trains a model to predict spans of masked tokens, as opposed to the individual token predictions in BERT (Devlin et al., 2019). With this training objective, the model showed improved performance on span-level tasks including question answering and coreference resolution (Joshi et al., 2019). Concurrently, autoregressive sequence-to-sequence pre-trained models also adopted span-based masking as their objective and demonstrated its effectiveness in many downstream tasks (Song et al., 2019b; Lewis et al., 2020; Raffel et al., 2020). Like these models, our training scheme takes advantage of autoregressive span-based prediction, but we condition on both the source language and the previous non-masked target language tokens.

Source-Conditioned Masked Span Prediction
We posit that adopting auxiliary span-level supervision in generation can benefit both short and long terminology as well as sentence-level translation. We therefore propose an extra span-level prediction task in translation, namely source-conditioned masked span prediction (SSP). Unlike the recent sequence-to-sequence pre-trained models (Song et al., 2019b; Lewis et al., 2020; Raffel et al., 2020), our approach applies span masking only on the target side. By conditioning on the full context of the source language and the previous non-masked target language tokens (due to autoregressive decoding), the model is forced to predict the spans of missing tokens given complete information in the encoder and partial information in the decoder.
Span masking. We follow the masking procedure proposed in SpanBERT (Joshi et al., 2019): we first sample the span lengths from a clamped geometric distribution (p = 0.2, max = 10) and then replace 80% of the masked tokens with [MASK] and 10% with random tokens, leaving the remaining 10% unchanged. We set the corruption ratio to 50%.
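The masking procedure can be illustrated with a minimal Python sketch. It is not the released implementation: the function names, the `vocab` argument, and the choice to apply the 80/10/10 rule per token (rather than per span) are assumptions made for illustration, as is the way spans are placed uniformly at random until the 50% budget is reached.

```python
import random

MASK = "[MASK]"

def sample_span_length(rng, p=0.2, max_len=10):
    """Draw a span length from a geometric distribution clamped at max_len."""
    length = 1
    while length < max_len and rng.random() >= p:
        length += 1
    return length

def corrupt_target(tokens, vocab, ratio=0.5, rng=None):
    """Return (corrupted tokens, mask flags) for one target sentence.

    Assumption: spans are placed uniformly at random and may overlap;
    masking stops once round(len(tokens) * ratio) positions are selected.
    """
    rng = rng or random.Random()
    n = len(tokens)
    if n == 0:
        return [], []
    budget = max(1, round(n * ratio))
    flags = [0] * n
    while sum(flags) < budget:
        # Never select more positions than the remaining budget allows.
        length = min(sample_span_length(rng), budget - sum(flags))
        start = rng.randrange(0, n - length + 1)
        for i in range(start, start + length):
            flags[i] = 1
    corrupted = list(tokens)
    for i, flagged in enumerate(flags):
        if not flagged:
            continue
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK                # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)   # 10%: replace with a random token
        # remaining 10%: keep the original token
    return corrupted, flags
```

The returned flags play the role of the mask indicators: a flag of 1 marks a position whose original token the decoder has to predict.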
Multi-task Learning. As our training scheme consists of two objectives (i.e., translation and masked span prediction), we define the total training objective as follows. Let $\theta$ be the model parameters and $C$ be the term-matched corpus, in which each sentence contains at least one term. The first objective, translation, is to maximize the conditional likelihood of $y$:

$$\mathcal{L}_{\mathrm{MLE}}(\theta) = \sum_{(x, y) \in C} \sum_{t=1}^{T} \log p(y_t \mid y_{<t}, x; \theta), \quad (1)$$

where $y = (y_1, \ldots, y_T)$ is the target ground-truth (GT) sentence with length $T$ and $x = (x_1, \ldots, x_S)$ is the source sentence with length $S$. For the SSP objective, we first corrupt random spans of $y$ until the corruption ratio is reached, resulting in $\tilde{y}$. Then we autoregressively predict the masked tokens $\bar{y}$ while conditioning on both $\tilde{y}$ and $x$:

$$\mathcal{L}_{\mathrm{SSP}}(\theta) = \sum_{(x, y) \in C} \sum_{t=1}^{T} m_t \log p(\bar{y}_t \mid \tilde{y}_{<t}, x; \theta), \quad (2)$$

where $m_t = 1$ means $y_t$ is masked. Finally, we simultaneously optimize the joint objective:

$$\mathcal{L}(\theta) = \mathcal{L}_{\mathrm{MLE}}(\theta) + \gamma \, \mathcal{L}_{\mathrm{SSP}}(\theta), \qquad \tilde{y} = \mathcal{C}(y), \quad (3)$$

where $\mathcal{C}$ is a span-level corrupter and $\gamma$ is a task coefficient that weights the relative contribution of SSP.

Table 2: Statistics of the filtered corpus and matched terms. Note that the numbers of unique terms in the source (SRC) and target (TRG) languages are not the same. For instance, "Arzneimittel" can translate into multiple forms ("pharmaceutical products", "drug", "medicinal product", etc.) depending on the context.

[Table 3 caption, beginning lost in extraction] ... (3)) outperforms other methods in most cases. Note that GU19 is a non-autoregressive model, so our proposed method is not applicable to it. Higher Term% and LSM-3 mean better performance.
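As a rough illustration of how the two objectives are combined for a single sentence pair, the following Python sketch sums the translation log-probabilities and adds the γ-weighted log-probabilities of the masked positions only. The function name, the list-based interface, and the default value of `gamma` are hypothetical; the per-token log-probabilities are assumed to come from two decoder passes, one over the clean target and one over the corrupted target.

```python
def joint_objective(mle_logprobs, ssp_logprobs, mask_flags, gamma=0.5):
    """Combine translation and masked-span log-likelihoods for one sentence.

    mle_logprobs: log p(y_t | y_<t, x) for every target position.
    ssp_logprobs: log-probability of the original token at every position
                  when decoding from the corrupted target.
    mask_flags:   1 where the target token was masked, 0 elsewhere.
    gamma:        task coefficient (the default here is an arbitrary placeholder).
    """
    mle = sum(mle_logprobs)
    # Only masked positions contribute to the span prediction term.
    ssp = sum(lp for lp, m in zip(ssp_logprobs, mask_flags) if m == 1)
    return mle + gamma * ssp
```

In training, a quantity like this would be maximized (or its negation minimized) over the whole term-matched corpus.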
[...] Table 3. We argue that providing terms at test time is indeed helpful for terminology generation, but it can often hinder the generation of fluent text. This becomes more apparent in our non-autoregressive setup.
[...] one term translates into multiple terms, we consider all possible pairs to maximize the number of sentence and term matches.
To avoid trivial matches between the parallel sentences and terms, we filtered out terms that are shorter than four characters or longer than 20 grams. Sentences that do not contain any term are also removed. The statistics of the datasets are reported in Table 2. More details about the preprocessing steps are in Appendix A.1.
For data splitting, we developed a new algorithm that keeps the distribution of n-grams the same across the data splits. We use 3,000 sentences each for the validation and test sets in case of high redundancy in certain corpora, while previous works that utilize OPUS use only 2,000 (Koehn and Knowles, 2017; Müller et al., 2020). It is important to note that all the sentences in our data splits are matched with domain-specific terms (i.e., at least one term exists in each sentence), following the style of Dinu et al. (2019). The pseudo-code for the terminology-aware data split algorithm is in Appendix B.
Baselines. We evaluate our method on two recent lexically constrained NMT models of different natures: autoregressive (Chen et al., 2020) and non-autoregressive (Susanto et al., 2020). Both can operate with or without a term dictionary at test time. We refer to them as CHEN20 and SUSANTO20, respectively. +SSP indicates models trained with our proposed training scheme, while no indication means the standard MLE method. A base Transformer (Vaswani et al., 2017), denoted as VASWANI17, and a Levenshtein Transformer (Gu et al., 2019), denoted as GU19, are also reported to compare the relative performance between models. SUSANTO20 without a dictionary is equivalent to GU19.
Evaluation. We use SacreBLEU (Post, 2018) for measuring translation quality. For terminology translation, we use the term usage rate for both short (≤2-gram) and long (>2-gram) terms. Term usage rate (Term%) is the number of generated terms divided by the total number of terms (Dinu et al., 2019; Susanto et al., 2020). Specifically for evaluating long terms, we report both the macro and micro averages due to the heavy-tailed nature of n-grams. In addition, although exact term translation is the primary objective for terminology translation, exact matching is harsh, so evaluating models only with Term% may not fully describe the models' behavior. Therefore, we also evaluate each model in terms of partial n-gram matches, which is explained in the next paragraph. All evaluations are conducted with a beam size of 5.
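The term usage rate can be sketched in a few lines of Python. This is an illustrative version only: whitespace tokenization, exact token-level matching, and the function names are assumptions, and only the micro average is computed because the grouping used for the macro average is not spelled out here.

```python
def contains_ngram(tokens, ngram):
    """True if `ngram` occurs as a contiguous run inside `tokens`."""
    n = len(ngram)
    return n > 0 and any(tokens[i:i + n] == ngram
                         for i in range(len(tokens) - n + 1))

def term_usage_rate(hypotheses, reference_terms):
    """Micro-averaged Term% for short (<=2-gram) and long (>2-gram) terms.

    hypotheses:      list of generated sentences (strings).
    reference_terms: list of lists; the GT target terms of each sentence.
    """
    hits = {"short": 0, "long": 0}
    totals = {"short": 0, "long": 0}
    for hyp, terms in zip(hypotheses, reference_terms):
        hyp_tokens = hyp.split()
        for term in terms:
            term_tokens = term.split()
            bucket = "short" if len(term_tokens) <= 2 else "long"
            totals[bucket] += 1
            if contains_ngram(hyp_tokens, term_tokens):
                hits[bucket] += 1
    return {b: (100.0 * hits[b] / totals[b] if totals[b] else 0.0)
            for b in totals}
```

For example, a hypothesis that reproduces "public officer" but misses a three-word term counts fully toward the short bucket and not at all toward the long one, which is the harshness that motivates a partial-match metric.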
Partial N-gram Match. Inspired by the longest common substring problem (Gusfield, 1997), we devise a partial n-gram match score for evaluating long terminology: the longest sub n-gram match (LSM) score. Formally, let the generated target sentence be $\hat{y} = (\hat{y}_1, \ldots, \hat{y}_T)$ and the matched terms for the target ground-truth (GT) sentence $y$ be $\bigcup_{i=1}^{N} (y_{i1}, \ldots, y_{il})$, where $N$ is the number of GT terms in $y$ and $l$ is the n-gram length of the $i$-th term. Then, for each term, LSM is the length of its longest n-gram overlap with $\hat{y}$ divided by $l$. Because overlaps occur too often at the uni- and bi-gram levels, we only calculate LSM for long terminology: an overlap must be at least 3 grams long to count, and the score is zero otherwise. We therefore denote the metric as LSM-3.
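A per-term LSM-3 score along these lines can be sketched with the standard dynamic program for the longest common substring, applied to tokens instead of characters. Whitespace tokenization and the function names are assumptions, and how per-term scores are aggregated over a test set is not covered by this sketch.

```python
def longest_common_ngram(term_tokens, hyp_tokens):
    """Length of the longest contiguous token run shared by both sequences."""
    best = 0
    prev = [0] * (len(hyp_tokens) + 1)
    for t in term_tokens:
        curr = [0] * (len(hyp_tokens) + 1)
        for j, h in enumerate(hyp_tokens, start=1):
            if t == h:
                # Extend the run that ended at the previous token pair.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best

def lsm3(term, hypothesis, min_overlap=3):
    """Longest sub n-gram match: overlap length / term length, 0 below 3 grams."""
    term_tokens = term.split()
    overlap = longest_common_ngram(term_tokens, hypothesis.split())
    if overlap < min_overlap:
        return 0.0
    return overlap / len(term_tokens)
```

With this sketch, a six-token term whose first three tokens appear contiguously in the hypothesis scores 0.5, whereas an exact match scores 1.0 and a two-token overlap scores 0.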

Results and Analysis
For the legal domain, where many terms are exceptionally long compared to most other domains, our training scheme shows consistent improvements over the standard MLE counterparts, as shown in Table 3 and Table 4. Even in the extreme setting of law Ko→En, which is low-resourced and typologically divergent, our method is still effective on most of the metrics we use. Compared to the autoregressive models, GU19 and SUSANTO20 did not achieve competitive BLEU scores in our domain-specific setup. We suspect that this is due to both their complex decoding and the small amount of training data compared with the WMT data on which they were originally trained. Sampled translation results are reported in Table 9.
For the medical domain, the differences between the two baselines, VASWANI17 and CHEN20, are less clear-cut than in the legal domain. However, our training scheme shows consistent improvements in BLEU and in the micro-averaged Term% for long (>2-gram) terms, which reflects the global performance of long terminology generation. Similar to the legal De→En results, SUSANTO20 shows better performance on several long terminology metrics, but its BLEU score decreases by about 8 points compared to using no dictionary.

Conclusion
We propose a simple and effective training scheme for improving lexically constrained NMT by introducing the masked span prediction task on the decoder side. Our method shows its effectiveness in terms of terminology and sentence-level translation over the standard MLE training in highly specialized domains in two language pairs. As we publicly release our code and datasets, we hope that more people can join this area of research without much burden. In the future, we plan to further investigate applying our method to non-autoregressive methods.

[Appendix B algorithm listing; only lines 11 to 16 survive]
11: for (x', y') in D do
12:     if y' in y and x' in x then
13:         ngramlist.append(ngram(y'))
14:         y = y.replace(y', "", 1)
15:         x = x.replace(x', "", 1)
16:         T".append((x', y'))

Line 2 : Initialize a dictionary for storing paired sentences. The keys are the longest n-gram lengths for each sentence w.r.t. the target language.
Line 3 : Initialize a dictionary for storing matched terms. The keys are the indices of a corresponding sentence.
Line 13 : ngram() returns the token length of a term. In our case, it is used for calculating the length of a target-language term y'.
Line 14 : Replace y' with "" in y to avoid unwanted substring duplication (e.g., if a sentence contains both "public officer" and "officer", we would like to match "public officer" first instead of "officer" when "public officer" is in the dictionary. See Line 1).
Line 19 : Calculate the maximum length of n-grams in y.
Line 23 : Store the sentence index w.r.t. its longest length of n-grams.
Line 24 : Store the list of terms w.r.t. its sentence index.
Line 28 : DuplicateCheck() checks for duplication in the corpus and returns duplicate and non-duplicate indices. Note that i_dk is a list of duplicate sentence indices, and S_uk is a list of unique sentence indices.
Line 29 : Distributor_Dup() first calculates the number of sentences and phrases to be distributed across the train, valid, and test sets following the ratio in S, and then distributes the sentences accordingly.
Line 30 : Distributor_Uni() distributes unique sentences and phrases alternately between the train, valid, and test sets.
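The term-matching loop described in Lines 11 to 16 can be sketched in Python as follows. Only the loop body mirrors the listing; sorting the dictionary by target-term length (what Line 1 appears to do, judging from the note on Line 14), the function name, and the return values are assumptions for illustration, and the distribution steps of Lines 28 to 30 are not covered.

```python
def match_terms(src, tgt, dictionary):
    """Match dictionary terms against one sentence pair, longest terms first.

    dictionary: list of (source_term, target_term) pairs.
    Returns the matched pairs and the longest matched target n-gram length.
    """
    # Assumption: longer target terms are tried first so that "public officer"
    # is consumed before the shorter "officer" can match inside it.
    ordered = sorted(dictionary, key=lambda pair: len(pair[1].split()),
                     reverse=True)
    matched, ngram_lengths = [], []
    for src_term, tgt_term in ordered:
        if tgt_term in tgt and src_term in src:
            ngram_lengths.append(len(tgt_term.split()))
            # Remove the first occurrence so that it cannot be matched again.
            tgt = tgt.replace(tgt_term, "", 1)
            src = src.replace(src_term, "", 1)
            matched.append((src_term, tgt_term))
    return matched, max(ngram_lengths, default=0)
```

The longest n-gram length returned here corresponds to the key under which a sentence would be stored before the sentences are distributed across the splits.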

C Removing duplicates
As the OPUS datasets contain duplicate sentences (Aharoni and Goldberg, 2020), we further evaluate each model with unseen, unique test samples only. Similar to Tables 7 and 8, our training scheme outperforms its MLE counterparts. The Ko-En law corpus does not contain any duplicate sentences, and therefore the results are equivalent to those in Tables 3 and 4.

Table 9 shows translation results of the baselines and our method.