Towards Accurate Translation via Semantically Appropriate Application of Lexical Constraints



Introduction
Lexically-constrained neural machine translation (LNMT) is a task that aims to incorporate pre-specified words or phrases into translations (Hokamp and Liu, 2017; Dinu et al., 2019; Song et al., 2019; Susanto et al., 2020; Xu and Carpuat, 2021a; Chen et al., 2021a,b; Wang et al., 2022b, inter alia). It plays a crucial role in a variety of real-world applications where pre-defined source terms must be translated into accurate target terms, such as domain adaptation leveraging domain-specific or user-provided terminology. For example, as shown in Case A of Table 1, an LNMT model successfully translates the source term ("코로나") into its corresponding target term ("Covid-19") by adhering to a given lexical constraint ("코로나" → "Covid-19").
Despite its practicality, previous studies on LNMT have not evaluated their performance under challenging real-world conditions. In this paper, we focus on two important but understudied issues in the current evaluation process of previous LNMT studies.
Semantics of lexical constraints must be considered. In previous work, at test time, lexical constraints are automatically identified from the source sentences through an automatic string-matching process (Dinu et al., 2019; Ailem et al., 2021; Chen et al., 2021b). For example, in Case B of Table 1, a source term ("코로나") in the bilingual terminology is present as a substring in the source sentence. Accordingly, its corresponding target term ("Covid-19") is automatically bound together as a lexical constraint ("코로나" → "Covid-19") without considering the semantics of the matched source term, which can lead to a serious mistranslation. Automatic string-matching cannot differentiate textually identical yet semantically different source terms. Thus, the more faithfully an LNMT model enforces the lexical constraint, the more severe the homograph issue becomes. To address this homograph issue, LNMT systems must be equipped to understand the semantics of identified lexical constraints and determine whether or not these constraints should be imposed.
Unseen lexical constraints need to be examined. One desideratum of LNMT systems is robustness to "unseen" lexical constraints, so that they can respond to random, potentially neologistic, or technical terms that users might bring up. However, in previous studies, a significant portion of the lexical constraints is exposed during training. Wang et al. (2022b) reported the overlap ratio of lexical constraints between the training and evaluation data (35.6% on average), and Zeng et al. (2022) likewise raised the issue of test-set lexical constraints appearing frequently in the training data. When lexical constraints are included in the training examples, we find that a well-optimized vanilla Transformer (Vaswani et al., 2017) already satisfies lexical constraints by merely learning the alignment between the source and target terms co-occurring in the parallel training sentences (we observe that the vanilla Transformer achieves a 66.67% copy success rate). This makes it difficult to tell whether the presence of target terms in the output is attributable to the learned alignment or to the components proposed in previous studies. Therefore, it is important to control for lexical constraints not exposed during training in order to examine a model's ability to cope with "unseen" lexical constraints.
In response, we present a test benchmark for evaluating LNMT models under these two critical issues. Our benchmark is specifically crafted not only to evaluate the translation performance of LNMT models but also to assess their ability to discern whether given lexical constraints are semantically appropriate. To the best of our knowledge, we are the first to release a hand-curated, high-quality test benchmark for LNMT. Concurrently, we suggest a pipeline that allows researchers in the LNMT community to simulate realistic test conditions that consider the homograph issue and assign "unseen" lexical constraints.
To this end, we propose a two-stage framework to deal with these issues. We first develop a homograph disambiguation module that determines whether LNMT models should apply a given lexical constraint by evaluating its semantic appropriateness. Further, we propose an LNMT model that integrates provided lexical constraints more effectively by learning when and how to apply them. Our contributions are summarized as follows:
• We formulate the task of semantically appropriate application of lexical constraints and release a high-quality test benchmark to encourage LNMT researchers to consider real-world test conditions.
• We propose a novel homograph disambiguation module to detect semantically inappropriate lexical constraints.
• We present an LNMT model which shows the best translation quality and copy success rate in unseen lexical constraints.

HOLLY Benchmark
Here, we introduce HOLLY (homograph disambiguation evaluation for lexically constrained NMT), a novel benchmark for evaluating LNMT systems under two circumstances: the assigned lexical constraints are either semantically appropriate or not, as illustrated in Table 2. The entire test data includes 600 test examples covering 150 Korean → English lexical constraints.
Table 2 (test examples, shown here without their Korean source sentences):
(a) A pregnant woman on the verge of labor due to her amniotic fluid breaking gave birth to a healthy child thanks to the help of the air force.
(b) As my regular gynecologist recommended an amniotic fluid test, I took the test and came back.
(c) In mathematics, while you can omit the '+' symbol when indicating a positive number, you must mark the '-' one before numbers when meaning negative.
(d) It turns out that recently, major shareholders of the listed corporates have been active in handovers and takeovers.

Positive references (glosses of the Korean example sentences): (There are cases where the amniotic fluid bursts sooner than the expected date of birth.) (The amniotic fluid was tested to find out if there were any abnormalities with the fetal chromosomes.)

While the source term is a homograph with multiple meanings, one of them is chosen to serve as its lexical constraint. Then, based on the meaning of the source term in the source sentence, each test example is classified into one of two groups: positive or negative examples.

Positive References
As seen in Table 3, we provide two auxiliary source-side example sentences demonstrating the specific use of the source term of each lexical constraint, assuming that the meaning can be differentiated by the context in which the term is used rather than by the terminology itself. Hereafter, we refer to these example sentences as positive references.

Methodology
Our methodology for semantically appropriate application of lexical constraints consists of two stages. First, we propose a homograph disambiguation module that can differentiate the semantics of lexical constraints. This module determines whether LNMT models should incorporate a given lexical constraint. Subsequently, an LNMT model, PLUMCOT in our case, performs the translation, either with or without the given lexical constraints.

Homograph Disambiguation
Given a few example sentences demonstrating how a word is used in a specific sense, humans can infer its proper meaning. Likewise, our conjecture is that we can fulfill the homograph disambiguation task by leveraging these inter-sentential relationships.

Task Specification
Given n example sentences illustrating one specific meaning of a homograph, our homograph disambiguation module aims to determine whether the same word in a newly given sentence, denoted as 'New Sentence' in Fig. 1, carries the same meaning (label: 1) or not (label: 0). We conducted experiments with two example sentences (i.e., n = 2), and the corresponding model architecture is described in Section 3.1.2.
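For concreteness, one instance for this module can be sketched as follows; the field names and the Korean sentences are purely illustrative and are not the benchmark's actual schema.

```python
# A minimal sketch of one homograph-disambiguation instance (n = 2).
# Field names and sentences are illustrative, not the benchmark's schema.
example = {
    "homograph": "소화",                    # the shared surface form
    "references": [                         # n sentences fixing one meaning
        "소화가 잘 되는 음식을 먹었다.",        # (digestion sense)
        "위장이 약해서 소화가 느리다.",
    ],
    "new_sentence": "소방관들이 화재를 소화했다.",  # (fire-extinguishing sense)
    "label": 0,  # 1 if the homograph carries the same meaning, else 0
}
```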

Model Architecture
Input Representations As illustrated in Fig. 1, sentence embeddings of the example sentences and the new sentence are individually obtained from the PLM and fed into the classifier. The embedding vector for each sentence is extracted from the averaged hidden representations of the last K layers of a frozen PLM. Specifically, the embedding vector is the average of the hidden representations of the tokens that make up the homograph within the sentence. We denote this averaging operation as Pooling in Fig. 1.
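A minimal sketch of this pooling step, assuming the HuggingFace transformers API, a fast tokenizer for klue/roberta-base, and that the character span of the homograph is known; the function name and K = 4 are our own choices for illustration.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Sketch: embed the homograph occurrence in one sentence by averaging the
# hidden states of its subword tokens over the last K layers of a frozen PLM.
tokenizer = AutoTokenizer.from_pretrained("klue/roberta-base")
plm = AutoModel.from_pretrained("klue/roberta-base").eval()

def homograph_embedding(sentence: str, span: tuple, k: int = 4) -> torch.Tensor:
    """span = (start, end) character offsets of the homograph in `sentence`."""
    enc = tokenizer(sentence, return_tensors="pt", return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0]            # (seq_len, 2)
    with torch.no_grad():
        out = plm(**enc, output_hidden_states=True)
    # Average the last K layers (hidden_states[0] is the embedding layer).
    layers = torch.stack(out.hidden_states[-k:]).mean(dim=0)[0]  # (seq_len, m)
    # Keep only subword tokens overlapping the homograph's character span.
    mask = (offsets[:, 0] < span[1]) & (offsets[:, 1] > span[0])
    return layers[mask].mean(dim=0)                   # "Pooling" in Fig. 1
```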
Binary Classifier Similar to Sentence-BERT (Reimers and Gurevych, 2019), we use the concatenation z ∈ R^{6m+3} of the following as input to the classifier: the embeddings u_1, u_2 of the two example sentences and v of the new sentence, the element-wise difference for each pair, and the cosine similarity for each pair:

z = [u_1; u_2; v; |u_1 − u_2|; |u_1 − v|; |u_2 − v|; sim(u_1, u_2); sim(u_1, v); sim(u_2, v)],

where m is the dimension of the embeddings and sim(⋅, ⋅) denotes the cosine similarity function. Our prediction o ∈ [0, 1] for a "New Sentence" is calculated as

o = σ(W^⊤ (W_r^⊤ z + b_r) + b),

where W_r ∈ R^{(6m+3)×m} and b_r are the weight matrix and bias vector of an intermediate layer, respectively, W ∈ R^{m×1} and b are the weight matrix and bias vector of the final prediction layer, and σ(⋅) is the sigmoid function.
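The classifier itself can be sketched in PyTorch as below; the tanh activation of the intermediate layer is an assumption, as the text does not specify it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HomographClassifier(nn.Module):
    """Sketch of the binary classifier; feature sizes follow z in R^(6m+3)."""
    def __init__(self, m: int):
        super().__init__()
        self.intermediate = nn.Linear(6 * m + 3, m)   # W_r, b_r
        self.out = nn.Linear(m, 1)                    # W, b

    def forward(self, u1, u2, v):
        # u1, u2: embeddings of the two example sentences; v: new sentence.
        sims = torch.stack([
            F.cosine_similarity(u1, u2, dim=-1),
            F.cosine_similarity(u1, v, dim=-1),
            F.cosine_similarity(u2, v, dim=-1),
        ], dim=-1)                                    # 3 cosine similarities
        z = torch.cat([u1, u2, v,                     # 3m
                       (u1 - u2).abs(),               # element-wise differences
                       (u1 - v).abs(),
                       (u2 - v).abs(),                # + 3m
                       sims], dim=-1)                 # + 3  ->  6m + 3
        # tanh between the two layers is our assumption, not the paper's spec.
        return torch.sigmoid(self.out(torch.tanh(self.intermediate(z))))
```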

PLUMCOT
In this subsection, we introduce our LNMT model, PLUMCOT, which stands for leveraging a pre-trained language model with direct supervision on a copying score for LNMT, and its detailed implementation. To better incorporate target terms into the translations, PLUMCOT combines LeCA (Chen et al., 2021b) with a PLM and strengthens the pointer network with supervised learning of the copying score.

Problem Statement
Given a source sentence X and its translation Y = (y_1, …, y_T), together with a set of n lexical constraints C_{1:n}, where the i-th constraint C_i = (C_{i,S}, C_{i,T}) consists of the source term C_{i,S} and corresponding target term C_{i,T}, LNMT aims to incorporate C_{1:n,T} into its generation. The conditional probability of LNMT can be defined as

p(Y | X; θ) = ∏_{t=1}^{T} p(y_t | y_{<t}, X; θ). (2)

We modify the source sentence X into X̃ by appending <sep> tokens followed by target terms, as illustrated in Table 4. If there are no lexical constraints, the source sentence remains the same, i.e., X̃ = X. Combining a source sentence with target terms leads to the modification of Eq. (2) as follows:

p(Y | X̃; θ) = ∏_{t=1}^{T} p(y_t | y_{<t}, X̃; θ). (3)

Integration of PLM
Since a PLM such as BERT (Devlin et al., 2019) is trained on large amounts of unlabeled data, leveraging a PLM for LNMT can provide rich contextualized information about X, even in controlled unseen-lexical-constraint scenarios. We first feed the source sentence X to a frozen PLM to obtain a representation B of the source sentence, where B is the output of the last layer of the PLM. Meanwhile, our NMT model, based on Vaswani et al. (2017), receives the modified source sentence X̃ as input.
Table 4: Input modification. Expected target terms are appended to the end of the source sentence. The source term 코로나 can be pronounced as "co-ro-na". (Example translations: "Will the economy ever fully recover to before Covid-19 levels?" and "Corona Extra has been the top-selling imported drink in the U.S. since 1998.")
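A minimal sketch of this input modification; the exact separator string and the substring-based matching are assumptions for illustration.

```python
def modify_source(source: str, constraints: list, sep: str = "<sep>") -> str:
    """Append target terms of matched constraints to the source sentence.

    `constraints` holds (source_term, target_term) pairs; only pairs whose
    source term occurs in the sentence contribute a target term. The exact
    separator token is an assumption for illustration.
    """
    targets = [tgt for src, tgt in constraints if src in source]
    if not targets:          # no lexical constraints: X-tilde = X
        return source
    return source + " " + " ".join(f"{sep} {tgt}" for tgt in targets)

# e.g. modify_source("코로나 이전 수준으로 회복될까?", [("코로나", "Covid-19")])
# -> "코로나 이전 수준으로 회복될까? <sep> Covid-19"
```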
Let L denote the number of encoder and decoder layers of the NMT model, H^l be the output of the NMT encoder at the l-th layer, and h_t^l denote the t-th element of H^l. For each layer l ∈ [1, L], we employ multi-head attention over the output of the PLM as in Zhu et al. (2019), denoted as MHA_B. This maps the output of the NMT encoder at the (l−1)-th layer into queries and the output of the PLM, B, into keys and values (see Appendix D for more details). The output of the t-th element of the NMT encoder at the l-th layer is given by

h̃_t^l = LN(1/2 (MHA(h_t^{l−1}, H^{l−1}, H^{l−1}) + MHA_B(h_t^{l−1}, B, B)) + h_t^{l−1}),
h_t^l = LN(FFN(h̃_t^l) + h̃_t^l), (4)

where LN(⋅) denotes layer normalization as in Ba et al. (2016), and MHA(⋅) and FFN(⋅) are the multi-head attention and feed-forward network, respectively. Similar to the encoder, multi-head attention over the PLM output is introduced for each decoder layer (see Appendix F for more details). Combined with Section 3.2.3, a highly contextualized representation is given to the pointer network.
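The following PyTorch sketch illustrates one such fused encoder layer. The exact residual and normalization placement, and the assumption that the PLM hidden size matches (or has been projected to) the NMT dimension, are ours rather than the paper's specification.

```python
import torch.nn as nn

class FusedEncoderLayer(nn.Module):
    """Sketch of one PLM-fused NMT encoder layer (BERT-fused style)."""
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.plm_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)  # MHA_B
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, h, b):
        # h: NMT encoder states H^{l-1}; b: frozen PLM output B.
        s, _ = self.self_attn(h, h, h)   # MHA over NMT states
        p, _ = self.plm_attn(h, b, b)    # MHA_B: queries from NMT, keys/values from PLM
        h = self.ln1(h + 0.5 * (s + p))  # average the two attention branches
        return self.ln2(h + self.ffn(h))
```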

Supervision on a Copying Score
Pointer Network To copy target terms from X̃, we introduce a pointer network (Gu et al., 2016) as in Song et al. (2019) and Chen et al. (2021b). At each time step, the pointer network takes in the output of the encoder and outputs a copying score g_t^{copy} ∈ [0, 1], which controls how much to copy.
The output probability of the target word y_t can be calculated as

p(y_t | y_{<t}, X̃; θ) = g_t^{copy} ⋅ p_t^{copy} + (1 − g_t^{copy}) ⋅ p_t^{word}, (5)
where p_t^{copy} is the probability of copying, and p_t^{word} is the probability of the target word y_t in the vocabulary.

Copying Score As implied by Eq. (5), an inaccurately predicted g_t^{copy} results in a failure to copy target terms. However, previous research on LNMT has relatively understated the importance of the copying score. Even when the probability of copying p_t^{copy} is high, an incorrect copying score can lower the output probability of the target terms. Therefore, we propose a novel supervised learning scheme for the copying score g_t^{copy} to obtain a more accurate value.
Our supervision of the copying score strengthens the copy mechanism of the pointer network by allowing the model to learn exactly when to copy. Since target terms appear in the modified source sentence, we can determine which words should be copied from it. For example, when translating the source sentence in Table 4, the appended target term, Covid-19, must be copied. Thus, the copying score g_t^{copy} should be high for the target term Covid-19 and low for the remaining words in the target sentence. Our training objective can be defined as

L = L_{NMT} + λ L_{copy}, L_{copy} = −α ∑_t ĝ_t log g_t^{copy} − β ∑_t (1 − ĝ_t) log(1 − g_t^{copy}), (6)

where the gold copying score ĝ_t is set to zero for t ∈ {t | y_t ∉ C_{1:n,T}} and to one for t ∈ {t | y_t ∈ C_{1:n,T}}. To mitigate the length imbalance between the target terms and the remaining words in the target sentence, we set α and β by dividing the respective lengths of the two groups by the total length.
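A sketch of the mixture in Eq. (5) and one plausible reading of the copying-score objective; the exact normalization of α and β in the paper may differ.

```python
import torch

def output_prob(g_copy, p_copy, p_word):
    # Eq. (5): mix copy and generation distributions with the copying score.
    return g_copy * p_copy + (1.0 - g_copy) * p_word

def copy_score_loss(g_copy, gold):
    """Supervised copying-score loss, a sketch of L_copy in Eq. (6).

    g_copy: predicted scores, shape (T,); gold: 1.0 where y_t is a target
    term, else 0.0. The alpha/beta weighting below is one plausible reading
    of the paper's length-based normalization.
    """
    T = gold.numel()
    n_copy = gold.sum()
    alpha, beta = n_copy / T, (T - n_copy) / T
    eps = 1e-9
    loss_copy = -(gold * torch.log(g_copy + eps)).sum()
    loss_rest = -((1 - gold) * torch.log(1 - g_copy + eps)).sum()
    return alpha * loss_copy + beta * loss_rest
```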

Experiments on the HOLLY benchmark
In this section, we report the performance of our methodology on the HOLLY benchmark. In Section 4.1, we evaluate the performance of our homograph disambiguation module in determining the semantic appropriateness of a lexical constraint. In Section 4.2, we assess the performance of LNMT models on positive examples from the HOLLY benchmark under conventional settings. Subsequently, we investigate the potential advantages that the homograph disambiguation module brings when applied to the negative examples from the HOLLY benchmark.

Data
Here, we present our dataset for training the homograph disambiguation module. Our training data was collected from the Korean dictionary, and we manually inspected the quality of each sentence. In line with Fig. 1, each example consists of a triplet of example sentences containing a common homograph. Depending on the inter-sentential relationships between the input sentences, the homograph disambiguation module outputs a binary label: "1" is assigned if the homograph carries the same meaning in all sentences, and "0" if it is used differently in one example sentence. Brief statistics of the training data are reported in Table 5. At test time, we evaluated our model on the HOLLY benchmark. Specifically, for each lexical constraint, the two positive references (see Table 3) and one of the four test example sentences ((a), (b), (c), or (d) in Table 2) are given as a triplet.

Results
We conducted experiments with two well-known variants of PLM trained on Korean corpora: klue/roberta-base and klue/roberta-large. Our homograph disambiguation module achieved a test accuracy of 88.7% and 92.3% when using klue/roberta-base and klue/roberta-large, respectively. In spite of the imbalanced data distribution shown in Table 5, precision and recall are balanced in both classes, as shown in Table 6.

We pre-tokenized the Korean corpora with Mecab and built a joint vocabulary for both languages by learning a Byte Pair Encoding (Sennrich et al., 2016) model in sentencepiece (Kudo and Richardson, 2018) with 32K merge operations.
To simulate unseen lexical constraints, we filtered out about 160K training sentence pairs containing test lexical constraints on both sides when experimenting with the HOLLY benchmark. This filtering process is crucial for examining how the models cope with any lexical constraints that users might introduce.
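A simplified sketch of this filtering step; plain substring matching stands in for the paper's actual matching procedure.

```python
def filter_seen_constraints(pairs, constraints):
    """Drop training pairs that contain a test lexical constraint on both sides.

    pairs: iterable of (src_sentence, tgt_sentence); constraints: iterable of
    (src_term, tgt_term). A pair is removed when a constraint's source term
    appears in the source sentence and its target term appears in the target
    sentence, so all test constraints remain unseen during training.
    """
    kept = []
    for src, tgt in pairs:
        if any(cs in src and ct in tgt for cs, ct in constraints):
            continue
        kept.append((src, tgt))
    return kept
```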

Evaluation
Metrics We evaluated the performance of our model in terms of BLEU and the copy success rate (CSR). CSR measures the ratio of imposed lexical constraints that are satisfied in the translations. For statistical significance testing, we use compare-mt (Neubig et al., 2019) with p = 0.05 and 1,000 bootstraps.
Test Scenarios There were two important test cases, as shown in Table 7. Given a source sentence, we first consider the Soft Matching test case. (The AI HUB data can be found here: https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=126.)

Table 7: Two test scenarios (Soft Matching and Hard Matching) are described with one of the test examples in the HOLLY benchmark. The source term 소화 can be pronounced as "so-hwa". Test example: "The human teeth function to break down food items into comfortably swallowable pieces, helping digestion."
The Soft Matching test case allows some morphological variations, as introduced in Dinu et al. (2019). As illustrated in Table 7, since the Korean word 소화 can take multiple different forms via inflection, any one of the expected candidates (digest, digestion, and digestive) appearing in the translation is considered correct in terms of CSR.
We also have the Hard Matching test case, where the exact target term (e.g., digestion) that appears in the reference has to be incorporated into the translation. Note that this cannot be tested on negative examples, since the target terms of their lexical constraints do not appear in the references.
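A rough sketch of CSR under both scenarios; substring matching and lower-casing are simplifications of the actual evaluation.

```python
def csr(translations, constraints_per_sent, soft=True):
    """Copy success rate: fraction of imposed constraints met in the output.

    constraints_per_sent: for each translation, a list of constraints. Under
    Soft Matching each constraint may be a set of acceptable surface forms,
    e.g. {"digest", "digestion", "digestive"}; under Hard Matching it is the
    single exact term drawn from the reference.
    """
    met = total = 0
    for hyp, constraints in zip(translations, constraints_per_sent):
        for cand in constraints:
            total += 1
            forms = cand if soft and isinstance(cand, (set, list)) else [cand]
            met += any(f.lower() in hyp.lower() for f in forms)
    return met / max(total, 1)
```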

Baselines
• Code-Switching (CS) (Song et al., 2019) replaces source terms with the aligned target terms and learns to copy them via a pointer network.
• LeCA (Chen et al., 2021b) modifies the source sentence as described in Table 4 and utilizes a pointer network during training.

Main Results
Simulating Unseen Lexical Constraints Table 8 reports the performance of the vanilla Transformer on positive examples; without the filtering described above, lexical constraints can be memorized by the network during training.

Effect of Homograph Disambiguation In negative examples, the given lexical constraints are irrelevant to the context. By removing inappropriate constraints, all the models achieve a consistent and statistically significant improvement in translation quality by a large margin.

Ablation Study
We study the effect of each component of PLUMCOT, and the results are provided in

Qualitative Analysis
Table 11 provides translated examples. Given a lexical constraint, PLUMCOT incorporates the target term correctly. In a negative example, the meaning of 세제 is properly translated into detergent by PLUMCOT with correction. We provide more examples in Table 15.
Related Work

Lexically-constrained NMT
Recent work on LNMT broadly falls into two categories: decoding algorithms and inline annotation. Decoding algorithms enforce target terms to appear in the output during beam search (Hokamp and Liu, 2017; Anderson et al., 2017; Chatterjee et al., 2017; Hasler et al., 2018). This approach ensures a high CSR, but the decoding speed is significantly degraded. To alleviate this issue, Post and Vilar (2018) suggest a decoding algorithm with a complexity of O(1) in the number of constraints. Another variation on decoding algorithms utilizes word alignments between source and target terms (Song et al., 2020; Chen et al., 2021a).
In inline annotation studies, the model is trained to copy target terms via modification of the training data: either a source term is replaced with the corresponding target term, or the target term is appended to the source sentence (Song et al., 2019; Dinu et al., 2019; Chen et al., 2021b).
Concurrently, Bergmanis and Pinnis (2021), Niehues (2021), and Xu and Carpuat (2021b) consider the morphological inflection of lexical constraints during the integration of target terms. While these methods incur only a slight computational cost and provide better translation quality, target terms are not guaranteed to appear (Chen et al., 2021a; Wang et al., 2022a). To better copy target terms from a source sentence, a pointer network (Vinyals et al., 2015; Gulçehre et al., 2016) that uses attention weights to copy elements from the source sentence has been introduced (Gū et al., 2019; Song et al., 2019; Chen et al., 2021b). In this work, we further enhance the copying mechanism of the pointer network via supervised learning of a copying score, which achieves better performance in terms of BLEU and CSR.

Reference: Reducing transaction taxes and raising possession taxes are the core principles of the real estate tax system, but it is challenging to get them applied.
Vanilla: Reducing transaction taxes and strengthening holding taxes are the grand principles of real estate taxes, but it is also difficult to apply them.
LeCA: Reducing transaction taxes and strengthening holding taxes are the main principles of real estate tax, but it is difficult to apply them.
PLUMCOT: The main principle of the real estate tax system is to reduce transaction taxes and strengthen holding taxes, but it is difficult to apply them.

Reference: For those who spend more than 50 thousand won for purchasing items during this event, kitchen detergents will be given as a gift.
PLUMCOT w/o correction: The kitchen tax system will be given as a prize to customers who purchase more than 50,000 won during this event.
PLUMCOT w/ correction: Customers who purchase more than 50,000 won during this event will receive a gift of kitchen detergent.

Table 11: Example translations for positive and negative examples. The source term 세제 can be pronounced as "se-je".

Homograph Issue in LNMT
Michon et al. (2020) point out the homograph issue in LNMT in an in-depth error analysis of their model. To the best of our knowledge, the homograph issue was first explicitly addressed by Öz and Sukhareva (2021). In their work, given a homographic source term, the most frequent alignment is selected as its correct lexical constraint, while the other alignments are treated as negative terms that should be avoided in the translation. However, low-frequency meanings are important for LNMT, since users are not guaranteed to always bring up generic terminology. Different from their method, our homograph disambiguation module infers the meaning of lexical constraints and decides whether to impose them. Furthermore, we confirm that our method works equally well on "unseen" homographs.

Integration of PLM with NMT
Following the success of PLMs, researchers have attempted to distill the knowledge of PLMs into NMT (Zhu et al., 2019; Weng et al., 2020; Xu et al., 2021). BERT-fused (Zhu et al., 2019) is one such method; it plugs the output of BERT into the encoder and decoder via multi-head attention. We borrow the idea from BERT-fused and, for the first time, combine LNMT with a PLM, which works well even on "unseen" lexical constraints by leveraging the rich contextual information of the PLM.

Conclusions
In this paper, we investigate two unexplored issues in LNMT and propose a new benchmark named HOLLY. To address the homograph issue of the source terms, we built a homograph disambiguation module that infers the exact meaning of the source terms. We confirm that our homograph disambiguation module alleviates mistranslations caused by semantically inappropriate lexical constraints. We also propose PLUMCOT, which improves LNMT by using the rich information of a PLM and by strengthening its copy mechanism via direct supervision of a copying score. Experiments on our HOLLY benchmark show that PLUMCOT significantly outperforms existing baselines in terms of BLEU and CSR.

Limitations
Our study has some limitations that must be addressed. Some test examples might receive wrong predictions from the homograph disambiguation module. Specifically, for positive examples, where lexical constraints should be imposed, its errors result in wrong corrections (i.e., the elimination of necessary lexical constraints). Table 12 shows how these erroneous corrections affect the results. We observe an overall decline in CSR; however, translation quality is not hurt: we verify that the differences in BLEU resulting from wrong corrections are not statistically significant for any of the methods. Considering the gain achieved on negative examples, as seen in Fig. 2, our proposed homograph disambiguation module can serve as a useful starting point for addressing homographs in LNMT; however, there is still room for improvement. Our current homograph disambiguation module is designed as a stand-alone system outside the LNMT model. Building an end-to-end system could be beneficial, which we leave for future work.

A HOLLY benchmark
We collected monolingual example sentences containing one of the pre-specified homographs from the Korean dictionary. For each homograph, the retrieved example sentences were classified into multiple groups according to their meanings. We chose the group with the least frequent meaning and used its examples as positive references to determine a lexical constraint for the homograph. Examples from the other groups were treated as negative references. Eventually, six reference sentences were collected for each homograph: four positive references and two negative references.
Setting aside two positive references for homograph disambiguation, as stated in Table 3, we outsourced the translation of the positive and negative examples, as introduced in Table 2. For positive examples, professional translators were requested to translate the source terms of lexical constraints into the pre-defined target terms. We guided the translators to translate negative examples carefully, focusing on the exact meaning of the lexical constraints.

B Implementation details

B.1 Configuration of PLUMCOT
We implemented PLUMCOT and all the baseline models based on fairseq (Ott et al., 2019). We matched the embedding dimensions, the number of layers, and the number of attention heads across all models for a fair comparison. PLUMCOT was trained from scratch, and klue/roberta-large (Park et al., 2021) was used as our PLM.

B.2 Computational Cost
All the experiments were conducted on a single A100 GPU. It takes about 84 hours to train PLUMCOT and 5 hours to train the homograph disambiguation module. The numbers of trainable / total parameters are 156M / 493M for PLUMCOT and 6M / 343M for the homograph disambiguation module.

C Ablation Studies
C.1 Weights of the supervised learning of a copying score

The results in Table 13 are reported according to different weights λ of the supervised learning of the copying score in Eq. (6). Based on the experimental results, we found that λ = 0.2 offers a good compromise.

C.2 Number of example sentences
We experimented with a varying number of example sentences. As we use more example sentences, the information from the inter-sentential relationships becomes richer, eventually improving homograph disambiguation performance. Experiments with n = 1, 2, and 3 show accuracies of 91.33%, 92.33%, and 92.67%, respectively. Although the experiment with n = 3 provides the best accuracy, collecting positive references can be burdensome for users. Therefore, we conclude that n should be chosen by considering this trade-off.

C.3 Randomly Sampled Test Constraints
Different from our HOLLY benchmark, in previous studies, test-time lexical constraints were randomly sampled from the alignments in each sentence pair (Dinu et al., 2019; Song et al., 2019; Chen et al., 2021b,a; Wang et al., 2022b). Ten different test sets were built based on ten randomly sampled sets of lexical constraints, as described in Chen et al. (2021a,b). Test statistics are reported in Table 14. PLUMCOT achieves the highest BLEU. Its CSR is slightly lower than that of Cdalign, indicating that the gain for "seen" constraints is insignificant.

D Equation Details
Let Q, K, and V be the query, key, and value in Vaswani et al. (2017), respectively. Then MHA in Eq. (4) and Eq. (8) can be calculated as

MHA(Q, K, V) = Concat(head_1, …, head_h) W^O,
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V),
Attention(Q, K, V) = softmax(Q K^⊤ / √d_k) V,

where the projection matrices W_i^Q, W_i^K, W_i^V, and W^O are parameters.

E Input data augmentation
As illustrated in Table 4, we modify a source sentence X into X̃ by appending <sep> tokens followed by target terms. Since lexical constraints are domain-specific or user-provided terminology, we exclude the top 1,500 frequent words from the 32K joint dictionary. In our training, we randomly …

Positive Example (Lexical Constraint: …):
Reference: New types of pathogens with resistance to antibiotics have emerged, threatening public health.
Vanilla: A new type of pathogens that are tolerant of antibiotics have emerged, threatening the health of the people.
LeCA: A new type of pathogen that is tolerant of antibiotics has emerged, threatening the health of the people.
PLUMCOT: A new type of pathogen that has resistance to antibiotics has emerged, threatening the health of the people.

Reference: His being quiet is because of his introverted personality.

Reference: The company explained that the contract cancellation was decided because of the reason relevant to contract procedures between the contract parties.
Vanilla: The company explained that it decided to nullify the contract because of the procedurality of the contract between the parties.
LeCA: The company explained that it decided to nullify the contract because of the procedurality of the contract between the parties.
PLUMCOT: The company explained that it decided to nullify the contract on the reason of the procedure of the contract between the parties.

Negative Example:
Reference: In capitalist countries, private assets can be disposed of according to the will of their owners.
PLUMCOT w/o correction: In capitalist countries, private property can be disposed of according to the reason of the owner.
PLUMCOT w/ correction: In capitalist countries, private property can be disposed of according to the owner's will.

Figure 1: Structure of the homograph disambiguation module.
Table 1: Automatically retrieved lexical constraints. Case A translation: "Will the economy ever fully recover to before Covid-19 levels?" Case B mistranslation: "Covid-19 Extra has been the top-selling imported drink in the U.S. since 1998." (here the source term 코로나 refers to the beer brand Corona, not Covid-19; another terminology entry is 선별진료소).

Table 5: Data statistics for the homograph disambiguation task. Note that homographs are not allowed to overlap across the train, validation, and test datasets.

Table 6: Performance of the homograph disambiguation module leveraging klue/roberta-large on the HOLLY benchmark, reported in terms of precision, recall, and F1 for each class.

Table 8: Performance of the vanilla Transformer on positive examples. Without filtering, lexical constraints can be memorized by the network during training.

The main results are compared in Table 9. PLUMCOT outperforms all the baselines in both metrics by a large margin. Since we simulate unseen lexical constraints, the external information from the PLM contributes to the increase in BLEU. Combined with the supervision on a copying score, PLUMCOT achieves the highest CSR. The overall BLEU scores for Hard Matching are greater than those for Soft Matching, as the expected target terms drawn from the reference translations are given to the models.

Figure 2: Effect of homograph disambiguation tested on negative examples. "w/ correction" refers to the removal of semantically inappropriate lexical constraints determined by the homograph disambiguation module. CSR was evaluated on Soft Matching.

Table 12: Effect of homograph disambiguation tested on positive examples on Soft Matching.

Table 13: Results of BLEU and CSR according to different λ.

Table 14: Results on randomly sampled test lexical constraints. Statistics are drawn from five randomly constructed test datasets.

Table 15: More translations for positive and negative examples.

Table 16: Hyperparameters and model configuration of PLUMCOT.