Lexical Translation Inconsistency-Aware Document-Level Translation Repair

Following the idea of “one translation per discourse”, in this paper we aim to improve translation consistency via document-level translation repair (DocRepair), i.e., automatic post-editing on translations of documents. To this end, we propose a lexical translation inconsistency-aware DocRepair model to explicitly model translation inconsistency. First, we locate the inconsistencies in the automatic translation. Then, we provide translation candidates for those inconsistencies. Finally, we propose lattice-like input to properly model inconsistent phrases and their candidates. Experimental results on three document-level translation datasets show that, based on G-Transformer, a state-of-the-art document-to-document (Doc2Doc) translation model, our Doc2Doc DocRepair not only achieves improvement in translation quality in BLEU scores, but also greatly improves lexical translation consistency.


Introduction
Although neural machine translation (NMT) has made remarkable progress (Bahdanau et al., 2015; Vaswani et al., 2017), sentence-level NMT still suffers from the serious problem of lexical translation inconsistency due to the lack of inter-sentence context. To better model inter-sentence context, previous studies in document-level NMT propose various context-aware models which use sentences in the wider document context, thus implicitly learning discourse correlations as a by-product of optimising an NMT model (Maruf et al., 2022). However, as these models rarely try to model discourse phenomena explicitly, there is still much room for improvement on discourse phenomena. In this paper, we follow the idea of "one translation per discourse" (Merkel, 1996; Carpuat, 2009; Türe et al., 2012; Guillou, 2013; Khotaba and Tarawneh, 2015) and focus on lexical translation consistency, which is one of the most serious issues in document-level (Chinese-to-English) translation (Kang et al., 2021; Lyu et al., 2021b). Our goal is to improve translation consistency via document-level translation repair (DocRepair for short (Voita et al., 2019)), i.e., automatic post-editing on translations of documents.
Figure 1 shows an example of an input document and its translation from both state-of-the-art sentence-level and document-level NMT models.
The source words like 孙燕姿/sun_yan_zi, 沙尘暴/sha_chen_bao and 当地人/dang_di_ren, each occurring two or more times within the source document, unexpectedly get different translations, while they are translated consistently in the reference (human translation). For example, the person name 孙燕姿/sun_yan_zi is translated into both sun yen-tzu and sun yanzi by sentence-level NMT. Such inconsistent translations tend to confuse readers. Moreover, even context-aware document-level NMT models like G-Transformer (Bao et al., 2021) cannot alleviate this phenomenon well, as shown in the figure. Very few studies in document-level NMT explicitly encourage lexical translation consistency. Lyu et al. (2021b) obtain a word link for each source word in a document and exchange their context information in encoding, using an auxiliary loss to constrain their translations to be consistent. Kang et al. (2021) and Lyu et al. (2022) both construct source-side lexical chains and use different approaches to learn (or model) translations for tokens within the same lexical chain. Different from the above studies, which encourage translation consistency in the translation process, in this paper we aim to improve translation consistency via DocRepair. Different from Voita et al. (2019), which implicitly learns inconsistency within document translation, we propose a lexical translation inconsistency-aware DocRepair model to explicitly correct translation inconsistency. Given the automatic translation T of a document S, from either sentence-level or document-level NMT, this is done by the following steps. First, in translation T we locate inconsistent phrases, each of which consists of one or more consecutive tokens. Then, we provide translation candidates for those inconsistent phrases. Finally, we adapt G-Transformer, a state-of-the-art document-to-document translation model, to repair the document-level translation T equipped with inconsistent phrases and their candidates.
Overall, we make the following contributions.
• We propose a novel approach to repair translations of documents with the explicit aim of correcting translation inconsistency. In this approach, we use lattice-like input to model inconsistent phrases and their candidate translations.
• Experimental results on three document-level translation datasets show that, given translations from either sentence-level or document-level NMT models, our DocRepair approach not only improves translation performance in BLEU, but also greatly improves lexical translation consistency.

Problem Statement
Formally, we use $\mathcal{S} = \{S^{(k)}\}_{k=1}^{K}$ to denote a source-side document composed of $K$ source sentences, and assume each source-side sentence $S^{(k)} = \{s^{(k)}_i\}_{i=1}^{I}$ contains $I$ words. We use $\mathcal{T} = \{T^{(k)}\}_{k=1}^{K}$ to denote its automatic translation and $T^{(k)} = \{t^{(k)}_j\}_{j=1}^{J}$ to represent the automatic translation of the $k$-th sentence in $\mathcal{S}$. Finally, we use $\mathcal{Y} = \{Y^{(k)}\}_{k=1}^{K}$ and $Y^{(k)} = \{y^{(k)}_m\}_{m=1}^{M}$ to denote the corresponding target-side gold document and the gold translation of the $k$-th sentence, respectively.
Therefore, assuming that the repair is done in a left-to-right way, we can decompose the document-level repair probability as
$$P(\mathcal{Y} \mid \mathcal{T}, \mathcal{S}) = \prod_{k=1}^{K} P(Y^{(k)} \mid T^{(k)}, \mathcal{T}^{-k}, S^{(k)}, \mathcal{S}^{-k}, Y^{(<k)}), \quad (1)$$
where $k$ is the index of the current sentence, $\mathcal{T}^{-k}$ (or $\mathcal{S}^{-k}$) represents all other sentences in $\mathcal{T}$ (or $\mathcal{S}$), and $Y^{(<k)}$ represents the translations ahead of the current sentence.
If the source document $\mathcal{S}$ is totally ignored in the repair, then the task can be viewed as monolingual DocRepair (Voita et al., 2019) and Eq. 1 can be simplified as
$$P(\mathcal{Y} \mid \mathcal{T}) = \prod_{k=1}^{K} P(Y^{(k)} \mid T^{(k)}, \mathcal{T}^{-k}, Y^{(<k)}), \quad (2)$$
which translates a document $\mathcal{T}$ in the target-side language into another document $\mathcal{Y}$ in the same language. However, totally ignoring source-side knowledge from $\mathcal{S}$ makes it hard for a monolingual DocRepair model to implicitly detect the inconsistencies inside $\mathcal{T}$. By only looking at the sentence-level NMT output in Figure 1, for example, it is hard to tell that sun yen-tzu and sun yanzi are inconsistent phrases. Therefore, we make use of the source-side document $\mathcal{S}$ to locate the inconsistencies in $\mathcal{T}$ (Section 2.2). For each inconsistent phrase, we provide a translation candidate list (Section 2.3), which is extracted from $\mathcal{T}$. Being aware of inconsistent phrases, we adapt G-Transformer (Bao et al., 2021) with lattice-like input (Lai et al., 2021) as our Doc2Doc DocRepair model (Section 2.4). Overall, in this paper we approximate the DocRepair probability as
$$P(\mathcal{Y} \mid \mathcal{T}, \mathcal{S}) \approx \prod_{k=1}^{K} P(Y^{(k)} \mid T^{(k)}, \mathcal{T}^{-k}, \mathrm{ctx}(\mathcal{S}, \mathcal{T}), Y^{(<k)}), \quad (3)$$
where $\mathrm{ctx}(\mathcal{S}, \mathcal{T})$ returns the inconsistent phrases in $\mathcal{T}$ and their respective candidate lists.

Locating Inconsistency in Translation
In translation $\mathcal{T}$, we say a phrase is inconsistent if its counterpart in the source side repeats two or more times in $\mathcal{S}$ and has different translations in $\mathcal{T}$. Given a source document $\mathcal{S}$, we follow Lyu et al. (2022) and extract $N$ lexical chains $\{C_i\}_{i=1}^{N}$, where $C_i = \{(a_l, b_l)\}_{l=1}^{L}$ records all positions of word $w_i$ repeated $L$ times ($L \geq 2$) in document $\mathcal{S}$, and $a$ and $b$ indicate the sentence index and word index of a position, respectively. Then we obtain $C_i$'s translation $CT_i = \{ct^i_l\}_{l=1}^{L}$ according to word alignment between sentence pairs in $(\mathcal{S}, \mathcal{T})$, where $ct^i_l$ could be a phrase. Therefore, if there exist two entries in $CT_i$ which are not consistent, we say source word $w_i$ is an inconsistency trigger and each $ct^i_l \in CT_i$ is an inconsistent phrase in translation $\mathcal{T}$. We traverse all lexical chains to obtain all inconsistent phrases in $\mathcal{T}$.
Taking the sentence-level NMT output in Figure 1 as an example, we extract a lexical chain for source word 孙燕姿/sun_yan_zi as it appears three times in the document. Then, according to the result of word alignment, we obtain its translation $CT$ = (sun yen-tzu, sun yanzi, sun yanzi). Since there exists an inconsistency between the phrases sun yen-tzu and sun yanzi, both sun yen-tzu and sun yanzi in the 1st, 13th, and 20th sentences are inconsistent phrases. Similarly, sandstorms and dust storms in the 13th and 17th sentences, and locals and local people in the 17th and 20th sentences, are inconsistent phrases, related to the source-side inconsistency triggers 沙尘暴/sha_chen_bao and 当地人/dang_di_ren, respectively.
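To make the procedure concrete, the following is a minimal Python sketch; the function name, data layout, and alignment format are our own illustrative assumptions rather than the paper's released code, and the alignments are assumed to come from an external word aligner such as that of Dou and Neubig (2021).

```python
from collections import defaultdict

def find_inconsistent_phrases(source_doc, translation_doc, alignments,
                              min_repeats=2):
    """Locate inconsistent phrases in a translated document.

    source_doc      : list of source sentences, each a list of tokens
    translation_doc : list of target sentences, each a list of tokens
    alignments      : per sentence, a dict mapping a source word index
                      to the list of aligned target word indices
    Returns {trigger_word: [(sent_idx, phrase), ...]} for every source
    word whose repeated occurrences receive differing translations.
    """
    # 1) Build lexical chains: all positions of each repeated source word
    #    (for brevity, stop-word and punctuation filtering is omitted).
    chains = defaultdict(list)  # word -> [(sent_idx, word_idx), ...]
    for a, sent in enumerate(source_doc):
        for b, word in enumerate(sent):
            chains[word].append((a, b))

    inconsistent = {}
    for word, positions in chains.items():
        if len(positions) < min_repeats:
            continue
        # 2) Collect the aligned target phrase for each occurrence.
        occurrences = []
        for a, b in positions:
            tgt_indices = sorted(alignments[a].get(b, []))
            phrase = " ".join(translation_doc[a][j] for j in tgt_indices)
            if phrase:
                occurrences.append((a, phrase))
        # 3) The word is an inconsistency trigger if its translations differ.
        if len({phrase for _, phrase in occurrences}) > 1:
            inconsistent[word] = occurrences
    return inconsistent
```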

Obtaining Candidates for Inconsistency
Once we have located the inconsistencies in translation $\mathcal{T}$, we further explicitly provide a candidate set of other possible translations in $\mathcal{T}$ for each inconsistency. Here we hope that the candidate set provides a resolution to the inconsistency.
If source word $w_i$ of the $i$-th lexical chain $C_i$ is an inconsistency trigger, we provide a translation candidate set from its translation $CT_i$. Each entry in the set is associated with a weight indicating the translation probability from $w_i$. In the sentence-level NMT output of Figure 1, the translation candidate set of inconsistency trigger 孙燕姿/sun_yan_zi is {sun yen-tzu: 1/3, sun yanzi: 2/3}, where 1/3 and 2/3 are the translation probabilities. Likewise, the translation candidate sets of 沙尘暴/sha_chen_bao and 当地人/dang_di_ren are {sandstorms: 1/2, dust storms: 1/2} and {locals: 1/2, local people: 1/2}, respectively.
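The candidate set thus follows directly from the relative frequencies of the differing translations; a small sketch, assuming the occurrence list produced when locating inconsistencies above:

```python
from collections import Counter

def candidate_set(occurrences):
    """Turn the differing translations of one trigger word into a
    weighted candidate set, e.g. {'sun yen-tzu': 1/3, 'sun yanzi': 2/3}.

    occurrences: [(sent_idx, phrase), ...] as returned above.
    """
    counts = Counter(phrase for _, phrase in occurrences)
    total = sum(counts.values())
    return {phrase: count / total for phrase, count in counts.items()}
```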

Sentence To Word Lattice
So far, we have provided target-side translation $\mathcal{T}$ with inconsistent phrases and their corresponding translation candidate sets. To make the DocRepair model aware of the inconsistencies and their potential resolutions, we follow Lai et al. (2021) and propose word lattice-like input for DocRepair.
As shown in the bottom-right corner of Figure 2, a word lattice is a directed acyclic graph, where the nodes are positions in the sentence and each directed edge represents a word. In particular, we replace inconsistent phrases with their corresponding candidate sets. As shown, the word lattice-like input consumes all entries in the candidate set and even the source-side trigger word, so that models can explicitly exploit the potential resolutions to the inconsistency. Words without consistency issues, such as experienced and rare in the figure, simply lie on the path from the beginning word [BOS] to the end word [EOS]. The challenges in modeling the lattice-like input include: 1) encoding the lattice tokens while preserving lattice structures (Lai et al., 2021); and 2) differentiating translation candidates of different quality. Next we present our solutions to the two challenges.
Token Lattice Position. We assign each node in the lattice graph a lattice position, whose value is its longest distance from the beginning word [BOS], i.e., the number of nodes in between. Then we set the position of a token as the position of its preceding node. For example, the position values for dust and storm are 14 and 15, respectively.
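Since the lattice is a DAG, the longest distance from [BOS] can be computed in a single dynamic-programming pass. The sketch below assumes nodes are numbered in topological order with node 0 as [BOS]; this representation is ours, not the paper's.

```python
def lattice_positions(num_nodes, edges):
    """Longest distance of every node from [BOS] (node 0), then the
    position of each token, taken from the node its edge leaves.

    edges: list of (src_node, dst_node, token) triples over a DAG whose
    nodes 0..num_nodes-1 are numbered in topological order.
    """
    dist = [0] * num_nodes
    for src, dst, _ in sorted(edges):  # ascending src = topological edge order
        dist[dst] = max(dist[dst], dist[src] + 1)
    # A token's lattice position is the position of its preceding node.
    return [(token, dist[src]) for src, dst, token in edges]
```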
Token Weight. We set token weights differently according to the type of token (a small sketch follows the list below).
• For tokens without consistency issues, we set their weight as 1.0.
• For tokens of source-side trigger words, like 孙燕姿/sun_yan_zi and 沙尘暴/sha_chen_bao, we also set their weight as 1.0.
• For tokens in candidate sets, we set their weight as the corresponding translation candidate's probability. For example, in the translation candidate set of the trigger word 孙燕姿/sun_yan_zi, {sun yen-tzu: 1/3, sun yanzi: 2/3}, we set the weight for tokens in sun yen-tzu as 1/3 and for tokens in sun yanzi as 2/3.
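A tiny sketch of the weight assignment; the token-type labels are illustrative.

```python
def assign_weights(tokens):
    """tokens: list of (token, kind, prob) where kind is 'plain',
    'trigger', or 'candidate'; prob is the candidate's translation
    probability (ignored for the other kinds)."""
    weights = []
    for _, kind, prob in tokens:
        if kind in ("plain", "trigger"):
            weights.append(1.0)   # rules 1 and 2: weight 1.0
        else:
            weights.append(prob)  # rule 3: the candidate's probability
    return weights
```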

DocRepair Model with Lattice-Like Input
As shown in the upper-right corner of Figure 2, we linearize a lattice graph into a sequence with the pre-computed lattice positions. The input to the encoder is
$$\mathrm{Input}(X) = \mathrm{WE}(X) \odot \mathrm{Weight}(X) + \mathrm{PE}(X), \quad (4)$$
where $X$ is the lattice-like input, $\mathrm{WE}(\cdot)$ and $\mathrm{PE}(\cdot)$ return the word embedding and sinusoidal positional embedding (over lattice positions), respectively, and $\mathrm{Weight}(\cdot)$ returns a weight vector for the tokens in $X$.
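As a sketch of how Eq. 4 might be realized: the exact way the token weight enters the embedding is our reading rather than something the paper confirms; here we scale the word embedding by the weight and add the sinusoidal embedding computed over lattice positions.

```python
import math
import torch

def lattice_input(tokens, lattice_pos, weights, word_emb, d_model=512):
    """Compose the encoder input for a linearized lattice (our reading
    of Eq. 4; the combination of weight and embedding is an assumption).

    tokens      : LongTensor of token ids for the linearized lattice X
    lattice_pos : LongTensor of lattice positions (not surface positions)
    weights     : FloatTensor of per-token weights
    word_emb    : torch.nn.Embedding over the vocabulary
    """
    def sinusoidal(positions, dim):
        # Standard sinusoidal embedding; dim assumed even.
        pe = torch.zeros(len(positions), dim)
        pos = positions.float().unsqueeze(1)
        div = torch.exp(torch.arange(0, dim, 2).float()
                        * (-math.log(10000.0) / dim))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        return pe

    # WE(X) scaled by Weight(X), plus PE over lattice positions.
    return word_emb(tokens) * weights.unsqueeze(1) + sinusoidal(lattice_pos, d_model)
```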

Training
The training consists of two stages: we first pre-train our Doc2Doc DocRepair model on pseudo document-level instances, and then fine-tune the pre-trained model on document-level instances.

Pre-training on Pseudo Doc2Doc Instances.
Due to the limited size of document-level parallel data, we make use of a sentence-level parallel dataset $\langle SL^{(S)}, SL^{(Y)} \rangle$. On the one hand, we translate the source sentences $SL^{(S)}$ with a sentence-level NMT model trained on the dataset and get automatic translations $SL^{(T)}$. On the other hand, we extract a phrase translation table after word alignment (Dou and Neubig, 2021) between sentence pairs in $\langle SL^{(S)}, SL^{(Y)} \rangle$. Given a sentence-level triple $(S, T, Y) \in \langle SL^{(S)}, SL^{(T)}, SL^{(Y)} \rangle$, where $S$ is the source-side sentence while $T$ and $Y$ are its automatic and reference translations, respectively, $(T, Y)$ is a sentence-level translation repair instance.
To construct the lattice-like input, we need to locate inconsistent phrases in $T$ and properly provide their candidate sets. Given a source sentence $S = \{s_i\}_{i=1}^{I}$ with $I$ words, we simply view word $s_i$ as an inconsistency trigger if it 1) is neither a stop word nor a high-frequency word, and 2) has two or more translations in the phrase translation table. Then for trigger $s_i$, we randomly select one, two, or three different translations from the phrase translation table and, together with $s_i$'s translation in $T$, construct its translation candidate set. Finally, we shuffle all $(T, Y)$ pairs and merge neighbouring pairs into document-level DocRepair instances with a maximum length of 512 on both input and output.
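A sketch of the trigger selection and candidate construction for pseudo instances; the data structures and names are illustrative, and the stop-word and frequency lists are assumed given.

```python
import random

def make_pseudo_candidates(src_sentence, observed, phrase_table,
                           stopwords, high_freq, max_extra=3):
    """Build translation candidate sets for one pseudo pre-training pair.

    observed     : {source_word: its aligned translation in T}
    phrase_table : {source_word: {translation: prob, ...}} extracted
                   from word-aligned sentence-level data
    Returns {trigger_word: candidate_list}.
    """
    candidates = {}
    for word in src_sentence:
        translations = phrase_table.get(word, {})
        # A trigger is neither a stop word nor high-frequency, and has
        # at least two possible translations in the phrase table.
        if word in stopwords or word in high_freq or len(translations) < 2:
            continue
        if word not in observed:
            continue
        others = [t for t in translations if t != observed[word]]
        if not others:
            continue
        # Randomly add 1-3 alternative translations to the observed one.
        k = random.randint(1, max_extra)
        extra = random.sample(others, min(k, len(others)))
        candidates[word] = [observed[word]] + extra
    return candidates
```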
Fine-Tuning on Doc2Doc Instances. In the fine-tuning stage, we only use a document-level parallel dataset $\langle DL^{(S)}, DL^{(Y)} \rangle$. Given a document-level parallel pair $(\mathcal{S}, \mathcal{Y})$, we get its automatic translation $\mathcal{T}$ from the above sentence-level NMT model. Then, for the document-level triple $(\mathcal{S}, \mathcal{T}, \mathcal{Y})$, we construct a Doc2Doc training instance as described in Section 2.

Reference-based Lexical Translation Consistency Metric
Lyu et al. (2021b) propose a metric to evaluate lexical translation consistency, named the lexical translation consistency ratio (LTCR), which is based on whether translations of repeated words are consistent. However, it does not take the reference into account and thus ignores the correctness of these translations. Therefore, we extend LTCR and propose ref-LTCR, which compares the consistency between automatic and reference translations.
Given a document-level triple $(\mathcal{S}, \mathcal{T}, \mathcal{Y})$, let us assume that source word $w$ appears $k$ times in $\mathcal{S}$. Based on the word alignment between $\mathcal{S}$ and $\mathcal{T}$, we can get its $k$ automatic translations, i.e., $(t_1, \cdots, t_k)$, where $t_i$ may consist of zero, one, or more words. Similarly, we can get its $k$ reference translations $(y_1, \cdots, y_k)$. For a pair of automatic translations $(t_i, t_j)$, the basic idea of ref-LTCR is that we encourage translation consistency between them only if their reference counterparts $(y_i, y_j)$ are consistent. Specifically, we define the precision and recall values for word $w$ as
$$P(w) = \frac{\sum_{i<j} \mathbb{1}(t_i = t_j \wedge y_i = y_j)}{\sum_{i<j} \mathbb{1}(t_i = t_j)}, \qquad R(w) = \frac{\sum_{i<j} \mathbb{1}(t_i = t_j \wedge y_i = y_j)}{\sum_{i<j} \mathbb{1}(y_i = y_j)},$$
where function $\mathbb{1}(condition)$ returns 1 if the condition is satisfied and 0 otherwise, and $t_i = t_j$ returns true if the two are consistent and false otherwise.
The above calculates ref-LTCR for a single word in a document. Likewise, we can apply the metric to all source words in a document-level parallel dataset by summing up all these words' corresponding numerators and denominators, respectively. After calculating the values of precision and recall, we report their F1 score, i.e., the harmonic mean of the two.
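The following sketch computes corpus-level ref-LTCR F1 following the definitions above; the data layout is illustrative.

```python
from itertools import combinations

def ref_ltcr(word_translations):
    """Corpus-level ref-LTCR F1.

    word_translations: for each repeated source word, a pair of lists
        (auto, ref) holding its k automatic and k reference translations.
    Numerators and denominators are summed over all words before
    computing precision, recall, and their harmonic mean.
    """
    match = p_den = r_den = 0
    for auto, ref in word_translations:
        for i, j in combinations(range(len(auto)), 2):
            t_consistent = auto[i] == auto[j]
            y_consistent = ref[i] == ref[j]
            match += t_consistent and y_consistent
            p_den += t_consistent
            r_den += y_consistent
    precision = match / p_den if p_den else 0.0
    recall = match / r_den if r_den else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```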
In brief, besides illustrating how frequently the translations of $w$ are consistent within a document, ref-LTCR also measures how similar the consistency is to that of the reference translation. The higher ref-LTCR is, the more likely $w$ is translated as in the reference. See Appendix A for the computation of ref-LTCR when there are multiple reference translations.

Experimentation
To verify the effectiveness of our proposed approach, we conduct experiments on three datasets with three language pairs, i.e., Chinese-to-English (ZH→EN), English-to-Chinese (EN→ZH) and German-to-English (DE→EN).

Experimental Setup
Datasets. For NIST (ZH↔EN), the pre-training data is from LDC and contains 2.0M sentence pairs. The document-level fine-tuning data is a subset of the pre-training set, including 66.4K documents with 0.83M sentence pairs. We use NIST 2006 as the development set and combine NIST 2002, 2003, 2004, 2005 and 2008 as the test set.
For PDC (ZH→EN), the document-level fine-tuning dataset is from Sun et al. (2022) and contains 10K documents with 1.39M sentence pairs. We combine these 1.39M sentence pairs with the above NIST (ZH→EN) 2.0M sentence pairs as the pre-training data.
For Europarl (DE→EN), the document-level fine-tuning training set and the development and test sets are from Maruf et al. (2019). We also use the sentence pairs from the fine-tuning training set as the pre-training data. See Appendix B for detailed statistics and preprocessing of the experimental datasets.

Model Settings. For the DocRepair models, we use G-Transformer (Bao et al., 2021), which enlarges the translation unit to a whole document, as the implementation of Transformer and extend it. See Appendix C for more details of the model settings.

Evaluation. To evaluate the overall repair performance, we report both sentence-level BLEU (s-BLEU) and document-level BLEU (d-BLEU) (Papineni et al., 2002). All BLEU scores are calculated by the multi-bleu.perl script and are case-insensitive. To evaluate lexical translation consistency, we report both LTCR (Lyu et al., 2021b) and ref-LTCR.

Baselines. We compare our DocRepair approach against three baselines.
• DocRepair (Transformer): We pre-train a vanilla Transformer on sentence-level translation repair instances from the same pre-training dataset and then fine-tune it on document-level translation repair instances. All the instances are without word lattice-like input. Since we may not be able to recover sentence-level repair results from the output, we only report the d-BLEU score for this baseline.
• DocRepair (G-Transformer): The pre-training and fine-tuning datasets are the same as for our approach, except that this baseline does not use word lattice-like input.

Experimental Results
In inference, the trained DocRepair models can repair translations from both sentence-level and document-level NMT models. We note that, over the baseline of DocRepair (G-Transformer), the averaged improvement our approach achieves in s-BLEU/d-BLEU is 0.48/0.40, which is much smaller than its improvement of 3.96/2.18 in LTCR/ref-LTCR. This is because BLEU is not sensitive to improvements in consistency in document-level translations. As shown in the case study (Appendix F), though our approach improves translation readability and achieves consistent translations for source words appearing multiple times, this has limited effect on BLEU.

Results of Repairing Document-level NMT Translation
Moving to translations of document-level NMT models, Table 3 compares the performance before and after repair for the four translation tasks. It shows that though document-level NMT achieves higher s-BLEU/d-BLEU than sentence-level NMT, it has very limited effect in terms of LTCR and ref-LTCR, except on Europarl (DE→EN). Based on the improved translations, our approach further significantly improves lexical translation consistency while slightly improving performance in BLEU.

Analysis
Next, we take NIST ZH→EN translation as a representative to discuss how our proposed approach improves performance.

Ablation Study
We further conduct an ablation study to investigate the contributions of the three components in our model: 1) token lattice positions; 2) source-side trigger words; and 3) token weights. From Table 4, we first observe that the token lattice position contributes most, as it is essential for preserving the lattice structure. Second, additionally including source-side trigger words is also helpful, as the DocRepair model can translate them under the document-level context.

Statistics about Inconsistency
In the fine-tuning dataset, each document has on average 10.89 inconsistent phrases and each sentence 0.87. These inconsistent phrases account for 9.19% of all tokens in the translation.
For inconsistent phrases, the number of translation candidates differs greatly. As shown in Table 5, about 98.71% of the words of interest have four or fewer candidates. This is why we randomly choose 2∼4 translation candidates for each inconsistency when pre-training models on pseudo Doc2Doc instances.

Effect of Different Pre-training Strategies
In the pre-training stage, we pre-train the model on a pseudo document-level dataset which originates from a large sentence-level parallel dataset. Table 6 compares the results of different pre-training strategies.

Human Evaluation
We randomly select 200 groups from the test set and conduct a human evaluation on them. Each group contains four consecutive source-side sentences and two translations of them, i.e., the sentence-level NMT output and its repaired version from our DocRepair model. The two translations are presented with no indication of which one is repaired. Following Voita et al. (2019) and Lyu et al. (2021b), the task is to choose one of three options: (1) the first translation is better, (2) the second translation is better, and (3) the translations are of equal quality. Two annotators are asked to avoid the third option if they are able to give preference to one of the translations.
Table 7 shows the results of the human evaluation. On average, the annotators mark 46% of the cases as having equal quality. Among the others, our approach outperforms Transformer in 65% of the cases, suggesting that overall the annotators have a strong preference for our repaired translations.

Related Work
The idea of "one translation per discourse" has been studied in both document-level translation and repair (i.e., post-editing).

Encouraging Lexical Translation Consistency in Translation. There exist many studies in MT that explicitly encourage lexical translation consistency. In statistical machine translation (SMT), for example, Gong et al. (2011) use a cache to store recent translations and Türe et al. (2012) design a few consistency features to improve translation consistency in document-level translation. Moving to NMT, both Kang et al. (2021) and Lyu et al. (2021b) perform corpus studies and observe that document-level translation of NMT suffers seriously from translation inconsistency. Lyu et al. (2021a) constrain repeated words in a document to have similar hidden states, thus encouraging their translations to be consistent. Both Kang et al. (2021) and Lyu et al. (2022) construct lexical chains which consist of repeated words in a document, and use different approaches to learn (or model) each chain's translation.
Encouraging Lexical Translation Consistency in Post-Editing. In SMT, Carpuat (2009), Xiao et al. (2011) and Garcia et al. (2014, 2017) propose different post-editing approaches to re-translate repeated source words which have been translated differently. Pu et al. (2017) aim to improve translation consistency for repeated nouns. They design a classifier to predict whether a pair of repeated nouns in a text should be translated by the same noun in the target language. Moving to NMT, to the best of our knowledge, this is the first work that explicitly focuses on document-level lexical translation consistency in post-editing. The most related work to ours is Voita et al. (2019), who propose a context-aware model that performs post-editing on four-sentence fragments of translations and corrects the inconsistencies among individual translations in context. Different from them, we extend the local context from four sentences to a whole document. More importantly, our DocRepair model is inconsistency-aware, with lattice-like input which consumes inconsistent translations.

Conclusion
In this paper, we have proposed an inconsistency-aware DocRepair approach to improve document-level translation consistency via automatic post-editing. We first locate inconsistencies in the translated text and provide translation candidates for each inconsistency. Then we use lattice-like input to properly model the inconsistencies and their candidates in a document-level repair model. Experimental results on three document-level translation datasets show that our approach not only achieves improvement in translation quality in BLEU, but also greatly improves lexical translation consistency.

C Model Settings
In both the pre-training and fine-tuning stages, we use an early-stopping strategy with a patience of 10 and choose the best checkpoint according to the validation loss. The whole training process takes approximately 40 hours. In inference, we set the beam size to 5.

D Details of Sentence-Level and Document-level NMT Models
For the sentence-level NMT model, we use G-Transformer (Bao et al., 2021) as the implementation of Transformer-base, with full mode, to generate sentence-level translations. The training datasets for the sentence-level NMT models are the same as the pre-training datasets in Table 8.
For the document-level NMT model, we also use G-Transformer, with partial mode, to generate document-level translations. We fine-tune the document-level NMT model from the sentence-level Transformer described above using document-level datasets, the same as the fine-tuning datasets in Table 8.
For both the sentence-level and document-level NMT models, we use the same parameter settings as in G-Transformer (Bao et al., 2021), with dropout of 0.3.

E Model Parameter
Table 9 shows the number of parameters used in our systems. Except for the system without trigger words, the parameters of the other systems are exactly the same. Adding trigger words increases the number of parameters since it introduces the source-side vocabulary. It is also feasible not to include trigger words (i.e., w/o tri.word) in practice, with a slight performance drop.

F Case Study
To better illustrate how our model improves lexical consistency, we provide an example from the NIST 2004 test set. As shown in Figure 3, the sentence-level NMT model translates repeated source-side words into different translations. For example, the person name 马兹尔/ma_zi_er is mapped to three different translations, i.e., marzir, marshall and mahathir, and DocRepair (G-Transformer) cannot fix such inconsistency. By contrast, our approach consistently repairs the translation of 马兹尔/ma_zi_er to marzir. Though not correct compared to the reference translation mazel, the translation marzir would not confuse readers. This also illustrates why BLEU is not sensitive to improvements in translation consistency.

Figure 1 :
Figure 1: An example of document-level Chinese-to-English translation from the test set NIST 2008, where the source words like 孙燕姿/sun_yan_zi, 沙尘暴/sha_chen_bao and 当地人/dang_di_ren are inconsistent in the sentence-level and document-level NMT systems but tend to be consistent in the reference.

Figure 2 :
Figure 2: Illustration of our proposed approach.
Different from Voita et al. (2019), who use a vanilla Transformer as the DocRepair model, we alternatively choose G-Transformer (Bao et al., 2021) as the base model. G-Transformer is a Doc2Doc translation model which views the source document and target document as long sequences. It uses combined attention, i.e., local attention and global attention, to both focus on the current sentence and extract contextual information from other sentences. More importantly, it can recover sentence-level translations from the long output. It achieves state-of-the-art performance in document-level translation. For more details, please refer to Bao et al. (2021).

Figure 3 :
Figure 3: An example of document-level Chinese-to-English translation from our test set.

Table 1 :
Experimental results on the test sets of NIST ZH→EN and EN→ZH translations when repairing sentence-level NMT translation.

Table 2 :
Experimental results on the test sets of PDC ZH→EN and Europarl DE→EN translations when repairing sentence-level NMT translation.

Table 3 :
Experimental results on the test sets when repairing document-level NMT translation.

Table 5 :
Number of translation candidates.

Table 6 :
Experimental results with different pre-training strategies.

Table 7 :
Human evaluation results on 200 sentence groups from our test set.

Table 9 :
Parameter (in millions) comparison of our different DocRepair systems.