Divide and Rule: Effective Pre-Training for Context-Aware Multi-Encoder Translation Models

Multi-encoder models are a broad family of context-aware neural machine translation systems that aim to improve translation quality by encoding document-level contextual information alongside the current sentence. The context encoding is undertaken by contextual parameters, which are trained on document-level data. In this work, we discuss the difficulty of training these parameters effectively, due to the sparsity of the words in need of context (i.e., the training signal) and of their relevant context. We propose to pre-train the contextual parameters over split sentence pairs, which makes efficient use of the available data for two reasons. Firstly, it increases the contextual training signal by breaking intra-sentential syntactic relations, thus pushing the model to search the context for disambiguating clues more frequently. Secondly, it eases the retrieval of relevant context, since context segments become shorter. We propose four different splitting methods, and we evaluate our approach with BLEU and contrastive test sets. Results show that our approach consistently improves the learning of contextual parameters, both in low- and high-resource settings.

with sentence length, although some recent works try to lift this constraint (Tay et al., 2020).

then $C^D$ keeps the same document boundaries as $C$. Figure 1 illustrates two examples of parallel sentences that are split in the middle. In both examples, a context-aware system needs to look at $S_{i,1}$ to translate $S_{i,2}$ correctly, i.e., it needs to look at past context. In the first example, the English neuter pronoun "it" can be translated into "il" or "elle", according to the gender of its antecedent (there is no singular neuter third person in French). The antecedent "a project", which is in the previous segment, allows the model to disambiguate it into "il". In the second example, the adjective "possible" can be correctly translated into its plural form "possibles" by looking back at the noun it refers to: "organisms".
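As a concrete illustration, the middle-split construction of $C^D$ can be sketched as follows (a minimal Python sketch; the function names and the token-list representation are our own assumptions, not the authors' implementation):

```python
def middle_split(src, tgt):
    """Split a parallel sentence pair in the middle of each side.

    src, tgt: token lists. Each side is cut at its own midpoint, so the
    two resulting segment pairs remain roughly parallel.
    """
    s_mid, t_mid = len(src) // 2, len(tgt) // 2
    return [(src[:s_mid], tgt[:t_mid]), (src[s_mid:], tgt[t_mid:])]


def split_corpus(documents):
    """Build the split corpus C^D from C, preserving document boundaries.

    documents: list of documents, each a list of (src, tgt) pairs.
    Each sentence pair is replaced by its two consecutive segment pairs
    within the same document, so that S_{i,2} can attend to S_{i,1}.
    """
    return [[pair for src, tgt in doc for pair in middle_split(src, tgt)]
            for doc in documents]
```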

Following this method, it can happen that $S_{i,j}$ and $T_{i,j}$, with $j = 1, 2$, are not parallel, as illustrated in the second example of Figure 1. The verb "are" belongs to $S_{i,1}$, but its translation "sont" does not belong to the corresponding reference segment $T_{i,1}$.
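Such mismatches can be detected with word alignments: a split is non-parallel whenever an alignment link crosses the segment boundary. Below is a minimal sketch (a hypothetical helper of ours; we assume alignment links are given as 0-based (source, target) index pairs):

```python
def crossing_links(alignment, s_mid, t_mid):
    """Return the alignment links whose endpoints fall on different sides
    of the split points s_mid (source side) and t_mid (target side).

    In the "are"/"sont" example of Figure 1, the link for "are" has
    src_idx < s_mid but tgt_idx >= t_mid, so it is reported as crossing.
    """
    return [(s, t) for s, t in alignment if (s < s_mid) != (t < t_mid)]
```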

This problem arises whenever the splitting separates a set of words from their reference translation, which ends up in the other segment. Clearly, this method requires that the two languages do not have a strong syntactic divergence, in order to avoid too large mismatches between $S_{i,j}$ and $T_{i,j}$, with $j = 1, 2$.

Multi-split. The aforementioned methods can be extended to splitting sentences into more than two segments. The more we split sentences, the more likely it is that context is needed for each segment, hence increasing the training signal for the contextual parameters.
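A sketch of this multi-split generalization, under the same assumptions as above (the exact segmentation policy, e.g. the length thresholds, belongs to the authors' implementation and is not reproduced here):

```python
def multi_split(src, tgt, n_segments):
    """Split a parallel pair into n_segments consecutive segment pairs.

    Each side is cut into n_segments contiguous chunks of (almost) equal
    token length; the last chunk absorbs the remainder. Very short
    sentences can yield empty chunks, which a real implementation would
    avoid with a minimum-length threshold.
    """
    def chunks(tokens):
        k = max(1, len(tokens) // n_segments)
        cuts = [i * k for i in range(n_segments)] + [len(tokens)]
        return [tokens[cuts[i]:cuts[i + 1]] for i in range(n_segments)]

    return list(zip(chunks(src), chunks(tgt)))
```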

For more details, we refer to Section 6.3, to Appendix A, and to our code (to be open-sourced).

The resulting size of the two training settings after pre-processing is reported in …

Table 3: Comparison of the accuracy of context-aware pronoun translation (ContraPro) by d&r pre-trained models with the middle-split method (first column) and with the other proposed methods (relative difference). *: p < 0.01, **: p < 0.05.
…by antecedent distance, and an ablation study in which we test models on ContraPro with inconsistent context.

For the multi-split method, we split sentence pairs in half for $\text{len}(S_i) \geq 7$, and also in three segments …

13 A detailed comparison between single and multi-encoder models is beyond the scope of this work.
14 More sophisticated synt-split methods could be devised, targeting other discourse phenomena, or several of them at the same time, with different degrees of priority.

Table 4: Number of coreference antecedents at a given distance d from the mention in the current sentence, for both the original and the split En→Fr IWSLT17 data. In brackets, the same figure normalized by the average number of tokens that the model has to attend to resolve the coreference (#tokens). At the bottom, the number of sentences for which at least one syntactic dependency is split across two segments when using the split data.
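The statistic in Table 4 can be reproduced along these lines (a sketch under our own assumptions about the input format; the paper's counting script is not shown):

```python
from collections import defaultdict

def antecedent_histogram(coref_pairs, seg_lengths):
    """Count antecedents at each distance d, normalized by attended tokens.

    coref_pairs: (mention_seg, antecedent_seg) segment-index pairs.
    seg_lengths: number of tokens in each segment, in document order.
    Returns {d: (count, count / avg_tokens)}, where avg_tokens is the
    average number of tokens the model has to attend to resolve the
    coreference, i.e. the span from the antecedent's segment up to and
    including the mention's segment.
    """
    counts, attended = defaultdict(int), defaultdict(list)
    for mention, antecedent in coref_pairs:
        d = mention - antecedent
        counts[d] += 1
        attended[d].append(sum(seg_lengths[antecedent:mention + 1]))
    return {d: (counts[d], counts[d] / (sum(attended[d]) / len(attended[d])))
            for d in counts}
```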

We provide here some extra details on the splitting methods that have been proposed and tested. For full details, we refer to our implementation. … checks that:

Table 6: Accuracy (%) of Low Res models on ContraPro En→De by pronoun antecedent distance. The first column represents the weighted average, calculated on the basis of the sample size of each group.

[Figure 4 plot; y-axis: Occurrences / # of tokens; series: antecedents in orig. data (En-Fr, En-Ru) and in split data (En-Fr, En-Ru).]
Figure 4: En-Fr IWSLT vs. Low Res En-Ru OpenSubtitles2018: comparison of the number of antecedents of anaphoric pronouns at a given distance in terms of sentences/segments, normalized by the number of tokens that the model needs to attend to resolve the coreference. Since sentences are much shorter in the En-Ru data (8.32 vs. 21.02 tokens on average), the density of discourse phenomena within the sentence is much higher.

…der). Tomlin (2014) estimates that more than 40% …

…all models ($P_{len} < 1$ favors shorter sentences), …