When Does Translation Require Context? A Data-driven, Multilingual Exploration

Although proper handling of discourse significantly contributes to the quality of machine translation (MT), these improvements are not adequately measured in common translation quality metrics. Recent works in context-aware MT attempt to target a small set of discourse phenomena during evaluation, though not in a fully systematic way. In this paper, we develop the Multilingual Discourse-Aware (MuDA) benchmark, a series of taggers that identify and evaluate model performance on discourse phenomena in any given dataset. The choice of phenomena is inspired by a novel methodology to systematically identify translations that require context. This methodology confirms the difficulty of previously studied phenomena while uncovering others that were not previously addressed. We find that commonly studied context-aware MT models make only marginal improvements over context-agnostic models, which suggests these models do not handle these ambiguities effectively. We release code and data for 14 language pairs to encourage the MT community to focus on accurately capturing discourse phenomena. Code is available at https://github.com/neulab/contextual-mt.


Introduction
In machine translation (MT), information from previous utterances has been found crucial to adequately translate a number of discourse phenomena including anaphoric pronouns, lexical cohesion, and discourse markers (Guillou et al., 2018; Läubli et al., 2018; Toral et al., 2018). However, while generating proper translations of these phenomena is important, they represent only a small portion of the words in natural language data. Because of this, common metrics such as BLEU (Papineni et al., 2002) do not provide a clear picture of whether they are appropriately captured or not.
Recent work on neural machine translation (NMT) models attempts to incorporate extra-sentential context (Tiedemann and Scherrer, 2017).

Table 1: Some representative works on contextual machine translation that perform evaluation on discourse phenomena, contrasted to our work. For a more complete review see Maruf et al. (2021).
Our methodology uncovers phenomena that, to our knowledge, have not been addressed previously (e.g. consistency of verb forms), without requiring a-priori language-specific knowledge. Finally, we design a series of methods to automatically tag words belonging to the identified classes of ambiguities (§4), and we evaluate existing translation models on different categories of ambiguous translations (§5).
We perform our study on a parallel corpus spanning 14 language pairs, measuring translation ambiguity and model performance. We find that context-aware methods, while improving on standard evaluation metrics, only perform better than context-agnostic baselines for certain discourse phenomena in our benchmark, while on other phenomena context-aware models show no significant improvements. Our benchmark therefore provides a more fine-grained evaluation of translation models and reveals the weaknesses of context-aware models, such as verb form cohesion. We also find that DeepL, a commercial document-level translation system, does better in our benchmark than its sentence-level ablation and Google Translate. We hope that the released benchmark and code, as well as our findings, will spur targeted evaluation of discourse phenomena in MT to cover more languages and more phenomena in the future.
Measuring Context Usage

Cross-Mutual Information
While document-level MT models can be compared using standard translation metrics such as BLEU (Papineni et al., 2002), these metrics do not provide a clear picture of whether models are performing better due to improvements in processing context or to other improvements (Kim et al., 2019). Another common evaluation paradigm is contrastive evaluation, which evaluates contextual models' ability to distinguish between correct and incorrect translations of specific discourse phenomena, such as anaphora resolution (Müller et al., 2018) and lexical cohesion (Bawden et al., 2018). However, this provides only a limited measure of context usage on a limited set of ambiguous phenomena defined by the creators of the dataset, not capturing other unanticipated ways in which the model might need context (Vamvas and Sennrich, 2021). We are therefore interested in devising a metric that is able to capture all context usage by a model, beyond a predefined set.
Conditional Cross-Mutual Information (CXMI) (Bugliarello et al., 2020; Fernandes et al., 2021) measures the influence of context on model predictions. CXMI is defined as:

CXMI(C → Y | X) = H_{q_MT_A}(Y | X) − H_{q_MT_C}(Y | X, C),

where X and Y are a source and target sentence, respectively, C is the context, H_{q_MT_A} is the entropy under a context-agnostic MT model, and H_{q_MT_C} is the entropy under a context-aware MT model. This quantity can be estimated over a held-out set with N sentence pairs and the respective context as:

CXMI(C → Y | X) ≈ −(1/N) Σ_{i=1}^{N} log [ q_MT_A(y^(i) | x^(i)) / q_MT_C(y^(i) | x^(i), C^(i)) ].

Importantly, the authors find that training a single model q_MT as both the context-agnostic and the context-aware model ensures that non-zero CXMI values are due to context and not other factors (see Fernandes et al. (2021) and §3.1 for details).

Context Usage Per Sentence and Word

CXMI measures the context usage of a model by comparing the log-likelihood ratio of samples across the whole corpus. However, for our purposes, we are interested in measuring how much the context helps for single sentences or even just particular words in a sentence. Pointwise Mutual Information (P-MI) (Church and Hanks, 1990) measures the association between two random variables for specific outcomes. Mutual information can be seen as the expected value of P-MI over all possible outcomes of the variables. Taking inspiration from this, we define the Pointwise Cross-Mutual Information (P-CXMI) for a source, target, context triplet (x, y, C) as:

P-CXMI(x, y, C) = log [ q_MT_C(y | x, C) / q_MT_A(y | x) ].

Intuitively, P-CXMI measures how much more (or less) likely a target sentence y is when it is given context C, compared to not being given that context. Note that this is estimated according to the models q_MT_A and q_MT_C since, just like CXMI, this measure depends on their learned distributions.
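Concretely, given per-token log-probabilities from the two scoring passes of the shared model (with and without the prepended context), both quantities reduce to sums and averages of log-likelihood ratios. The following is a minimal sketch; the function names and toy log-probabilities are illustrative, not from the released codebase:

```python
def p_cxmi(logp_ctx, logp_noctx):
    """Sentence-level P-CXMI: log-likelihood of the target under the
    context-aware scoring pass minus its log-likelihood under the
    context-agnostic pass (same model, with and without context)."""
    return sum(logp_ctx) - sum(logp_noctx)

def cxmi(pairs):
    """Corpus-level CXMI estimate: the mean sentence-level P-CXMI over a
    held-out set of (logp_ctx, logp_noctx) pairs."""
    return sum(p_cxmi(c, a) for c, a in pairs) / len(pairs)

# Toy example: two sentences whose tokens become more likely with context.
pairs = [([-0.5, -1.0], [-1.0, -1.5]), ([-0.2, -0.3], [-0.2, -0.9])]
print(round(cxmi(pairs), 3))  # → 0.8
```

A positive value indicates the model assigns higher likelihood to the reference when the context is visible.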

We can also apply P-CXMI at the word level (as opposed to the sentence level) to measure how much more likely a particular word in a sentence is when it is given the context, by leveraging the autoregressive property of the neural decoder. Given the triplet (x, y, C) and the word index i, we can measure the P-CXMI for that particular word as:

P-CXMI(x, y, C, i) = log [ q_MT_C(y_i | y_{<i}, x, C) / q_MT_A(y_i | y_{<i}, x) ].

Note that nothing constrains the form of C or even x, and P-CXMI can, in principle, be applied to any conditional language modelling problem. To obtain the P-CXMI for words in the data, we train a small Transformer (Vaswani et al., 2017) model for every target language and incorporate the target context by concatenating it to the current target sentence (Tiedemann and Scherrer, 2017).

Avelile's mother had HIV virus. Avelile had the virus, she was born with the virus.
Louis XIV had a lot of people working for him. They made his silly outfits, like this.
They're the ones who know what society is going to be like in another generation. I don't.

Table 2: Examples of high P-CXMI tokens and corresponding linguistic phenomena. Contextual sentences are italicized. The high P-CXMI target token is highlighted in pink; source and contextual target tokens related to the high P-CXMI token are highlighted in blue and green respectively.
We train the model with dynamic context size (Fernandes et al., 2021) by sampling between 0 and 3 target context sentences, and estimate P-CXMI by using this model as both q_MT_A and q_MT_C (more training details in Appendix D).
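The concatenation scheme can be sketched as follows; the `<brk>` separator and the helper name are assumptions for illustration (the actual separator token is an implementation detail of the released code):

```python
import random

BREAK = "<brk>"  # illustrative separator token between context sentences

def with_dynamic_context(doc_sents, i, max_ctx=3, rng=random):
    """Concatenate a randomly sized window of previous target sentences
    (0 to max_ctx) to sentence i, separated by a break token, mirroring
    dynamic-context training (Fernandes et al., 2021)."""
    k = rng.randint(0, min(max_ctx, i))  # sample a context size
    ctx = doc_sents[i - k:i]             # the k preceding sentences
    return f" {BREAK} ".join(ctx + [doc_sents[i]])

doc = ["Hallo .", "Wie geht es dir ?", "Mir geht es gut ."]
print(with_dynamic_context(doc, 2, rng=random.Random(0)))
```

Because the context size varies during training, the same model can later score a sentence both with and without context, as required for P-CXMI.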

Analysis Procedure
We adopt a top-down approach and start our analysis by studying POS tags with high mean P-CXMI.
In Appendix B, we report the mean P-CXMI for selected POS tags on our test data. Some types of ambiguity, such as dual form pronouns (§3.3), can be linked to a single POS tag and be identified at this step, whereas others require finer inspection.
Next, we inspect the vocabulary items with high mean P-CXMI. At this step, we can detect phenomena that are reflected by certain lexical items that consistently benefit from context for translation.
Finally, we examine individual tokens that obtain the highest P-CXMI. In doing so, we identify patterns that do not depend on lexical features but rather on, for example, syntactic constructions.
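The first two steps of this top-down analysis amount to grouping per-token P-CXMI values and ranking the group means. A minimal sketch (the helper name and toy rows are illustrative):

```python
from collections import defaultdict

def mean_pcxmi_by(key_idx, rows):
    """Group (token, pos_tag, p_cxmi) rows by the chosen key column
    (1 = POS tag, 0 = vocabulary item) and return the mean P-CXMI per
    group, highest first."""
    sums, counts = defaultdict(float), defaultdict(int)
    for row in rows:
        sums[row[key_idx]] += row[2]
        counts[row[key_idx]] += 1
    means = {k: sums[k] / counts[k] for k in sums}
    return sorted(means.items(), key=lambda kv: -kv[1])

rows = [("du", "PRON", 0.9), ("Sie", "PRON", 0.7), ("Haus", "NOUN", 0.1)]
print(mean_pcxmi_by(1, rows))  # PRON ranks above NOUN
```

Running the same aggregation with `key_idx=0` implements the second step (ranking vocabulary items), and sorting individual rows by `p_cxmi` implements the third.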
In Table 2, we provide selected examples of tokens that have high P-CXMI and the discourse phenomenon we have identified from them.

Identified Phenomena
Through our thematic analysis of P-CXMI, we identified various types of translation ambiguity. Unlike previous work, our method requires no prior knowledge of the languages to find relevant discourse phenomena, and it easily scales to new languages (§4.4).
First, we find high P-CXMI for second-person pronouns (PRON.2) in languages with a T-V distinction (Appendix A, "Pronouns Politeness"). While English uses the same second-person pronouns for everyone, in these languages certain pronouns depend on the level of formality and the relationship between the speaker and addressee. Furthermore, languages such as Japanese and Korean use honorifics to indicate formality. In Japanese, vocabulary items such as "ござい" / "じゃ" that control formality have high mean P-CXMI (0.42 / 0.34).
In English, only the 3rd person singular pronoun is gendered, and gender is assigned based solely on semantic rules (Appendix A, "Gendered Pronouns", "Gender Assignment"). We find high P-CXMI on pronouns (PRON) in several languages that use gendered pronouns beyond the 3rd person singular or assign gender using formal rules (German, French, Hebrew, Italian, Portuguese, Russian, and Chinese).
When translating a gender-neutral English pronoun to a gendered target pronoun, context is therefore needed to determine the gender of the antecedent.
We find high P-CXMI for certain verb forms, such as the imperfect form in Spanish, Italian, and Romanian (VERB.Imp). While English verbs may have five forms (e.g. write, writes, wrote, written, writing), other languages often have a more fine-grained verb morphology. For example, English has only a single form for the past tense, while the Spanish past tense consists of six verb forms. Verbs must be translated using the verb form that reflects the tone, mood, and cohesion of the document.
When we inspect vocabulary items with the highest mean P-CXMI scores, we often find names of entities (e.g. the Japanese translation of Mandela, "マンデラ", has a mean P-CXMI of 0.36). As in the first row of Table 2, proper nouns may have multiple possible translations, but the same entity should be referred to by the same word throughout a translated document for lexical cohesion (Carpuat, 2009).
Finally, among the individual tokens with the highest P-CXMI, we find that many are due to ellipsis in the source text (see the last row of Table 2).

Automatic Tagging

In this section, we describe our taggers for each discourse phenomenon we identified. In doing so, we create more reliable and informative taggers for these phenomena.

Verb Form. For each target language, we define a set of ambiguous verb forms V, where v_i ∈ V if there exists a verb form u_j in English and an alternate verb form v_k ≠ v_i in the target language such that an English verb with form u_j may be translated to a target verb with form v_i or v_k depending on the context. Then, for each target token y_j, if y_j is a verb of form v_j ∈ V, and another verb with form v_j has appeared previously in the same document, we tag y_j as ambiguous.

Ellipsis. (See Appendix C.4 for the ellipsis classifier.)
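The verb-form rule above can be sketched as follows, assuming tokens arrive as (token, POS, verb-form) triples, e.g. from a morphological tagger; the function name and toy Spanish example are illustrative:

```python
def tag_ambiguous_verbs(doc, ambiguous_forms):
    """doc: list of (token, upos, verb_form) triples in document order.
    A verb is tagged when its form is in the ambiguous set V and the same
    form already occurred on an earlier verb in the document, so context
    is needed to translate it cohesively."""
    seen, tags = set(), []
    for token, upos, form in doc:
        is_verb = upos == "VERB" and form is not None
        tags.append(is_verb and form in ambiguous_forms and form in seen)
        if is_verb:
            seen.add(form)
    return tags

doc = [("comia", "VERB", "Imp"), ("pan", "NOUN", None), ("bebia", "VERB", "Imp")]
print(tag_ambiguous_verbs(doc, {"Imp"}))  # [False, False, True]
```

Only the second imperfect verb is tagged: its form repeats an earlier occurrence, so a context-aware model should keep the forms consistent.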

Evaluation of Automatic Tags
We apply the MuDA tagger to the reference translations of our TED talk data. We thus obtain an evaluation set of 3,385 parallel sentences for each of the 14 language pairs. In Appendix B, we report the mean P-CXMI for each language and MuDA tag. Overall, we find higher P-CXMI on tokens with a tag compared to those without, which provides empirical evidence that models indeed rely on context to predict words with MuDA tags.

Trained Models
We train a sentence-level and a document-level concatenation-based small transformer (base) for every target language. While conceptually simple, concatenation approaches have been shown to outperform more complex models when properly trained (Lopes et al., 2020). For the context-aware model, the major difference from §3.1 is that we use a static context size of 3, since we are not using these models to measure P-CXMI.
To evaluate stronger models, we additionally train a large transformer model (large), pretrained on a large sentence-level corpus, for German, French, Japanese, and Chinese. Further training details can be found in Appendix D.

Commercial Models
To assess whether commercially available machine translation systems capture context, we also evaluate Google Translate and DeepL.
Table 5 shows the results for base models, trained either without context (no-context) or with context, and for the latter with either predicted context (context) or reference context (context-gold) during decoding. Results are reported with respect to standard MT metrics such as BLEU (Papineni et al., 2002) and COMET (Rei et al., 2020), as well as the MuDA benchmark.

First, we find that BLEU scores are highest for context-gold models for most language pairs, but context-agnostic models have higher COMET scores. Moreover, in terms of mean word f-measure overall, we do not find significant differences between the three systems. It is therefore difficult to see which system performs best on document-level ambiguities using only corpus-level metrics.
For words tagged by MuDA as requiring context for translation, context-aware models often achieve higher word f-measure than context-agnostic models on certain tags such as ellipsis and formality, but on other tags such as lexical and verb form, they do not significantly outperform the context-agnostic models. This demonstrates how MuDA allows us to identify which kinds of inter-sentential ambiguities context-aware models are able to resolve.
For the pretrained large models (Table 6), context-aware models perform better than context-agnostic ones on corpus-level metrics, especially COMET. On words tagged with MuDA, context-aware models generally obtain the highest f-measure as well, particularly when given reference context and on phenomena such as lexical and pronouns, but the improvements are less pronounced than on corpus-level evaluation.
Among commercial engines (Table 7), DeepL seems to outperform Google on most metrics and language pairs. Also, the sentence-level ablation of DeepL performs worse than its document-level system for most MuDA tags, which further suggests that DeepL is able to process context to some extent.
Overall, current context-aware MT systems seem to translate some inter-sentential discourse phenomena well, but they are still unable to consistently obtain improvements across all of them.

Conclusions and Future Work
In this work, we investigate the types of ambiguous translations where MT models benefit from context using our proposed P-CXMI metric. Our data-driven thematic analysis helps us identify context-sensitive discourse phenomena, some of which (such as verb forms) have not been addressed in prior work on context-aware MT, for 14 language pairs. The advantage of our approach is that it is systematic and does not require a-priori language-specific knowledge to identify these phenomena, so we believe that our methodology can be easily extended to other language pairs. P-CXMI can also be used to identify types of context-dependent words for tasks outside MT. Based on our findings, we then construct the MuDA benchmark, which tags words in a given parallel corpus and evaluates models on 5 context-dependent discourse phenomena.
We find that ellipsis is the most challenging phenomenon to tag with high precision, and we leave improvements to modeling cross-lingual ellipsis for future work.
Our evaluation using MuDA reveals that both context-aware and commercial translation systems achieve only small improvements over context-agnostic models on many discourse-aware translations, and we encourage using MuDA to benchmark the development of models that address these ambiguities.

A Language Properties
Table 8 summarizes the properties of the languages analyzed in this work.

B P-CXMI Results
Table 9 presents the average P-CXMI value per POS tag and per MuDA tag.

C.1 Formality Words
Table 10 gives the list of words related to formality for each target language.

C.2 Ambiguous Pronouns
Table 11 provides English pronouns and the list of possible target pronouns.

C.3 Ambiguous Verbs
Table 12 lists verb forms that may require disambiguation during translation.

C.4 Ellipsis Classifier
We train a BERT text classification model (Devlin et al., 2019) on data from the Penn Treebank, where we labeled each sentence containing the tag '*?*' as containing ellipsis (Bies et al., 1995). We obtain 248,596 sentences in total, with 2,863 tagged as ellipsis. We then train our model using HuggingFace Transformers (Wolf et al., 2020). To address the imbalance in labels, we up-weight the loss for samples tagged as ellipsis by a factor of 100.
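The label-extraction and up-weighting steps can be sketched as follows; the helper names and toy parses are illustrative, and in practice the weights would multiply the per-example classification loss:

```python
ELLIPSIS_TRACE = "*?*"  # Penn Treebank placeholder for elided material

def label_ellipsis(parsed_sentences):
    """Label each bracketed PTB parse string 1 if it contains the *?*
    trace (ellipsis) and 0 otherwise; these labels supervise the
    sentence-level ellipsis classifier."""
    return [int(ELLIPSIS_TRACE in s) for s in parsed_sentences]

def sample_weights(labels, upweight=100):
    """Per-example loss weights: the rare ellipsis positives are
    up-weighted (by 100 in our setup) to counter the roughly 1:86
    class imbalance (2,863 of 248,596 sentences)."""
    return [upweight if y == 1 else 1 for y in labels]

parses = ["(S (NP I) (VP (V know)))", "(S (NP I) (VP (V do) (VP *?*)))"]
labels = label_ellipsis(parses)
print(labels, sample_weights(labels))  # [0, 1] [1, 100]
```
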

D Training details
The transformer-small model has a hidden size of 512, feedforward size of 1024, 6 layers, and 8 attention heads. The transformer-large model has a hidden size of 1024, feedforward size of 4096, 6 layers, and 16 attention heads.
As in Vaswani et al. (2017), we train using the Adam optimizer with β1 = 0.9 and β2 = 0.98 and use an inverse square root learning rate scheduler, with an initial value of 10^-4 for the large model and 5 × 10^-4 for the base and multi models, with a linear warm-up in the first 4000 steps.
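One way to express this schedule, assuming the stated values are the peak rates reached at the end of the linear warm-up (the function name is illustrative):

```python
def inverse_sqrt_lr(step, peak_lr=1e-4, warmup_steps=4000):
    """Linear warm-up to peak_lr over the first warmup_steps updates,
    then decay proportional to the inverse square root of the step,
    as in Vaswani et al. (2017)."""
    step = max(step, 1)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (warmup_steps / step) ** 0.5

print(inverse_sqrt_lr(2000))   # mid warm-up: half of peak
print(inverse_sqrt_lr(16000))  # after warm-up: peak / 2
```
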
For the pretrained models we used Paracrawl (Esplà et al., 2019) for German and French, JParacrawl (Morishita et al., 2020) for Japanese and the Backtranslated News from WMT2021 for Chinese.

Using this metric, we now ask: what kind of words tend to see their likelihood increase when given the context? Such words should have a high P-CXMI.

Data. We perform our study on TED talks, translating from English into 14 target languages, among them Turkish and Mandarin Chinese. These 14 target languages are chosen for their high availability of TED talks and linguistic tools, as well as for the diversity of language types in our comparative study (Table 8 in Appendix A). For each language pair, our dataset contains 113,711 parallel training sentences from 1,368 talks, 2,678 development sentences from 41 talks, and 3,385 testing sentences from 43 talks.

Exhaustively listing all relevant phenomena in document-level MT is extremely complex and beyond the scope of our paper. To identify new discourse phenomena in other languages, our thematic analysis can be reused as follows: (1) train a model with dynamic context size on translation between the new language pair; (2) use the model to compute P-CXMI for words in a parallel document-level corpus of the language pair; (3) manually analyze the POS tags, vocabulary items, and individual tokens with high P-CXMI; (4) link patterns of tokens with high P-CXMI to particular discourse phenomena by consulting linguistic resources.

Exploring Context-aware MT

Next, we use our MuDA benchmark to perform an initial exploration of context usage across 14 language pairs and 4 models, including models we trained ourselves and commercial systems.


Prior work creates a dataset for anaphora resolution, deixis, ellipsis, and lexical cohesion in EN → RU. However, Yin et al. (2021) suggest that the tasks of translating and disambiguating between two contrastive choices are inherently different, which motivates our approach of measuring direct translation performance through word f-measure.

Table 3: Number of MuDA tags on TED test data.

As shown in the last row of Table 2, the English text does not repeat the verb "know" in the second sentence, as it can be understood from the previous sentence. However, in Turkish, there is no natural way to translate the verb-phrase ellipsis, and a model must infer that "don't" refers to "don't know" and translate the verb accordingly.

Although this procedure may tend to find phenomena that are intuitive to the annotators, the data-driven approach makes confirmation bias less severe than in prior works relying on introspection to identify phenomena. Hence, our procedure can allow us to discover relevant phenomena that have not been previously addressed, such as verb forms.

Cross-phenomenon MT Evaluation

After identifying a set of linguistic phenomena where context is useful to resolve ambiguity during translation, we develop a series of methods to automatically tag tokens belonging to these classes of ambiguous translations, and we propose the Multilingual Discourse-Aware (MuDA) benchmark for context-aware MT models.

MT Evaluation Framework

Given a pair of parallel source and target documents (X, Y), our MuDA tagger assigns a set of discourse phenomena tags {t_i^1, ..., t_i^n} to each target token y_i ∈ Y. Then, using the compare-mt toolkit, we compute the word f-measure for tokens carrying each tag, which lets us assess whether models translate them more or less accurately.
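A token-level f-measure restricted to tagged tokens can be sketched as follows; compare-mt's exact bucketing may differ, and the function name and toy German tokens are illustrative:

```python
def word_fmeasure(hyp_tokens, ref_tokens, tagged=None):
    """Word f-measure: harmonic mean of token precision and recall between
    a hypothesis and a reference, optionally restricted to the set of
    tokens that carry a given MuDA tag."""
    ref = [t for t in ref_tokens if tagged is None or t in tagged]
    hyp = [t for t in hyp_tokens if tagged is None or t in tagged]
    if not ref or not hyp:
        return 0.0
    # Clipped token overlap, as in bag-of-words precision/recall.
    overlap = sum(min(hyp.count(t), ref.count(t)) for t in set(ref))
    p, r = overlap / len(hyp), overlap / len(ref)
    return 2 * p * r / (p + r) if p + r else 0.0

ref = ["sie", "haben", "recht"]
hyp = ["du", "hast", "recht"]
print(word_fmeasure(hyp, ref))                        # partial overlap
print(word_fmeasure(hyp, ref, tagged={"sie", "du"}))  # formality token missed
```

Restricting the score to tagged tokens is what exposes failures (here, a wrong formality pronoun) that the overall f-measure dilutes.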

As shown in Table 3, not all language pairs contain enough examples of ellipsis. Further, languages from a different family than English have a relatively high number of ellipsis tags. Korean and especially Japanese have more formality tags than languages with a T-V distinction, which is aligned with their use of honorifics.

Table 4: Precision of MuDA tags on 50 utterances.

To validate the taggers, annotators checked 50 tagged utterances per language; we paid them $20/hour. This allows us to measure how many automatic tags violate the given definition of the linguistic tag. Table 4 reports the tags' precision.

For all languages, we obtain high precision for all tags except ellipsis, confirming that the methodology can scale to languages where no native speakers were involved in developing the tags. For ellipsis, false positives often come from one-to-many or non-literal translations, where the aligner does not align all target words to the corresponding source word. We believe that the ellipsis tagger is still useful in selecting difficult examples that require context for translation; despite the low precision, we find a significantly higher P-CXMI on ellipsis words for many languages (Appendix B).

Extension to New Languages

While MuDA currently supports 14 language pairs, our methodology can be easily extended to new languages. The lexical and ellipsis tags can be directly applied to new language pairs.

Table 5: BLEU, COMET, and word f-measure per tag for base context-aware models. BLEU, COMET, and word f-measures statistically significantly higher than no-context (p < 0.05) are underlined.
We feed entire documents into an API request. To re-segment the translation into sentences, we include special marker tokens in the source that are preserved during translation and split the translation on those tokens. We also evaluate a sentence-level version of DeepL, where we feed each sentence separately, to compare with its document-level counterpart.

Table 6: Word f-measure per tag for large models. BLEU, COMET, and word f-measures statistically significantly higher than no-context (p < 0.05) are underlined.
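The marker-token trick for re-segmenting a document-level API response can be sketched as follows; the `<SEG>` marker and function names are assumptions for illustration (the actual marker token is not specified here):

```python
MARKER = " <SEG> "  # illustrative marker preserved by the translation engine

def join_document(sentences):
    """Concatenate a document's sentences with a marker token so the whole
    document can be sent as one translation request."""
    return MARKER.join(sentences)

def split_translation(translated):
    """Recover sentence boundaries by splitting on the marker, assuming
    the engine preserved it during translation."""
    return [s.strip() for s in translated.split(MARKER.strip())]

doc = ["How are you?", "I am fine."]
request = join_document(doc)
# Suppose the engine returns the markers intact around each translated sentence:
response = "Wie geht es dir? <SEG> Mir geht es gut."
print(split_translation(response))  # ['Wie geht es dir?', 'Mir geht es gut.']
```

The scheme only works when the engine copies the marker through verbatim, which is why a rarely translated token is chosen.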

Table 8: Properties of the languages in our study.

Table 10: Words related to formality for each target language.

Table 11: Ambiguous pronouns w.r.t. English for each target language.

Table 12: Ambiguous verb forms w.r.t. English for each target language.