Do Context-Aware Translation Models Pay the Right Attention?

Context-aware machine translation models are designed to leverage contextual information, but often fail to do so. As a result, they inaccurately disambiguate pronouns and polysemous words that require context for resolution. In this paper, we ask several questions: What contexts do human translators use to resolve ambiguous words? Are models paying large amounts of attention to the same context? What if we explicitly train them to do so? To answer these questions, we introduce SCAT (Supporting Context for Ambiguous Translations), a new English-French dataset comprising supporting context words for 14K translations that professional translators found useful for pronoun disambiguation. Using SCAT, we perform an in-depth analysis of the context used to disambiguate, examining positional and lexical characteristics of the supporting words. Furthermore, we measure the degree of alignment between the model’s attention scores and the supporting context from SCAT, and apply a guided attention strategy to encourage agreement between the two.


Introduction
There is a growing consensus in machine translation research that it is necessary to move beyond sentence-level translation and incorporate document-level context (Guillou et al., 2018; Läubli et al., 2018; Toral et al., 2018). While various methods to incorporate context in neural machine translation (NMT) have been proposed (Tiedemann and Scherrer, 2017; Miculicich et al., 2018; Maruf and Haffari, 2018, inter alia), it is unclear whether models rely on the "right" context that is actually sufficient to disambiguate difficult translations. Even when additional context is provided, models often perform poorly on evaluation of relatively simple discourse phenomena (Müller et al., 2018; Bawden et al., 2018; Voita et al., 2019b,a; Lopes et al., 2020) and rely on spurious word co-occurrences during translation of polysemous words (Emelin et al., 2020). Some evidence suggests that models attend to uninformative tokens and do not use contextual information adequately (Kim et al., 2019).
Table 1: Translation of the ambiguous pronoun "it". In French, if the referent of "it" is masculine (e.g., report) then "il" is used, otherwise "elle". The model with regularized attention translates the pronoun correctly, with the largest attention on the referent "report". Top 3 words with the highest attention are highlighted.
To understand why current NMT models are unable to fully leverage the disambiguating context they are provided, and how we can develop models that use context more effectively, we pose the following research questions: (i) In context-aware translation, what context is intrinsically useful to disambiguate hard translation phenomena such as ambiguous pronouns or word senses? (ii) Are context-aware MT models paying attention to the relevant context? (iii) If not, can we encourage them to do so?
To answer the first question, we collect annotations of context that human translators found useful in choosing between ambiguous translation options (§3). Specifically, we ask 20 professional translators to choose the correct French translation between two contrastive translations of an ambiguous word, given an English source sentence and the previous source- and target-side sentences. The translators additionally highlight the words they found the most useful to make their decision, giving an idea of the context useful in making these decisions. We collect 14K such annotations and release SCAT ("Supporting Context for Ambiguous Translations"), the first dataset of human rationales for resolving ambiguity in document-level translation. Analysis reveals that inter-sentential target context is important for pronoun translation, whereas intra-sentential source context is often sufficient for word sense disambiguation.
To answer the second question, we quantify the similarity of the attention distribution of context-aware models and the human annotations in SCAT (§4). We measure alignment between the baseline context-aware model's attention and human rationales across various model attention heads and layers. We observe a relatively high alignment between self-attention scores from the top encoder layers and the source-side supporting context marked by translators; however, the model's attention is poorly aligned with target-side supporting context.
For the third question, we explore a method to regularize attention towards human-annotated disambiguating context (§5). We find that attention regularization is an effective technique to encourage models to pay more attention to words humans find useful to resolve ambiguity in translations. Our models with regularized attention outperform previous context-aware baselines, improving translation quality by 0.54 BLEU, and yielding a relative improvement of 14.7% in contrastive evaluation. An example of translations from a baseline and our model, along with the supporting rationale by a professional translator, is illustrated in Table 1.

Document-Level Translation
Neural Machine Translation. Current NMT models employ encoder-decoder architectures (Bahdanau et al., 2015; Vaswani et al., 2017). First, the encoder maps a source sequence $x = (x_1, x_2, \ldots, x_S)$ to a continuous representation $z = (z_1, z_2, \ldots, z_S)$. Then, given $z$, the decoder generates the corresponding target sequence $y = (y_1, y_2, \ldots, y_T)$, one token at a time. Sentence-level NMT models take one source sentence and generate one target sentence at a time. These models perform reasonably well, but given that they only have intra-sentential context, they fail to handle some phenomena that require inter-sentential context to accurately translate. Well-known examples of these phenomena include gender-marked anaphoric pronouns (Guillou et al., 2018) and maintenance of lexical coherence (Läubli et al., 2018).
Document-Level Translation. Document-level translation models learn to maximize the probability of a target document $Y$ given the source document $X$: $P_\theta(Y \mid X) = \prod_{j=1}^{J} P_\theta(y_j \mid x_j, C_j)$, where $y_j$ and $x_j$ are the $j$-th target and source sentences, and $C_j$ is the collection of contextual sentences for the $j$-th sentence pair. There are many methods for incorporating context (§6), but even simple concatenation (Tiedemann and Scherrer, 2017), which prepends the previous source or target sentences to the current sentence separated by a BRK tag, achieves comparable performance to more sophisticated approaches, especially in high-resource scenarios (Lopes et al., 2020).
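The concatenation approach described above can be sketched in a few lines. This is a minimal illustration, not the authors' preprocessing code: the function name `concat_with_context`, the surface form `<brk>` of the BRK tag, and the whitespace-joined format are all assumptions made for the example.

```python
from typing import List

# Assumed surface form of the break tag; the text only calls it "BRK".
BRK = "<brk>"


def concat_with_context(sentences: List[str], j: int, k: int = 5, brk: str = BRK) -> str:
    """Return sentence j with up to k preceding sentences prepended.

    Segments are joined by the break tag, so the last segment is always
    the current sentence.
    """
    if not 0 <= j < len(sentences):
        raise IndexError("sentence index out of range")
    window = sentences[max(0, j - k): j + 1]
    return f" {brk} ".join(window)


if __name__ == "__main__":
    doc = ["I read the report.", "It was long.", "It was also dull."]
    print(concat_with_context(doc, 2, k=1))
```

The same helper can be applied to the source side and the target side separately, which matches the "previous source or target sentences" wording; how the two sides are combined in a given model is left open here.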
Evaluation. BLEU (Papineni et al., 2002) is most widely used to evaluate MT, but it can be poorly correlated with human evaluation (Callison-Burch et al., 2006; Reiter, 2018). Recently, a number of neural evaluation methods, such as COMET (Rei et al., 2020), have shown better correlation with human judgement. Nevertheless, common automatic metrics have limited ability to evaluate discourse in MT (Hardmeier, 2012). As a remedy to this, researchers often use contrastive test sets for a targeted discourse phenomenon (Müller et al., 2018), such as pronoun anaphora resolution and word sense disambiguation, to verify if the model ranks the correct translation of an ambiguous sentence higher than the incorrect translation.
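Contrastive evaluation as described above reduces to comparing two model scores per example. The sketch below assumes a hypothetical `score(source, target)` callable (for instance a model log-probability) and counts ties as errors; both choices are assumptions for illustration rather than details given in the text.

```python
from typing import Callable, Iterable, Tuple

# (source with context, correct translation, contrastive translation)
Example = Tuple[str, str, str]


def contrastive_accuracy(
    examples: Iterable[Example],
    score: Callable[[str, str], float],
) -> float:
    """Fraction of examples where the correct translation scores strictly higher."""
    examples = list(examples)
    if not examples:
        raise ValueError("need at least one example")
    wins = sum(score(src, good) > score(src, bad) for src, good, bad in examples)
    return wins / len(examples)


if __name__ == "__main__":
    # Toy scorer that always prefers the masculine pronoun.
    def toy_score(src: str, tgt: str) -> float:
        return 1.0 if "il" in tgt.split() else 0.0

    data = [
        ("the report <brk> it is long", "il est long", "elle est longue"),
        ("the table <brk> it is long", "elle est longue", "il est long"),
    ]
    print(contrastive_accuracy(data, toy_score))
```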

What Context Do Human Translators Pay Attention to?
We first conduct a user study to collect supporting context that translators use in disambiguation, and analyze characteristics of the supporting words.

Recruitment and Annotation Setup
We recruited 20 freelance English-French translators on Upwork. The translators are native speakers of at least one of the two languages and have a job success rate of over 90%. Each translator is given 400 examples with an English source sentence and two possible French translations, and one out of 5 possible context levels: no context (0+0), only the previous source sentence as context (1+0), only the previous target sentence (0+1), the previous source sentence and target sentence (1+1), and the 5 previous source and target sentences (5+5). We vary the context level in each example to measure how human translation quality changes. Translators provide annotations using the interface shown in Figure 1. They are first asked to select the correct translation out of the two contrastive translations, and then highlight word(s) they found useful to arrive at their answer. In cases where multiple words are sufficient to disambiguate, translators were asked to mark only the most salient words rather than all of them. Further, translators also reported their confidence in their answers, choosing from "not at all", "somewhat", and "very".

Tasks and Data Quality
We perform this study for two tasks: pronoun anaphora resolution (PAR), where the translators are tasked with choosing the correct French gendered pronoun corresponding to a neutral English pronoun, and word sense disambiguation (WSD), where the translators pick the correct translation of a polysemous word. PAR, and WSD to a lesser extent, have been commonly studied to evaluate context-aware NMT models (Lopes et al., 2020; Müller et al., 2018; Huo et al., 2020; Nagata and Morishita, 2020).

Pronoun Anaphora Resolution.
We annotate examples from the contrastive test set by Lopes et al. (2020). This set includes 14K examples from the OpenSubtitles2018 dataset (Lison et al., 2018) with occurrences of the English pronouns "it" and "they" that correspond to the French translations "il" or "elle" and "ils" or "elles", with 3.5K examples for each French pronoun type. Through our annotation effort, we obtain 14K examples of supporting context for pronoun anaphora resolution in ambiguous translations selected by professional human translators. Statistics on this dataset, SCAT: Supporting Context for Ambiguous Translations, are provided in Appendix A.
Word Sense Disambiguation. There are no existing contrastive datasets for WSD with a context window larger than 1 sentence; therefore, we automatically generate contrastive examples with a context window of 5 sentences from OpenSubtitles2018 by identifying polysemous English words and possible French translations. We describe our methodology in Appendix B.
Quality. For quality control, we asked 8 internal speakers of English and French, with native or bilingual proficiency in both languages, to carefully annotate the same 100 examples given to all professional translators. We compared both the answer accuracies and the selected words for each hired translator against this control set and discarded submissions that either had several incorrect answers while the internal bilinguals were able to choose the correct answer on the same example, or that highlighted contextual words that the internal annotators did not select and that had little relevance to the ambiguous word. Furthermore, among the 400 examples given to each annotator, the first hundred are identical, allowing us to measure the inter-annotator agreement for both answer and supporting context selection.
First, for answer selection on PAR, we find 91.0% overall agreement, with Fleiss' free-marginal kappa κ = 0.82. For WSD, we find 85.9% overall agreement with κ = 0.72. This indicates substantial inter-annotator agreement for the selected answer. In addition, we measure the inter-annotator agreement for the selected words by calculating the F1 between the word selections for each pair of annotators given identical context settings. For PAR, we obtain an average F1 of 0.52 across all possible pairs, with a standard deviation of 0.12. For WSD, we find an average F1 of 0.46 and a standard deviation of 0.12. There is a high agreement between annotators for the selected words as well.

Answer Accuracy and Confidence
Table 2 shows the accuracy of answers and the percentage of answers being reported as not at all confident for each of the 5 different context levels. For PAR, there is a large increase in accuracy and confidence when just one previous sentence in either language is provided as context compared to no context at all. Target-side context also seems more useful than source: only target-side context gives higher answer accuracy than only source-side context, while the accuracy does not increase significantly by having both previous sentences. For WSD, we do not observe significant differences in answer accuracy and confidence between the different context levels (Figure 2). The high answer accuracy with 0+0 context and the low rate of zero-confidence answers across all settings suggest that the necessary disambiguating information is often present in the intra-sentential context. Alternatively, this may be partially due to characteristics of the automatically generated dataset itself: we found that some examples are misaligned, so the previous sentences given as context do not actually correspond to the context of the current sentences and therefore do not add useful information. We also observe that translators tend to report high confidence and high agreement on incorrect answers as well. This can be explained by the tendency to select the masculine pronoun in PAR (Figure 3) or the prevailing word sense in WSD.
To properly translate an anaphoric pronoun, the translator must identify its antecedent and determine its gender, so we hypothesize that the antecedent is of high importance for disambiguation. In our study, 72.4% of the examples shown to annotators contain the antecedent in the context or current sentences. We calculate how answer accuracy and confidence vary between examples that do or do not contain the pronoun antecedent. We find that the presence of the antecedent in the context leads to larger variations in answer accuracy than the level of context given, demonstrating the importance of antecedents for resolution.
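The two agreement statistics reported for data quality can be reproduced in outline as follows. This is a sketch under stated assumptions: free-marginal kappa is computed from the overall agreement with k = 2 answer options (the two contrastive translations), highlighted words are represented as sets of token positions, and two empty selections are treated as perfect agreement. The function names are illustrative only.

```python
from itertools import combinations
from statistics import mean
from typing import Dict, Set


def free_marginal_kappa(observed_agreement: float, num_categories: int) -> float:
    """Free-marginal kappa: chance agreement is fixed at 1 / num_categories."""
    chance = 1.0 / num_categories
    return (observed_agreement - chance) / (1.0 - chance)


def selection_f1(a: Set[int], b: Set[int]) -> float:
    """F1 between two annotators' sets of highlighted token positions."""
    if not a and not b:
        return 1.0  # assumption: two empty selections count as full agreement
    overlap = len(a & b)
    if overlap == 0:
        return 0.0
    precision = overlap / len(a)
    recall = overlap / len(b)
    return 2 * precision * recall / (precision + recall)


def mean_pairwise_f1(selections: Dict[str, Set[int]]) -> float:
    """Average F1 over all unordered annotator pairs on one example."""
    pairs = combinations(sorted(selections), 2)
    return mean(selection_f1(selections[p], selections[q]) for p, q in pairs)


if __name__ == "__main__":
    print(round(free_marginal_kappa(0.910, 2), 2))  # PAR answer agreement
    print(round(free_marginal_kappa(0.859, 2), 2))  # WSD answer agreement
    print(mean_pairwise_f1({"a": {1, 2}, "b": {2, 3}, "c": {1, 2}}))
```

With two answer options, the reported overall agreements of 91.0% and 85.9% give (0.910 - 0.5) / 0.5 = 0.82 and (0.859 - 0.5) / 0.5 ≈ 0.72, which matches the κ values in the text.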

Analysis of the Highlighted Words
Next, we examine the words that were selected as rationales from several angles.
Distance. Figure 4 shows, for each context level, the number of highlighted words at a given distance (in sentences) from the ambiguous word. For PAR, when no previous sentences are provided, there are as many selected words from the source as the target context. With inter-sentential context, experts selected more supporting context from the target side. One possible reason is that the source and target sentences on their own are equally descriptive to perform PAR, but one may look for the coreference chain of the anaphoric pronoun in the target context to determine its gender, whereas the same coreference chain in the source context would not necessarily contain gender information. Moreover, the antecedent in the target side is more reliable than the source antecedent, since the antecedent can have multiple possible translations with different genders. For WSD, we find that inter-sentential context is seldom highlighted, which reinforces our previous claim that most supporting context for WSD can be found in the current sentences.
Figure 4: Sentence distance of the highlighted words for each context level for PAR and WSD.
Part-of-Speech and Dependency. We use spaCy (Honnibal and Montani, 2017) to predict part-of-speech (POS) tags of selected words and syntactic dependencies between selected words and the ambiguous word. In Table 3a, we find that nominals are the most useful for PAR, which suggests that human translators look for other referents of the ambiguous pronoun to determine its gender. This is reinforced by Table 3b, where the antecedent of the pronoun is selected the most often.
The main difference between PAR and WSD is that for PAR, the key supporting information is gender. The source side does not contain explicit information about the gender of the ambiguous pronoun whereas the target side may contain other gendered pronouns and determiners referring to the ambiguous pronoun. For WSD however, the key supporting information is word sense. While the source and target sides contain around the same amount of semantic information, humans may prefer to attend to source sentences that express how the ambiguous word is used in the sentence.

Model
We incorporate the 5 previous source and target sentences as context to the base Transformer (Vaswani et al., 2017) by prepending the previous sentences to the current sentence, separated by a BRK tag, as proposed by Tiedemann and Scherrer (2017).

Similarity Metrics
To calculate similarity between model attention and highlighted context, we first construct a human attention vector $\alpha_{\text{human}}$, where 1 corresponds to tokens marked by the human annotators, and 0 otherwise. We compare this vector against the model's attention for the ambiguous pronoun for a given layer and head, $\alpha_{\text{model}}$, across three metrics:
Dot Product. The dot product $\alpha_{\text{human}} \cdot \alpha_{\text{model}}$ measures the total attention mass the model assigns to words highlighted by humans.
KL Divergence. We compute the KL divergence between the model attention and the normalized human attention vector, $\mathrm{KL}(\alpha_{\text{human-norm}} \,\|\, \alpha_{\text{model}}(\theta))$, where the normalized distribution $\alpha_{\text{human-norm}}$ is uniform over all tokens selected by humans and a very small constant elsewhere such that the sum of values in $\alpha_{\text{human-norm}}$ is equal to 1.
Probes Needed. We adapt the "probes needed" metric by Zhong et al. (2019) to measure the number of tokens we need to probe, based on the model attention, to find a token highlighted by humans. This corresponds to the ranking of the first highlighted token after sorting all tokens by descending model attention. The intuition is that the more attention the model assigns to supporting context, the fewer probes are needed to find a supporting token.
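The three metrics can be written down compactly for a single attention vector. The sketch below is illustrative rather than the authors' implementation: the smoothing constant `eps`, the renormalization used to build the normalized human vector, and the function names are assumptions.

```python
import math
from typing import List, Sequence


def normalize_human(mask: Sequence[int], eps: float = 1e-10) -> List[float]:
    """Uniform over highlighted tokens, eps elsewhere, rescaled to sum to 1."""
    n_marked = sum(1 for m in mask if m)
    if n_marked == 0:
        raise ValueError("no highlighted tokens")
    raw = [1.0 / n_marked if m else eps for m in mask]
    total = sum(raw)
    return [v / total for v in raw]


def dot_product(mask: Sequence[int], attn: Sequence[float]) -> float:
    """Total attention mass on human-highlighted tokens."""
    return sum(m * a for m, a in zip(mask, attn))


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-10) -> float:
    """KL(p || q), with q clipped below at eps to avoid division by zero."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def probes_needed(mask: Sequence[int], attn: Sequence[float]) -> int:
    """1-based rank of the first highlighted token in descending attention order."""
    order = sorted(range(len(attn)), key=lambda i: attn[i], reverse=True)
    for rank, idx in enumerate(order, start=1):
        if mask[idx]:
            return rank
    raise ValueError("no highlighted tokens")


if __name__ == "__main__":
    human = [0, 1, 0, 1]
    model = [0.5, 0.3, 0.1, 0.1]
    print(dot_product(human, model))
    print(kl_divergence(normalize_human(human), model))
    print(probes_needed(human, model))
```

Higher dot product and lower KL divergence and probes needed all indicate closer alignment with the human rationale.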

Results
We compute the similarity between the model attention distribution for the ambiguous pronoun and the supporting context from 1,000 SCAT samples. In Table 5, for each attention type we report the best score across layers and attention heads. We also report the alignment score between a uniform distribution and supporting context for comparison. We find that although there is a reasonably high alignment between encoder self attention and SCAT, decoder attentions have very low alignment with SCAT.

Attention Regularization
We hypothesize that encouraging models to increase attention on words that humans use to resolve ambiguity may improve translation quality. We apply attention regularization to guide model attention to increase alignment with the supporting context from SCAT. To do so, we add to the translation loss an attention regularization loss between the normalized human attention vector $\alpha_{\text{human-norm}}$ and the model attention vector for the corresponding ambiguous pronoun $\alpha_{\text{model}}$, weighted by a scalar parameter $\lambda$. During training, we randomly sample batches from SCAT with p = 0.2. We train with the standard MT objective on the full dataset, and on examples from SCAT, we additionally compute the attention regularization loss.
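One way to read the training objective is sketched below. Several points are assumptions rather than statements from the text: the regularizer is taken to be the KL divergence from the similarity metrics, it is added with a positive weight so that minimizing the total loss pulls model attention toward the human distribution, the loss is divided by the attention vector length as a stand-in for the input length mentioned in Appendix C.3, and batch mixing is modeled as an independent draw per step. The value λ = 10 is the one reported in Appendix C.3.

```python
import math
import random
from typing import Sequence


def kl(p: Sequence[float], q: Sequence[float], eps: float = 1e-10) -> float:
    """KL(p || q) with q clipped below at eps."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def training_loss(
    mt_loss: float,
    human_norm: Sequence[float],
    model_attn: Sequence[float],
    lam: float = 10.0,
    is_scat_example: bool = True,
) -> float:
    """MT loss, plus a length-normalized attention regularizer on SCAT examples."""
    if not is_scat_example:
        return mt_loss
    return mt_loss + lam * kl(human_norm, model_attn) / len(model_attn)


def sample_batch_source(rng: random.Random, p_scat: float = 0.2) -> str:
    """Pick which dataset the next batch comes from (assumed mixing scheme)."""
    return "scat" if rng.random() < p_scat else "mt"


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_batch_source(rng))
    print(training_loss(2.0, [0.5, 0.5], [0.9, 0.1]))
```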

Data
For document translation, we use the English and French data from OpenSubtitles2018 (Lison et al., 2018), which we clean then split into 16M training, 10,036 development, and 9,740 testing samples. For attention regularization, we retain examples from SCAT where 5+5 context was given to the annotator. We use 11,471 examples for training and 1,000 for testing.

Models
We first train a baseline model, where the 5 previous source and target sentences serve as context and are incorporated via concatenation. This baseline model is trained without attention regularization.
We explore two models with attention regularization: (1) attnreg-rand, where we jointly train on the MT objective and regularize attention on a randomly initialized model; (2) attnreg-pre, where we first pre-train the model solely on the MT objective, then we jointly train on the MT objective and regularize attention. We describe the full setup in Appendix C.

Evaluation
As described in Section 2, we evaluate translation outputs with BLEU and COMET. In addition, to evaluate the direct translation quality of specific phenomena, we translate the 4,015 examples from Lopes et al. (2020) containing ambiguous pronouns that were not used for attention regularization, and we compute the mean word f-measure of translations of the ambiguous pronouns and other words, with respect to reference texts.
We also perform contrastive evaluation on the same subset of Lopes et al. (2020) with a context window of 5 sentences (Big-PAR) and the contrastive test sets by Bawden et al. (2018), which include 200 examples on anaphoric pronoun translation and 200 examples on lexical consistency/word sense disambiguation. The latter test sets were crafted manually, have a context window of 1 sentence, and either the previous source or target sentence is necessary to disambiguate.
Context-aware models often suffer from error propagation when using previously decoded output tokens as the target context (Li et al., 2020a). Therefore, during inference, we experiment with both using the gold target context (Gold) as well as using previous output tokens (Non-Gold).
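The difference between the Gold and Non-Gold settings lies only in where the target-side context comes from. The sketch below assumes a hypothetical `translate(source_context, target_context, source)` callable standing in for the model; it is not an API from the paper.

```python
from typing import Callable, List, Optional, Sequence

Translate = Callable[[Sequence[str], Sequence[str], str], str]


def translate_document(
    sources: Sequence[str],
    translate: Translate,
    gold_targets: Optional[Sequence[str]] = None,
    k: int = 5,
) -> List[str]:
    """Translate sentence by sentence with up to k previous sentences as context.

    If gold_targets is given, reference translations are used as target
    context (Gold); otherwise the model's own previous outputs are used
    (Non-Gold), so earlier errors can propagate.
    """
    outputs: List[str] = []
    for j, src in enumerate(sources):
        start = max(0, j - k)
        src_ctx = list(sources[start:j])
        history = gold_targets if gold_targets is not None else outputs
        tgt_ctx = list(history[start:j])
        outputs.append(translate(src_ctx, tgt_ctx, src))
    return outputs


if __name__ == "__main__":
    # Toy "model" that copies the last target-context sentence in front of the source.
    def toy(src_ctx, tgt_ctx, src):
        return (tgt_ctx[-1] if tgt_ctx else "") + src

    print(translate_document(["a", "b", "c"], toy))
    print(translate_document(["a", "b", "c"], toy, gold_targets=["x", "y", "z"]))
```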

Overall Performance
Before delving into the main results, we note that we explored regularizing different attention vectors in the model (Appendix C.3) and obtain the best BLEU and COMET scores for attnreg-rand when regularizing the self-attention of the top encoder layer, cross-attention of the top decoder layer and self-attention of the bottom decoder layer. For attnreg-pre, regularizing self-attention in the top decoder layer gives the best scores. Thus, we use these as the default regularization methods below.
Moving on to the main results in Table 6, we observe that attnreg-rand improves on all metrics, which demonstrates that attention regularization is an effective method to improve translation quality. Although attnreg-pre does not improve general translation scores significantly, it yields considerable gains in word f-measure on ambiguous pronouns and achieves some improvement over the baseline on contrastive evaluation on Big-PAR and PAR. Attention regularization with supporting context for PAR seems to especially improve models on similar tasks. The disparity between BLEU/COMET scores and targeted evaluations such as word f-measure and contrastive evaluation further suggests that general MT metrics are somewhat insensitive to improvements on specific discourse phenomena. For both models with attention regularization, there are no significant gains in WSD. As discussed in §3.4, WSD and PAR require different types of supporting context, so it is natural that regularizing attention using supporting context extracted from only one task does not always lead to improvement on the other.

Analysis
We now investigate how models trained with attention regularization handle context differently compared to the baseline model.

How does attention regularization influence alignment with human rationales?
We revisit the similarity metrics from §4.2 to measure alignment with SCAT. In Table 5, the dot product alignment over attention in the decoder increases with attention regularization, suggesting that attention regularization guides different parts of the model to pay attention to useful context. Interestingly, although only the encoder self-attention was explicitly regularized for attnreg-pre, the model seems to also have learned better alignment for attention in the decoder. Moreover, attnreg-pre generally has better alignment than attnreg-rand, suggesting that models respond more to attention regularization once they have been trained to perform translation.
Which attention is the most useful?
For each of attnreg-rand and attnreg-pre, we perform attention regularization on either the encoder self-attention, decoder cross-attention or decoder self-attention only. In Table 7, encoder self-attention seems to contribute the most to both translation performance and contrastive evaluation. Although attnreg-rand models achieve higher BLEU and COMET scores, attnreg-pre models obtain higher scores on metrics targeted to pronoun translation. Attention regularization seems to have limited effect on WSD performance: the scores vary little between attention types.

How much do models rely on supporting context?
We compare model performance on contrastive evaluation on SCAT when it is given full context, and when we mask either the supporting context, random context words with p = 0.1, the source context, the target context, or all of the context. In Table 8, we find that the baseline's performance varies little when the supporting context is masked, which again suggests that context-aware baselines do not use the relevant context, although the baseline does show a drop in contrastive performance when the source context or all of the context is masked. Models with attention regularization, especially attnreg-pre, show a large drop in contrastive performance when supporting context is masked, which indicates that they learned to rely more on supporting context. Furthermore, for attnreg-pre, the score after masking supporting context is significantly lower than when masking all context, which may indicate that having irrelevant context can have an adverse effect. Another interesting finding is that both baseline and attnreg-rand seem to rely more on the source context than the target context, in contrast to human translators. This result corroborates prior results where models have better alignment with supporting context on attention that attends to the source (encoder self-attention and decoder cross-attention), and regularizing these attention vectors contributes more to translation quality than regularizing the decoder self-attention.
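The masking conditions in this experiment can be sketched as operations on token positions. The text does not say how masking is implemented, so replacing tokens with a placeholder `<mask>` string is an assumption, as are the helper names; only the random-masking probability p = 0.1 comes from the text.

```python
import random
from typing import Iterable, List, Optional, Sequence, Set

MASK = "<mask>"  # hypothetical placeholder token


def mask_positions(
    tokens: Sequence[str],
    positions: Iterable[int],
    mask_token: str = MASK,
) -> List[str]:
    """Replace the tokens at the given positions with a mask token."""
    chosen = set(positions)
    return [mask_token if i in chosen else tok for i, tok in enumerate(tokens)]


def sample_positions(
    candidates: Iterable[int],
    p: float = 0.1,
    rng: Optional[random.Random] = None,
) -> Set[int]:
    """Select each candidate context position independently with probability p."""
    rng = rng or random.Random()
    return {i for i in candidates if rng.random() < p}


if __name__ == "__main__":
    tokens = ["the", "report", "<brk>", "it", "is", "long"]
    supporting = {1}      # positions highlighted by annotators
    context = range(0, 2)  # positions before the break tag
    print(mask_positions(tokens, supporting))
    print(mask_positions(tokens, context))
    print(mask_positions(tokens, sample_positions(context, p=0.1, rng=random.Random(0))))
```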
6 Related Work

Context-Aware Machine Translation
Most current context-aware NMT approaches enhance NMT by including source- and/or target-side surrounding sentences as context to the model. Tiedemann and Scherrer (2017) prepend the previous sentences to the current sentence. However, recent studies suggest that current context-aware NMT models often do not use context meaningfully. Kim et al. (2019) claim that improvements by context-aware models are mostly from regularization by reserving parameters for context inputs, and Li et al. (2020b) show that replacing the context in multi-encoder models with random signals leads to similar accuracy as using the actual context. Our work addresses the above disparities by collecting human supporting context to regularize model attention heads during training.

Attention Mechanisms
Though attention is usually learned in an unsupervised manner, recent work supervises attention with word alignments (Mi et al., 2016; Liu et al., 2016), event arguments and trigger words (Zhao et al., 2018), syntactic dependencies (Strubell et al., 2018) or word lexicons (Zou et al., 2018). Our work is closely related to a large body of work that supervises attention using human rationales for text classification (Barrett et al., 2018; Bao et al., 2018; Zhong et al., 2019; Choi et al., 2020; Pruthi et al., 2020). Our work, however, is the first to collect human evidence for document translation and use it to regularize the attention of NMT models.

Implications and Future Work
In this work, we collected a corpus of supporting context for translating ambiguous words. We examined how baseline context-aware translation models use context, and demonstrated how context annotations can improve context-aware translation accuracy. While we obtain promising results for context-aware translation by testing one method for attention regularization, our publicly available SCAT dataset could enable future research on alternative attention regularizers. Moreover, our analyses demonstrate that humans rely on different types of context for PAR and WSD in English-French translation; similar user studies can be conducted to better understand the usage of context in other ambiguous discourse phenomena, such as ellipsis, or in other language pairs. We also find that regularizing attention using SCAT for PAR especially improves anaphoric pronoun translation, suggesting that supervising attention using supporting context from different tasks may help models resolve other types of ambiguities.
One caveat regarding our method for collecting supporting context from humans is the difference between translation (translating text from the input) and disambiguation (choosing between translation candidates). During translation, humans might pay more attention to the source sentences to understand the source material, but during disambiguation, we have shown that human translators rely more often on the target sentences. One reason why the model benefits more from increased attention on the source may be that the model is trained and evaluated to perform translation, not disambiguation. A future step would be to explore alternative methods for extracting supporting context, such as eye-tracking during translation (O'Brien, 2009

B Generating Data for Word Sense Disambiguation
To automatically generate contrastive examples of WSD, we identify English words that have multiple French translations. To do so, we first extract word alignments from OpenSubtitles2018 using AWESOME-align (Dou and Neubig, 2021) and obtain sets of aligned word pairs $A_m$, where for each word pair $\langle x_i, y_j \rangle \in A_m$, $x_i$ and $y_j$ are semantically similar to each other in context. For each pair $\langle x_i, y_j \rangle \in A_m$, we compute the number of times the lemmatized source word type ($v_x = \mathrm{lemma}(x_i)$) along with its POS tag ($t_x = \mathrm{tag}(x_i)$) is aligned to the lemmatized target word type ($v_y = \mathrm{lemma}(y_j)$): $c(v_x, t_x, v_y)$. Then, we extract tuples of source types with their POS tags $\langle v_x, t_x \rangle$ that have at least two target words that have been aligned at least 50 times ($|\{v_y \mid c(v_x, t_x, v_y) \geq 50\}| \geq 2$). Finally, we filter out the source tuples which have an entropy $H(v_x, t_x)$ less than a pre-selected threshold $z$. This entropy is computed using the conditional probability of a target translation given the source word type and its POS tag as follows: $H(v_x, t_x) = -\sum_{v_y \in \mathcal{Y}_{v_x, t_x}} p(v_y \mid v_x, t_x) \log p(v_y \mid v_x, t_x)$, where $\mathcal{Y}_{v_x, t_x}$ is the set of target translations for the source tuple $\langle v_x, t_x \rangle$ and $p(v_y \mid v_x, t_x)$ is the conditional probability of a given target translation $v_y$ for the source word type $v_x$ and its POS tag $t_x$. Out of the 394 extracted word groups, we manually validate and retain 201 groups and then classify them into 64 synonymous and 137 non-synonymous word groups (Table 10). We create contrastive translations by extracting sentence pairs containing an ambiguous word pair, and replacing the translation of the polysemous English word by a different French word in the same group. For word groups with synonymous French words, we only retain examples where the French word appears within the previous 5 sentences to enforce lexical consistency, as otherwise the different French words may be interchangeable.
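The counting and filtering steps above can be sketched on pre-lemmatized alignment triples. This is an illustration under assumptions: the input is taken to be an iterable of (source lemma, source POS tag, target lemma) triples, the conditional probability is estimated over the frequent target translations only, the entropy uses the natural logarithm, and the threshold value is hypothetical since the text does not state z.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (source lemma, source POS tag, target lemma)


def ambiguous_source_types(
    aligned: Iterable[Triple],
    min_count: int = 50,
    min_entropy: float = 0.5,  # hypothetical value for the threshold z
) -> Dict[Tuple[str, str], Tuple[List[str], float]]:
    """Keep (lemma, POS) tuples with at least two frequent, high-entropy translations."""
    counts = Counter(aligned)  # c(v_x, t_x, v_y)
    by_source: Dict[Tuple[str, str], Dict[str, int]] = defaultdict(dict)
    for (vx, tx, vy), c in counts.items():
        by_source[(vx, tx)][vy] = c

    kept: Dict[Tuple[str, str], Tuple[List[str], float]] = {}
    for key, targets in by_source.items():
        frequent = {vy: c for vy, c in targets.items() if c >= min_count}
        if len(frequent) < 2:
            continue
        total = sum(frequent.values())
        entropy = -sum((c / total) * math.log(c / total) for c in frequent.values())
        if entropy >= min_entropy:
            kept[key] = (sorted(frequent), entropy)
    return kept


if __name__ == "__main__":
    data = (
        [("bank", "NOUN", "banque")] * 60
        + [("bank", "NOUN", "rive")] * 60
        + [("cat", "NOUN", "chat")] * 100
        + [("cat", "NOUN", "minou")] * 3
    )
    print(ambiguous_source_types(data))
```

In the toy data, "bank" has two translations that each clear the count threshold, while "cat" has only one, so only "bank" survives. The manual validation and the synonymous versus non-synonymous split described in the text are not modeled here.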

C.1 Data preprocessing
We use the English and French data from the publicly available OpenSubtitles2018 dataset (Lison et al., 2018). To reduce noise, we first clean the data by selecting sentence pairs with a relative time overlap between source and target language subtitle frames of at least 0.9. The data is then encoded with byte-pair encoding (Sennrich et al., 2016) using SentencePiece (Kudo and Richardson, 2018), with source and target vocabularies of 32k tokens.

C.2 Training configuration
We follow the Transformer base (Vaswani et al., 2017) configuration in all our experiments, with $N = 6$ encoder and decoder layers, $h = 8$ attention heads, hidden size $d_{\text{model}} = 512$ and feedforward size $d_{\text{ff}} = 2048$. We use the learning rate schedule and regularization described in Vaswani et al. (2017). We train using the Adam optimizer (Kingma and Ba, 2015) with $\beta_1 = 0.9$, $\beta_2 = 0.98$.
Table 11: Results of all models with regularized attention. AR: attnreg-rand, AP: attnreg-pre

C.3 Attention regularization setups
For both attnreg-rand and attnreg-pre, we experiment with performing regularization on different model attentions at different layers. In all experiments, we compute the attention regularization loss on the first attention head with $\lambda = 10$, and we divide the loss by the length of the input. We give the results for all setups in Table 11.