Improving Long Context Document-Level Machine Translation

Document-level context for neural machine translation (NMT) is crucial to improve translation consistency and cohesion, the translation of ambiguous inputs, as well as several other linguistic phenomena. Many works have been published on the topic of document-level NMT, but most restrict the system to only local context, typically including just the one or two preceding sentences as additional information. This might be enough to resolve some ambiguous inputs, but it is probably not sufficient to capture document-level information like the topic or style of a conversation. When increasing the context size beyond the local context, there are two challenges: (i) the memory usage increases quadratically; (ii) the translation performance starts to degrade. We argue that the widely-used attention mechanism is responsible for both issues. Therefore, we propose a constrained attention variant that focuses the attention on the most relevant parts of the sequence, while simultaneously reducing the memory consumption. For evaluation, we utilize targeted test sets in combination with novel evaluation techniques to analyze the translations in regards to specific discourse-related phenomena. We find that our approach is a good compromise between sentence-level NMT and attending to the full context, especially in low-resource scenarios.


Introduction
Machine translation (MT) is the task of mapping some input text onto the corresponding translation in the target language. MT systems typically operate on the sentence level and utilize neural networks trained on large amounts of bilingual data (Bahdanau et al., 2014; Vaswani et al., 2017). These neural machine translation (NMT) systems perform remarkably well on many domains and language pairs, sometimes even on par with professional human translators. However, when the automatic translations are evaluated on the document level (e.g. the translation of a whole paragraph or conversation is evaluated), they reveal shortcomings regarding consistency in style, entity translation or correct inference of gender, among other things (Läubli et al., 2018; Müller et al., 2018; Thai et al., 2022). The goal of document-level NMT is to resolve these shortcomings by including context information as additional input when translating a sentence.
In recent years, many works have been published on the topic of document-level NMT. However, most of these works focus only on including a few surrounding sentences as context. When the context size is increased beyond that, a degradation of overall translation performance is typically reported. Additionally, the transformer architecture, the de facto standard in NMT, seems suboptimal for handling long sequences as input/output, since the memory complexity increases quadratically with the sequence length. This is due to the attention mechanism, where each token in a sequence needs to attend to all other tokens.
In this work, we propose a constrained attention variant for the task of document-level NMT. The idea is to reduce the memory consumption while at the same time focusing the attention of the system onto the most relevant parts of the sequence. Our contributions are two-fold: 1. We observe that the attention patterns become less focused on the current sentence when increasing the context size of our document-level NMT systems. Therefore, we propose a constrained attention variant that is also more memory efficient.
2. We utilize a targeted evaluation method to assess automatic translations in regards to consistency in style and coreference resolution. We find that our document-level NMT approach performs among the best across all language pairs and test scenarios.
Related Work

Many works have been published on the topic of document-level NMT. The widely used baseline approach consists of simply concatenating a few adjacent sentences and feeding this as input to the MT system, without modifying the system architecture in any way (Tiedemann and Scherrer, 2017; Bawden et al., 2018; Agrawal et al., 2018; Talman et al., 2019; Nguyen et al., 2021; Majumder et al., 2022). Several modifications to this baseline concatenation approach have also been proposed. Ma et al. (2020) introduce segment embeddings and also partially constrain the attention to the tokens of the current sentence. Zhang et al. (2020) propose to calculate the self-attention both on the sentence and on the document level and then combine the two representations. Fernandes et al. (2021) and Lei et al. (2022) both mask out tokens in the current sentence to increase context utilization, while Yang et al. (2023) remove tokens from the context if they are not attended to. Typically, slight improvements in BLEU are reported, as well as more significant improvements on targeted test sets, e.g. for coreference resolution.
While the concatenation approach works well for short context sizes, when used with a larger number of context sentences, performance degradation is typically reported: Scherrer et al. (2019) saw a severe performance degradation when using input sequences with a length of 250 tokens. Liu et al. (2020) could not get their system to converge when using context sizes of up to 512 tokens. They improve training stability by adding additional monolingual data via pre-training. Bao et al. (2021) also report that their systems with a context length of more than 256 tokens fail to converge. They propose to partially constrain the attention to the current sentence, similar to Zhang et al. (2020). Sun et al. (2022) try to translate full documents with the concatenation approach but could not get their system to converge during training. Their solution is to mix document- and sentence-level data, which reportedly improves system convergence. Li et al. (2022) report severe performance degradation for context sizes longer than 512 tokens. They argue this is due to insufficient positional information and improve performance by repeatedly injecting this information during the encoding process. However, increasing the context size does not always seem to result in performance degradation. In their works, Junczys-Dowmunt (2019) and Saleh et al. (2019) train systems with a context size of up to 1000 tokens without degradation in translation quality, which stands in contrast to the works mentioned above and which we will discuss again in the context of our own results. We want to point out that all of the approaches mentioned above still have the problem of quadratically increasing resource requirements, which poses a big challenge even on modern hardware.
Since our proposed approach consists of modifying the attention matrix in the model architecture, we give a brief overview of previous works related to this concept. The works of Ma et al. (2020), Zhang et al. (2020) and Bao et al. (2021) are most closely related and were already mentioned above. All three papers restrict the attention (partially) to the current sentence and combine sentence- and document-level attention context vectors for the final output. However, this means all these approaches still suffer from the quadratic dependency on the number of input tokens. Luong et al. (2015) were among the first to propose using the attention concept for the task of MT. They also proposed using a sliding window with target-to-source alignment for attention, similar to us. However, they only work on sentence-level NMT and, to the best of our knowledge, this approach was never before transferred to document-level NMT. Shu and Nakayama (2017) and Chen et al. (2018) both extend the approach of Luong et al. (2015) while still working solely on sentence-level NMT. Our approach is also related to the utilization of relative positional encoding, which was introduced by Shaw et al. (2018) and later extended by Yang et al. (2018) to be applicable to cross-attention. The work by Indurthi et al. (2019) should also be mentioned, where they pre-select a subset of source tokens on which to perform attention. Again, all of the above-mentioned works only perform experiments on sentence-level NMT. The works of Child et al. (2019), Sukhbaatar et al. (2019) and Guo et al. (2020) are also related, since they use attention windows similar to ours for tasks other than MT.
Finally, we briefly want to touch on the subject of automatic evaluation of document-level MT systems. Many works only report results on general MT metrics like BLEU (Papineni et al., 2002), sometimes matching n-grams across sentence boundaries. However, it has been argued that these metrics do not capture well the very specific improvements that can be expected from including document-level context, and that the reported improvements rather come from regularization effects and from comparing to suboptimal baseline performance (Kim et al., 2019; Li et al., 2020; Nguyen et al., 2021). Several targeted test suites have been released to better assess the improvements gained by document-level NMT (Müller et al., 2018; Bawden et al., 2018; Voita et al., 2019; Jwalapuram et al., 2019). These test suites have some limitations: for example, they are language-specific and they are based on just scoring predefined contrastive examples without scoring the actual translations. More recently, Jiang et al. (2022) and Currey et al. (2022) have released frameworks that allow scoring the actual MT hypotheses with regard to their consistency regarding specific aspects of the translation.

Methodology
Here, we explain the baseline concatenation approach (Section 3.1), the more refined method that we compare ourselves against (Section 3.2), as well as our own approach (Section 3.3). We also discuss our different evaluation approaches in Section 3.5.

The Baseline Concatenation Approach
The baseline concatenation approach is very simple and follows Tiedemann and Scherrer (2017), using the vanilla transformer architecture (Vaswani et al., 2017). Assume we are given a document consisting of source sentences F_1, ..., F_N with corresponding target sentences E_1, ..., E_N.
If we want our model to have a context length of k sentences, we simply concatenate the current input sentence with its k−1 predecessor sentences, so the input to the model is

F_{n−k+1} <sep> ... <sep> F_{n−1} <sep> F_n <eos>

while on the target side we include the preceding sentences as a prefix:

E_{n−k+1} <sep> ... <sep> E_{n−1} <sep>

We use a special token <sep> as a separator between adjacent sentences, and <eos> denotes the end of the sequence. This is done to make it easier for the model to distinguish between the sentence that needs to be translated and the context. Furthermore, we use a special token F_0 = E_0 = <bod> to denote the start of a document. Since we use the vanilla transformer architecture with self-attention and cross-attention components, the memory usage is O(L^2), with L being the sequence length.
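As a concrete illustration, the segment construction above can be sketched as follows (a minimal sketch; the function name and the representation of sentences as token lists are our own assumptions, with F_0 = <bod> handled via a sentinel):

```python
def build_concat_input(sentences, n, k):
    """Build the source input for sentence n (1-based) with context size k.

    sentences: list of token lists F_1..F_N; a <bod> sentinel acts as F_0.
    Returns F_{n-k+1} <sep> ... <sep> F_n <eos> as a flat token list.
    """
    doc = [["<bod>"]] + sentences          # prepend F_0 = <bod>
    lo = max(0, n - k + 1)                 # clip the context at the document start
    tokens = []
    for j, sent in enumerate(doc[lo:n + 1]):
        if j > 0:
            tokens.append("<sep>")         # separator between adjacent sentences
        tokens.extend(sent)
    return tokens + ["<eos>"]              # end-of-sequence marker
```

For the first sentence of a document (n = 1), the context window naturally includes the <bod> token, mirroring the F_0 = E_0 = <bod> convention.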
When we train full document-level systems, we simply concatenate all sentences in the document, again using the special <sep> token. Due to hardware limitations, if the length of the target side of the document exceeds 1000 tokens, we split the document into smaller parts of roughly equal length (i.e. a document of length 1500 tokens would be split into two parts with ca. 750 tokens each).
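The splitting of over-long documents can be sketched as follows (a greedy sketch under our own assumptions; the text only specifies parts of roughly equal length at sentence boundaries, not the exact boundary rule):

```python
def split_document(sent_lengths, max_tokens=1000):
    """Split a document at sentence boundaries into parts of roughly
    equal token count, aiming for ceil(total / max_tokens) parts.

    sent_lengths: target-side token counts per sentence.
    Returns a list of parts, each a list of sentence lengths.
    """
    total = sum(sent_lengths)
    n_parts = -(-total // max_tokens)      # ceil division
    target = total / n_parts               # ideal tokens per part
    parts, current, acc = [], [], 0
    for length in sent_lengths:
        # close the current part if adding this sentence would move us
        # further away from the ideal part size
        if current and abs(acc + length - target) > abs(acc - target):
            parts.append(current)
            current, acc = [], 0
        current.append(length)
        acc += length
    if current:
        parts.append(current)
    return parts
```

Short documents pass through unchanged, while a 1500-token document with equally long sentences is split at the boundary closest to the 750-token midpoint.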
In a preliminary study, we train systems using no context (sentence-level), just a single sentence as context, as well as the maximum context size of 1000 tokens. When looking at the percentage of attention that is paid to the n-th source sentence F_n when decoding the n-th target sentence E_n (extracted from the cross-attention module, see Table 1), we find that this percentage becomes lower as the context size increases. This finding motivates us to explore approaches that bias the attention towards the current sentence.

LST-attention
This method was proposed by Zhang et al. (2020) and is called Long-Short Term (LST) attention. The authors find that their approach outperforms the baseline concatenation approach, but they only use a maximum of 3 sentences as context. Nevertheless, we deem this approach promising, since it also focuses the attention onto the current sentence. The input to the system is augmented in the same way as described in Section 3.1. Given some queries q_i, keys k_j and values v_j, Zhang et al. (2020) formulate their restricted version of the attention as

e_{ij} = q_i^T k_j / sqrt(d) + M_{ij}    (1)

with d being the hidden dimension of the model and M ∈ R^{I×J} being the masking matrix. This masking matrix is defined as

M_{ij} = 0 if s(i) = s(j), and M_{ij} = −∞ otherwise,

where s(·) ∈ {1, ..., N} is a function that returns the sentence index that a certain position belongs to. This means we are restricting the attention to be calculated only within the current sentence. For self-attention in the encoder and the decoder, Zhang et al. (2020) calculate both the restricted and the non-restricted variant and then combine the output context vectors via concatenation and a linear transformation. The cross-attention between encoder and decoder remains unchanged in this approach, and the memory consumption remains O(L^2).
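The sentence-restricted mask and its effect can be sketched numerically (a minimal sketch; the function names are ours, and s(·) is given as an explicit list of sentence indices per position):

```python
import numpy as np

def lst_mask(sent_ids):
    """M[i, j] = 0 where positions i and j lie in the same sentence,
    -inf elsewhere, so softmax zeroes out cross-sentence weights."""
    s = np.asarray(sent_ids)
    return np.where(s[:, None] == s[None, :], 0.0, -np.inf)

def restricted_self_attention(Q, K, V, sent_ids):
    """softmax(Q K^T / sqrt(d) + M) V, restricted to the current sentence."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d) + lst_mask(sent_ids)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)                        # masked entries become 0
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

Zhang et al. (2020) compute both this restricted variant and the unrestricted one and merge the two context vectors; the mask alone already guarantees that a position in one sentence receives zero weight from any other sentence.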

window-attention
This method is proposed by us. We can use the same formulation as above to describe this approach, by simply changing the definition of the attention mask to

M_{ij} = 0 if |j − b_i| ≤ w, and M_{ij} = −∞ otherwise,

where w is the window size and b_i ∈ {1, ..., J} is a target-source alignment. This means a certain query vector q_i is only allowed to attend to the key vectors k_j that surround the position b_i that this query vector is aligned to. We replace all self-attention and cross-attention modules in our network with this window-attention variant. Please note that in practice we do not calculate this mask; instead, we first select the corresponding key vectors for each query and then calculate attention only between these subsets, which reduces the memory consumption from O(L^2) to O(L · w).
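The gather-then-attend trick can be sketched as follows (a minimal numpy sketch with our own function name; a real implementation would batch the window gathering instead of looping):

```python
import numpy as np

def window_attention(Q, K, V, b, w):
    """Each query i attends only to keys within w positions of its
    aligned position b[i]; memory is O(L * (2w + 1)) instead of O(L^2)."""
    L_k, d = K.shape
    out = np.empty((Q.shape[0], V.shape[1]))
    for i in range(Q.shape[0]):
        lo = max(0, b[i] - w)              # clip window at sequence borders
        hi = min(L_k, b[i] + w + 1)
        scores = Q[i] @ K[lo:hi].T / np.sqrt(d)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        out[i] = weights @ V[lo:hi]
    return out
```

With the identity alignment b_i = i and w = 0, each position can only attend to itself, which makes the restriction easy to verify.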
We also want to point out that with this approach, the context is not as restricted as it seems at first glance. For any individual attention module, the context is restricted to 2 · w or w for self-attention in the encoder and decoder, respectively. However, since in the transformer architecture we stack multiple layers, the final effective context size grows with the number of stacked layers. This approach requires us to define an alignment function b. For self-attention, we assume a 1-1 alignment, so the alignment function is the identity function b_i = i. For cross-attention, during training we use a linear alignment function b_i = round(i · J/I), where J is the number of tokens in the source document and I is the number of tokens in the target document. This is not possible during decoding, as we do not know the target document length beforehand. Therefore, we propose three different ways to approximate the alignment during decoding, where we define train_ratio as the average source-target length ratio over all documents in the training data.
3. sent-align: assume we have already produced N′ full target sentences (i.e. we have produced N′ <sep> tokens) up to this point; then, when starting to decode a new sentence, we set b_i = Σ_{n=1}^{N′} J_n + 1, and otherwise advance the alignment linearly, with J_n being the length of the n-th source sentence in the input document. In simple terms, when starting to decode a new sentence, we always force-align to the beginning of the corresponding source sentence.
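The training-time linear alignment and the decoding-time sent-align jump can be sketched as follows (the linear formula follows the definition above; modeling sent-align via the sentence-start offset is our simplification):

```python
def linear_align(i, I, J):
    """Training-time alignment: map target position i in [0, I) linearly
    to a source position in [0, J)."""
    return round(i * J / I)

def sent_align_start(n_prev_sents, source_sent_lengths):
    """sent-align: source position (0-based) of the first token of source
    sentence N'+1, i.e. the total length of the N' sentences already
    covered. Used to force-align at the start of each new target sentence."""
    return sum(source_sent_lengths[:n_prev_sents])
```

Between sentence starts, the alignment would then advance step by step until the next <sep> token is produced.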
We also test the window-attention approach with relative positional encoding in the self-attention instead of absolute positional encoding, which in this framework only requires a small modification to Equation 1:

e_{ij} = q_i^T (k_j + r_{i−j}) / sqrt(d) + M_{ij}

where r_{i−j} ∈ R^d are additional learnable parameters of the network.

Decoding
During decoding, given a document F_1^N, we want to find the best translation Ê_1^N according to our model. We cannot perform exact search due to computational limitations; therefore, we have to use approximations. There exist multiple approaches for decoding with a document-level NMT system and, since we could not determine a single best approach from the literature, we describe and compare two competing approaches.

The first approach translates a whole segment of sentences at once (referred to as FSD below):

Ê_{i−k}^i = argmax_{E_{i−k}^i} p(E_{i−k}^i | F_{i−k}^i)

which is approximated using standard beam search on the token level (we use beam size 12 for all experiments). For the full document-level systems, we simply use Sequential Decoding (SD) (Miculicich et al., 2018; Voita et al., 2019; Garcia et al., 2019; Fernandes et al., 2021): we generate the translation sentence by sentence, using the previously generated target sentences as context:

Ê_n = argmax_{E_n} p(E_n | Ê_1^{n−1}, F_1^N)
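Sequential decoding can be sketched as a simple loop (the `translate` callable stands in for beam search with a trained model and is purely hypothetical):

```python
def sequential_decode(source_sents, translate):
    """Translate a document sentence by sentence (SD): the source context
    grows with each step and previously generated target sentences are
    fed back as the target-side prefix."""
    target_sents = []
    for n in range(len(source_sents)):
        src_input = " <sep> ".join(source_sents[: n + 1])
        tgt_prefix = " <sep> ".join(target_sents)
        target_sents.append(translate(src_input, tgt_prefix))
    return target_sents
```

In practice, the context would additionally be capped at the model's maximum context size; FSD instead translates a whole segment in one pass and keeps all generated sentences.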

Evaluation
For all tasks, we report BLEU (Papineni et al., 2002) and TER (Snover et al., 2006) using the SacreBLEU (Post, 2018) toolkit. In addition, for the two En-De tasks (NEWS and OS), we analyze the translations in regards to ambiguous pronouns and style. For pronouns, the goal is to measure how well a system can translate the English 3rd person pronoun 'it' (and its other forms) into the correctly gendered German form (which can be male, female or neuter depending on the context). For style, the goal is to measure how well a system can translate the 2nd person pronoun 'you' (and its other forms) into the correct style in German. For example, 'you' (singular) can be translated into 'Sie' or 'du' in German, depending on whether the setting is formal or informal. We employ several strategies to determine the systems' ability to disambiguate these phenomena.
We perform experiments on three document-level translation benchmarks. We call them NEWS (En→De) with newstest2018 as test set, TED (En→It) with tst2017 as test set, and OS (En→De), where the test set is simply called test. NEWS is a collection of news articles, TED is a collection of transcribed TED talks and their respective translations, and OS consists of subtitles for movies and TV shows. Especially the latter holds many examples of discourse between different entities. For the details regarding data conditions, preparation and training, we refer to Appendix A.2.

GPU Memory efficiency
First, we compare the GPU memory consumption of the baseline concat-adj. approach against the window-attention approach for various input sequence lengths. The results are shown in Table 2.
As expected, the memory usage increases at a much higher rate for the concat-adj. approach, while the window-attention approach scales roughly linearly, the slope being a function of the window size w.

# target tokens | concat-adj. | window-attn (w = 10) | window-attn (w = 20)
736             | 2.3 GB      | 2.4 GB               | 3.5 GB
1472            | 5.8 GB      | 3.9 GB               | 5.9 GB
2208            | 10.9 GB     | 5.2 GB               | 8.5 GB
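The scaling trend in Table 2 matches a simple count of attention scores (a back-of-the-envelope model; the absolute GB figures also include activations and parameters, which this sketch ignores):

```python
def n_attention_scores(L, w=None):
    """Number of attention scores for sequence length L: L^2 for full
    attention, L * (2w + 1) for window-attention with window size w."""
    return L * L if w is None else L * (2 * w + 1)

# Doubling the sequence length quadruples the full-attention scores
# but only doubles the windowed ones:
ratio_full = n_attention_scores(1472) / n_attention_scores(736)
ratio_win = n_attention_scores(1472, w=20) / n_attention_scores(736, w=20)
```

This is consistent with concat-adj. roughly quadrupling its attention cost from 736 to 1472 tokens, while the window-attention columns grow close to linearly.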

Comparison of Decoding Strategies
After training all models on the NEWS task according to Appendix A.2, we test the different search strategies for each of the systems; the results can be found in Table 3. For the baseline concat-adj. approach as well as the LST-attn approach, FSD works best when using long context information. For concat-adj. and LST-attn with 1000 tokens context size, SD performs very poorly. This is because, when beginning to translate a document, the input sequences are very short and these systems cannot handle that appropriately. However, FSD sometimes leads to sentence misalignment while translating a document, resulting in a lower BLEU score as well. For the window-attention approach (rel. pos. enc., sent-align, window size 20), we find that the SD decoding strategy works best. Since this approach seems to be able to better handle short input sequences, SD performs better than FSD, as it is more robust to sentence misalignment. Moving forward, all reported numbers will be generated with the best respective decoding approach, i.e. SD for window-attention and FSD for all other approaches.

Hyperparameter Tuning
Our window-attention approach has three hyperparameters that need to be tuned: (i) the positional encoding variant, (ii) the alignment variant during search, and (iii) the window size. Again, we use the NEWS task for tuning, and the results for the different variants can be found in Table 4.
In terms of positional encoding, relative works significantly better than absolute for the window-attention system. We also test relative positional encoding (window size 20) for the baseline concat-adj. method, but here the training did not converge. This is because, for long input sequences, the system without explicit target-source alignment can no longer distinguish the token ordering on the source side (on the target side it is still possible due to the causal attention mask). The only way to resolve this would be to drastically increase the window size for the relative positions; however, this would add a significant amount of additional parameters to the network, so we decide against it. In terms of alignment, the sent-align variant significantly outperforms the other approaches. For the window size, 20 works best. An important finding is that if we make the window too large, we start losing performance, probably due to the less focused attention problem discussed in Section 3.1.

Final Performance Comparison
In Table 5, we report the translation performance of the different document-level approaches on all three translation benchmarks, measured in terms of BLEU and TER. None of the document-level systems can consistently outperform the sentence-level baseline on all tasks. On the OS test set, there is a disagreement between BLEU and TER, which we think comes from the fact that the average sentence length on this test set is quite short. The hypothesis of the sentence-level system is the shortest of all hypotheses and also shorter than the reference, which gets punished more heavily by BLEU than by TER. Out of all full-document approaches, window-attention performs best and is on par with the sentence-level baseline and the document-level system using only 2 sentences as context. For full-document translation, LST-attn performs better than the baseline concatenation approach but still falls behind the sentence-level system, especially on the NEWS and TED tasks. One possible reason why these approaches work better on OS is that for this task we have much more training data available than for NEWS and TED.
We argue that this could also be the reason for the conflicting results reported by Junczys-Dowmunt (2019) and Saleh et al. (2019) compared to the other works that report performance degradation for longer context sizes (see Section 2). However, we leave a detailed analysis of this for future work.
Next, we analyze the ability of the systems to translate ambiguous pronouns and to translate in a consistent style, using the methods explained in Section 3.5. The results for the two En→De tasks can be found in Table 6. For both NEWS and OS, all document-level systems can significantly improve over the sentence-level baseline in terms of pronoun translation. We also find that a context longer than two sentences does not seem to help for the pronoun task. This is actually to be expected, since typically the distance between noun and pronoun is not that large and, according to Müller et al. (2018), the overwhelming majority of ContraPro test cases do not require more than two sentences as context. For the correct translation of the style, however, the larger context size is clearly beneficial, as the system with just 2 sentences as context can barely outperform the sentence-level baseline. To correctly infer the style of a conversation, ideally the whole dialog should be part of the context, especially the beginning of the conversation. In Table 7, we show a snippet of the test set of the OS task together with the translations of the sentence-level system and the window-attention system. This example highlights the need for long-context NMT systems, especially for the task of dialogue translation, since there we need to stay consistent in terms of style, which the sentence-level system cannot manage. Overall, the LST-attn approach performs best for the task of formality translation, but the other full-document systems are not far behind.

Table 7: Example translation of a snippet from the OpenSubtitles test set. Formal 2nd person pronouns are marked in red and informal ones are marked in blue.

source | reference
What's between you and Dr. Webber - is none of my business... | Was zwischen dir und Dr. Webber ist, geht mich nichts an...
You don't owe me an apology. | Du schuldest mir keine Entschuldigung.
You owe Dr. Bailey one. | Du schuldest Dr. Bailey eine.
We were taking a stand for Dr. Webber. | Wir haben uns für Dr. Webber eingesetzt.
I don't understand why... | Ich verstehe nicht wieso...
Dr. Webber doesn't need you to fight his battles. | Dr. Webber braucht dich nicht, um seine Schlachten zu kämpfen.
What you did stands to hurt this entire hospital. | Was du getan hast, hat dem ganzen Krankenhaus geschadet.
Your first priority needs to be this place and its patients. |

Conclusion
In this work, we focus on methods to increase the context-size for document-level NMT systems.
We point out the shortcomings of the baseline approaches to long-context document-level NMT and, in turn, propose to modify the attention component to be more focused and more memory efficient. We compare our approach against approaches from the literature on multiple translation tasks and using different targeted evaluation methods. We confirm the improved memory efficiency of the proposed method. We find that for some discourse phenomena, like pronoun translation, the longer context information is not necessary. For other aspects, like consistent style translation, the longer context is very beneficial. It seems that the baseline concatenation approach needs large amounts of training data to perform well for larger context sizes. We conclude that our approach performs among the best across all tasks and evaluation methods, with the additional benefit of reduced memory consumption for long input sequences.

Limitations
This work is about document-level NMT; we focus specifically on methods that improve model performance for long input sequences. Due to constrained resources, this work has several limitations. To be able to train all methods, including the inefficient baseline approach, we have to limit the context size to 1000 tokens. While we do compare to existing approaches, other approaches have been proposed to improve the performance of systems with long context information, which we do not compare against. We run experiments on three different tasks, but two of them are low resource and two of them translate into German, which was necessary because we only had access to German language experts for preparing the evaluation.

A.1 Pronoun and Formality Translation Evaluation
Here, we explain how we calculate the pronoun translation and formality translation F1 scores.

Pronouns
For each triplet (F_n, E_n, Ê_n) (source, reference, hypothesis) of our test data, we first check if it contains a valid ambiguous pronoun. That means, in the source sentence there must be an English 3rd person pronoun in the neutral form, and it also must be labeled as a pronoun by the English POS-tagger. We also check if a 2nd or 3rd person plural pronoun is present in the source; if that is the case, we do not consider female pronouns on the target side, since we could not distinguish whether e.g. 'sie' is the translation of 'it' or 'they'. This would require a word alignment between source and hypothesis/reference, which we do not have. If we found the example to be valid, we then check for occurrences of 3rd person pronouns in the male, female and neuter forms, in both reference and hypothesis, using a German POS-tagger as well as language-specific regular expressions. After going through the complete test data (F_n, E_n, Ê_n) sentence by sentence, we calculate an F1 score for pronoun translation based on the class-wise counts, where CP(·, ·, ·) counts the number of valid pronoun occurrences and x ∈ {male, female, neuter}.

Formality

We follow almost exactly the same steps as for detecting the pronoun translations described above. The only differences are that we check for validity slightly differently and, instead of pronouns, we check for occurrences of formal/informal style. For sentence pairs where 3rd person female/neuter or 3rd person plural pronouns are present, we do not count the formal occurrences, since we might not be able to distinguish the German translations in these cases. We calculate an F1 score for formality translation in the same way, where CP(·, ·, ·) counts the number of valid occurrences and x ∈ {formal, informal}.
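The F1 computation can be sketched as follows (a minimal sketch under our own assumptions: the per-class counts are assumed to be extracted already, and treating the per-class minimum of hypothesis and reference counts as true positives is our simplification, since the exact formula was not reproduced here):

```python
def class_f1(hyp_counts, ref_counts, classes):
    """Micro-averaged F1 over pronoun (or formality) classes.

    hyp_counts / ref_counts: dicts mapping each class x to the number of
    valid occurrences CP(.) found in the hypothesis / reference.
    """
    tp = sum(min(hyp_counts.get(x, 0), ref_counts.get(x, 0)) for x in classes)
    n_hyp = sum(hyp_counts.get(x, 0) for x in classes)
    n_ref = sum(ref_counts.get(x, 0) for x in classes)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_hyp, tp / n_ref
    return 2 * precision * recall / (precision + recall)
```

A hypothesis whose class counts exactly match the reference scores 1.0; systematically choosing the wrong gender or style drives the score towards 0.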
The POS-taggers we use are en_core_web_sm for English and de_core_news_sm for German. For both languages, spaCy claims an accuracy of 97% for POS-tagging, and in our testing we did not find a single error in pronoun tagging. For calculating the pronoun translation F1 score, we use the same ContraPro test set as described in Section 3.5 with the correct references. For calculating the formality translation F1 score, we use the test set from the OS En-De task. The statistics for both test sets are reported in Table 8. In the ContraPro test set, we have exactly 4,000 examples for each gender class. The fact that we identify more than 4,000 valid examples for the pronoun case means that in some cases we identify multiple pronouns per sentence. All in all, we find the classes to be relatively balanced for these test sets.

A.2 Dataset Statistics and Experimental Setups
For the NEWS En→De task, the parallel training data comes from the NewsCommentary v14 corpus. As validation/test set, we use the WMT newstest2015/newstest2018 test sets from the WMT news translation tasks (Farhad et al., 2021).
For the TED En→It task, the parallel training data comes from the IWSLT17 Multilingual Task (Cettolo et al., 2017). As validation set, we use the concatenation of IWSLT17.TED.dev2010 and IWSLT17.TED.tst2010, and as test set we use IWSLT17.TED.tst2017.mltlng. For the OS En→De task, the parallel training data comes from the OpenSubtitlesV2018 corpus (Lison et al., 2018). We use the same train/validation/test splits as Huo et al. (2020) and additionally remove all segments that are used in the ContraPro test suite (Müller et al., 2018) from the training data. The data statistics for all tasks can be found in Table 9. Since in the original release of ContraPro only left-side context is provided, we extract the right-side context ourselves from the OpenSubtitlesV2018 corpus based on the meta-information of the segments. For translation of the ContraPro test set, as well as for scoring the contrastive references, we take both the left- and the right-side context into account. For the full-document systems, we cap the context size for the ContraPro test set to 4 sentences for computational reasons.
We tokenize the data using byte-pair encoding (Sennrich et al., 2016; Kudo, 2018) with 15k joint merge operations (32k for OS En→De). The models are implemented using the fairseq toolkit (Ott et al., 2019), following the transformer base architecture (Vaswani et al., 2017) with dropout 0.3 and label smoothing 0.2 for NEWS En→De and TED En→It, and dropout 0.1 and label smoothing 0.1 for OS En→De. This resulted in models with ca. 51M parameters for NEWS and TED and ca. 60M parameters for OS, for both the sentence-level and the document-level systems.
Let us assume that the training data C consists of M documents D_m and each document consists of source-target sentence pairs (F_{n,m}, E_{n,m}). The goal of training is to find the optimal model parameters θ which minimize the loss function

L(θ) = − (1/M) Σ_{m=1}^{M} Σ_{n=1}^{N_m} log p_θ(E_{n−k,m}^{n,m} | F_{n−k,m}^{n,m}).

When we take full documents as input to the model, the loss function simply becomes

L(θ) = − (1/M) Σ_{m=1}^{M} log p_θ(E_{1,m}^{N_m,m} | F_{1,m}^{N_m,m}).

All systems are trained until the validation perplexity no longer improves, and the best checkpoint is also selected using validation perplexity. Training took around 24h for NEWS and TED and around 96h for OS on a single NVIDIA GeForce RTX 2080 Ti graphics card. Due to computational limitations, we report results only for a single run. For the generation of segments (see Section 3.4), we use beam search on the token level with beam size 12 and length normalization.

Table 1 :
Percentage of attention on the n-th source sentence during decoding the n-th target sentence, as well as overall translation quality measured in BLEU, for the newstest2018 test set of the NEWS task.

Table 2 :
GPU-memory consumption for the different approaches when training on a single document of specified number of target tokens.

Table 4 :
Results for the different hyperparameter settings of the window-attention system, reported on the newstest2018 test set of the NEWS task. All systems have context size 1000 tokens.

Table 5 :
Results for the different document-level approaches in terms of BLEU and TER on the three translation benchmarks. Best results for each column are highlighted. External baselines are from †Kim et al. (2019) and *Huo et al. (2020).

Table 6 :
Results for the different document-level approaches in terms of pronoun and formality translation. Best results for each column are highlighted.

Table 8 :
Number of valid examples for specific ambiguous pronoun/style translation in the reference of our test sets.

Table 9 :
Data statistics for the different document-level translation tasks.