Crossing the Threshold: Idiomatic Machine Translation through Retrieval Augmentation and Loss Weighting

Idioms are common in everyday language, but often pose a challenge to translators because their meanings do not follow from the meanings of their parts. Despite significant advances, machine translation systems still struggle to translate idiomatic expressions. We provide a simple characterization of idiomatic translation and related issues. This allows us to conduct a synthetic experiment revealing a tipping point at which transformer-based machine translation models correctly default to idiomatic translations. To expand multilingual resources, we compile a dataset of ~4k natural sentences containing idiomatic expressions in French, Finnish, and Japanese. To improve translation of natural idioms, we introduce two straightforward yet effective techniques: the strategic upweighting of training loss on potentially idiomatic sentences, and the use of retrieval-augmented models. This not only improves the accuracy of a strong pretrained MT model on idiomatic sentences by up to 13% absolute, but also holds potential benefits for non-idiomatic sentences.


Introduction
An idiom is a conventionalized expression in which the intended meaning differs from its literal translation. The translation of idioms has remained a problem for state-of-the-art research and commercial translation systems, as idioms tend to be translated literally (Dankers et al., 2022b; Shao et al., 2017; Anastasiou, 2010). Failure to translate these expressions correctly may lead to incomprehensible translations, particularly in literary text (Toral and Way, 2018). To illustrate the difficulty of understanding mistranslated idioms, we present mistranslations from commercial systems in Table 1. Although idiom translation has been recognized as a problem since before the advent of neural machine translation (Bar-Hillel, 1952; Wehrli, 1998), most work has focused on identifying and evaluating the problem cross-linguistically (Baziotis et al., 2022; Dankers et al., 2022b), or on interpreting the behaviour of transformer-based models in translating or memorizing idioms (Haviv et al., 2022; Dankers et al., 2022b). Others pose idiom identification and paraphrasing as a separate task from machine translation (Pershina et al., 2015). Comparatively fewer recent works have attempted to remedy this problem. Early work made use of idiom dictionaries and direct substitution, or example-based machine translation (Salton et al., 2014; Nagao, 1984). However, we would ideally want to make use of the contextual translation abilities of neural models. Data augmentation and the creation of new datasets have helped address this problem (Agrawal et al., 2018), but it may also be possible to use existing data resources more effectively, especially for higher-resource languages.
We first frame the general problem of non-compositional translation, which encompasses the translation of idioms and other multi-word expressions that cannot be translated word-for-word (§2). We then perform synthetic experiments in a very simple case, finding that transformer-based machine translation models generally translate word-for-word until a proportional threshold of sentences contain non-compositional expressions, at which point the translations flip to being correct (§4.1). We evaluate translations by commercial models in three natural languages, and find a drop in performance on idiomatic sentences and stronger performance on more common idioms (§4.2). We hypothesize that this may reflect trends similar to those in other long-tail phenomena, and that tactics similar to those used for rare phenomena may work here as well (Kandpal et al., 2022).
With this intuition, we improve the idiomatic translations generated by a strong pretrained machine translation model, ∆LM (Ma et al., 2021), without harming the translation quality of literal expressions. To contribute resources toward documenting idioms and improving their translation cross-linguistically, we create a dataset of sentences containing idiomatic expressions in three languages: French (fr), Finnish (fi), and Japanese (ja) (§3).
We propose two simple but effective ways to improve translation of idioms, namely upweighting training loss on potentially idiomatic sentences and retrieval augmentation (§5). We find that this can improve the idiomatic translation abilities of the model significantly, by an average of 10.4% in absolute accuracy (§7.1). Moreover, this does not harm translation of sentences where the literal sense of the idiom is used, and it improves translation of out-of-distribution sentences in French and Finnish as well. We perform human evaluation and error analysis, and find that the rate of severe semantic errors is reduced by an average of 7.52% absolute (§7.2). The ultimate aim for machine translation is to ensure accessibility for all texts. This requires addressing idiomatic phrases, culturally-informed language, and complex semantics. We demonstrate the potential for enhancing idiom translation using existing resources.

Background on Idioms
Idioms are commonly understood to be fixed expressions that contradict the principle of compositionality in language, which is to say that their meaning cannot be predicted from the meanings of their parts (Radford, 2004; Portner, 2005). Idioms occur relatively frequently in all languages, and are often challenging for non-native speakers (Cooper, 1999). For instance, a literal translation of one Portuguese idiom is "it is from little that you twist the cucumber", which is difficult to understand on its own. However, an equivalent English expression is "As the twig is bent, so is the tree inclined", which refers to actions during childhood influencing the behaviours that people have as adults (Unbabel, 2019). This example illustrates the importance of translating idioms using equivalent idioms from the target culture, or a paraphrase if there is no equivalent. Idiomatic expressions are heavily shaped by the culture of language speakers, including religious beliefs, history, geography, and cuisine. For instance, food-related idioms in English tend to refer to foods such as beef and potatoes, while in Chinese, these idioms tend to refer more to rice and tofu (Yang, 2010). Cross-cultural knowledge is important in choosing a translation that conveys the proper intent to readers in the target language (Liu, 2012). Overly-literal translations and lack of broader context are two reasons why machine translation is still not at parity with human translators, particularly when translating literary text (Matusov, 2019; Omar and Gomaa, 2020; Poibeau, 2022).

Formal definition
We use the idea of non-compositionality to frame idiomatic translation more precisely. Let X = {x_1, ..., x_N} be the set of tokens in the source language, and Y = {y_1, ..., y_M} be the set of tokens in the target language. Suppose that we have an oracle function TRANSLATE : X* → Y* that always produces a correct translation. We can imagine this to be a helpful speaker who is perfectly familiar with both languages and never misreads text. Then we can say that a multi-token string requires non-compositional translation if it can be translated correctly by the oracle as a whole, but it cannot be translated correctly by individually translating parts of the sentence and joining them (according to the target language's word order). In other words, a string of tokens x_1, ..., x_n requires non-compositional translation if

TRANSLATE(x_1 ⋯ x_n) ≠ TRANSLATE(x_1) ⋅_Y ⋯ ⋅_Y TRANSLATE(x_n),

where ⋅_Y denotes string concatenation given the word order of the target language. We note that this definition is very general and also includes other phenomena such as multi-word expressions and named entities. However, we can now use this definition to create a relevant synthetic task, allowing us to observe translation compositionality under different settings (§4.1).
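The definition above can be sketched directly in code. Everything below (the toy oracle, lexicon, and phrase table) is invented for illustration and is not part of the paper's data or method:

```python
def is_noncompositional(tokens, oracle, concat):
    """Check whether a token string requires non-compositional translation.

    `oracle` stands in for a hypothetical perfect translation function and
    `concat` joins part-translations in the target language's word order;
    both are assumptions for illustration, since no such oracle exists.
    """
    whole = oracle(tokens)
    parts = concat([oracle([t]) for t in tokens])
    return whole != parts

# Toy oracle: "kick the bucket" -> "die", but word-for-word it stays literal.
LEXICON = {"kick": "kick", "the": "the", "bucket": "bucket"}
PHRASES = {("kick", "the", "bucket"): ["die"]}

def toy_oracle(tokens):
    key = tuple(tokens)
    if key in PHRASES:
        return PHRASES[key]
    return [LEXICON[t] for t in tokens]

def toy_concat(parts):
    # English-like word order: simple left-to-right join.
    return [tok for part in parts for tok in part]

print(is_noncompositional(["kick", "the", "bucket"], toy_oracle, toy_concat))  # True
print(is_noncompositional(["kick", "the"], toy_oracle, toy_concat))            # False
```

The idiom fails the check precisely because the whole-string translation ("die") differs from the joined part-by-part translation.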

Idioms and Data Collection
We can use the formal definition from the previous section to generate synthetic data for experiments.However, we ultimately want to improve translation of real idioms.To do so, we collect a dataset of natural sentences to evaluate commercial systems and the model we seek to improve.
Although a large corpus of potentially idiomatic expressions exists in English (Haagsma et al., 2020), there are no readily accessible equivalents in other languages. Therefore, we collected idioms in French, Finnish, and Japanese from language-learning sites, listed in Appendix B. These languages were chosen for phylogenetic diversity and for the availability of commercial translation systems. In total, 148 French idioms were collected, 92 in Finnish, and 1336 in Japanese.
To collect sentences containing these idioms, we matched on lemmatized forms from the 2018 version of OpenSubtitles (Lison et al., 2018), where lemmatization was performed with Stanza (Qi et al., 2020). In total, there were 85632 French sentences containing potentially idiomatic expressions, 51811 Finnish sentences, and 23018 Japanese sentences. To filter out unaligned sentences, we scored each source and reference sentence pair using COMET-QE (Rei et al., 2020) and removed the bottom 10% of each language's sentences by COMET-QE score.
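The lemma-based matching step might look like the following sketch. The `lemmatize` callable stands in for a real lemmatizer such as Stanza, and `TOY_LEMMAS` is a made-up stand-in for illustration:

```python
def find_idiom_matches(sentences, idiom_lemmas, lemmatize):
    """Return (sentence, idiom) pairs where the sentence's lemmatized form
    contains an idiom's lemma sequence as a contiguous subsequence."""
    matches = []
    for sent in sentences:
        lemmas = lemmatize(sent)
        for idiom in idiom_lemmas:
            n = len(idiom)
            if any(lemmas[i:i + n] == idiom for i in range(len(lemmas) - n + 1)):
                matches.append((sent, tuple(idiom)))
                break  # one hit per sentence is enough for collection
    return matches

# Toy lemmatizer for illustration only (Stanza would be used in practice).
TOY_LEMMAS = {"kicked": "kick", "buckets": "bucket"}

def toy_lemmatize(sentence):
    return [TOY_LEMMAS.get(w.lower().strip("."), w.lower().strip("."))
            for w in sentence.split()]

hits = find_idiom_matches(
    ["He kicked the bucket.", "She bought two buckets."],
    [["kick", "the", "bucket"]],
    toy_lemmatize,
)
print(hits)  # [('He kicked the bucket.', ('kick', 'the', 'bucket'))]
```

Matching on lemmas rather than surface forms is what lets inflected occurrences ("kicked the bucket") be retrieved by a single idiom entry.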
Some idioms have a plausible literal meaning (such as "kick the bucket" to mean kicking a physical bucket). To make sure that all examples in the idiomatic test set were actually idiomatic, we sorted sentences into an idiomatic test set, where the idiomatic meaning of a phrase was used (e.g. "to die"), and a literal test set, where the literal meaning of the phrase was used (e.g. kicking a physical bucket). The first 100 examples containing each idiom's lemmatized form were collected, and up to the first 3 (for Japanese) or 5 (for Finnish and French) literal and figurative examples in this set were kept, to avoid very common idioms dominating the test set. This created two test sets related to the idiom list for each language: the idiomatic and literal test sets.
Finally, we collect a random test set from an alternate source, the Ted Talks corpus (Reimers and Gurevych, 2020). This is to ensure that translation quality of other, unrelated sentences is not impacted by any modifications meant to improve translation of idioms. We collect sentences from Ted Talks, rather than OpenSubtitles, because this also allows us to examine a domain shift: the topics discussed and vocabulary used in Ted Talks may be slightly different from what appears in movies or TV shows. To control for translation length as a source of difficulty, sentences were length-matched on the target side with corresponding sentences in the idiomatic set. This created the random set, which is the same size as the idiomatic test set. All three test sets are summarized in Table 2.

Table 2: Size of test sets for each language. The idiomatic and literal sentences contain strings matching known idioms (after lemmatization), while the random set contains unrelated sentences from the Ted Talks corpus.

Evaluating Non-Compositional Translation

Artificial Language Translation
We first use the definition of non-compositional translation from (§2) to create a synthetic task. This allows us to gain an understanding of how much data is required to memorize non-compositional patterns. Although this experiment is not realistic relative to natural language (notably, there is no token-level ambiguity), synthetic experiments allow us to easily extend the data generation setup and examine model behaviour under many different conditions, such as informativity.
We generated synthetic training corpora of several sizes containing different numbers of occurrences of the non-compositional rule 0 1 → 12. The number of training sentences ranged from 100k to 10M, while the number of non-compositional occurrences ranged from 10 to 1M. We examined two informativity conditions, corresponding to the case where the context provides no information (tokens are randomized around the non-compositional expression), and the case where the context is perfectly informative. The perfect informativity condition was achieved by adding the canary token "11" to the source vocabulary, and only inserting this token immediately before the non-compositional pattern "0 1".
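A minimal generator for such a corpus could look like the sketch below. The exact token inventory and sentence sampling are assumptions, since the paper only specifies the rule, the canary token, and the informativity conditions:

```python
import random

def make_synthetic_corpus(n_sentences, n_noncomp, vocab_size=10,
                          max_len=6, informative=False, seed=0):
    """Generate a toy parallel corpus where each token translates to itself,
    except the non-compositional pair "0 1", which translates to "12".

    With `informative=True`, the canary token "11" is inserted just before
    the pair on the source side, mirroring the perfect-informativity setup.
    (Token inventory and length sampling are illustrative assumptions; e.g.
    we force length >= 2 so the pair always fits.)
    """
    rng = random.Random(seed)
    corpus = []
    for i in range(n_sentences):
        length = rng.randint(2, max_len)
        # Ordinary tokens are drawn from {2, ..., vocab_size-1} so that
        # "0", "1", and "11" only appear where we insert them.
        src = [str(rng.randrange(2, vocab_size)) for _ in range(length)]
        if i < n_noncomp:
            pos = rng.randrange(len(src) - 1)
            src[pos:pos + 2] = ["0", "1"]
            tgt = src[:pos] + ["12"] + src[pos + 2:]
            if informative:
                src = src[:pos] + ["11"] + src[pos:]  # canary on source only
        else:
            tgt = list(src)  # purely compositional: identity translation
        corpus.append((" ".join(src), " ".join(tgt)))
    rng.shuffle(corpus)
    return corpus
```

Varying `n_sentences` and `n_noncomp` reproduces the two axes swept in the experiment (corpus size and number of non-compositional occurrences).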
We experimented with three different transformer sizes (Vaswani et al., 2017), each of which had a hidden dimension and embedding size of 512, as well as 16 attention heads. Only the number of encoder and decoder layers varied: the small transformer had 3 encoder and decoder layers, the medium transformer 8, and the large transformer 16. We fix the number of epochs for the small, medium, and large models to respectively be 10, 20, and 30 in the non-informative case, and 15, 15, and 25 in the informative case. Further training details can be found in Appendix A.
Although this may seem like a simple task, we found it surprisingly difficult for models to learn this non-compositional pattern. Results in each setting, averaged across 5 random seeds, are presented in Figure 1. Especially for the small model, there is a sharp gradation from translating none of the non-compositional expressions correctly to translating them all correctly, which occurs when roughly 10% of training data contains the non-compositional pattern. A similar trend exists for larger models, but the threshold is less distinct. This corroborates the tendency for transformers to translate non-compositional phrases literally (Dankers et al., 2022b). Comparatively less data is required when the context is informative, but the trends remain similar to the non-informative case. As model size and corpus size increase, the rate of correct translations for non-compositional examples actually drops, contrary to expectation.
It is unlikely that any individual idiom occurs in 10% of sentences in natural language. Due to the highly regular translation rules in this synthetic language, there may be a stronger bias toward translating compositionally in this experiment. However, we gain the intuition that idioms can be translated effectively if they appear frequently, and that clear context clues reduce the amount of data required.

Evaluation of Commercial Systems
Although synthetic experiments provide intuition on the difficulty of translating idioms, one might ask whether similar results hold in natural language. To answer this, we examine the performance of commercial systems on the test sets from (§3): Google Translate and DeepL on Finnish, French, and Japanese idiomatic, literal, and random sentences. Results are in Table 3. We observe drops in translation quality on idiomatic sentences in all languages, with lower automatic metrics overall. Although it is impossible for us to determine what data these commercial systems were trained on, we use the frequency of each idiom within OpenSubtitles as a proxy for its overall frequency in the training data, and bucket idioms into quintiles based on their occurrence frequency in source text. As idioms become more frequent, the quality of translations increases. An example for DeepL on the French idiom set is shown in Figure 2. Trends for other languages and systems are in Appendix G. This indicates that, as in the synthetic experiments, there may be strong frequency effects on translation quality of idioms.
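The quintile analysis can be sketched as a simple ranking-and-chunking step (a sketch; the paper does not specify its exact tie-breaking or bucket-boundary handling):

```python
def bucket_by_frequency(idiom_counts, n_buckets=5):
    """Bucket idioms into frequency quantiles, lowest-frequency bucket first.

    `idiom_counts` maps each idiom to its occurrence count in the corpus
    (here, OpenSubtitles source text). Ties are broken by sort order, and
    the last bucket may be smaller if the counts don't divide evenly.
    """
    ranked = sorted(idiom_counts, key=idiom_counts.get)
    size = -(-len(ranked) // n_buckets)  # ceiling division
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]

counts = {"idiom_a": 3, "idiom_b": 120, "idiom_c": 15, "idiom_d": 700}
print(bucket_by_frequency(counts, n_buckets=2))
```

Per-bucket translation quality (e.g. mean BLEU over sentences containing each bucket's idioms) can then be plotted against the quintile index.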

Methods to Improve Non-Compositional Translation
We explore two methods to improve translation: loss weighting and kNN-MT. Both methods are relatively simple to apply: loss weighting only requires a list of potentially idiomatic phrases in the source language, and kNN-MT only requires enough space on disk to save the datastores.
More formally, we consider the basic case of autoregressive machine translation, with a set of N parallel sentence pairs in the source and target languages, D = {(x^(1), y^(1)), ..., (x^(N), y^(N))}. The model p_θ with parameters θ is trained by minimizing the negative log-likelihood loss

L(θ) = - Σ_{i=1}^{N} log p_θ(y^(i) | x^(i)).

Upweighting here refers to sentence-level upweighting, where there is a set of sentences A that we would like to upweight with a weight coefficient α, so that the loss term for sentence i is multiplied by α if (x^(i), y^(i)) ∈ A and by 1 otherwise. In this case, A is the set of potentially idiomatic sentences. We keep all other parameters for training the same as in the base model.
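The sentence-level upweighting can be sketched as follows; the token probabilities, flags, and alpha value below are illustrative, not the paper's actual training configuration:

```python
import math

def upweighted_nll(batch, idiom_flags, alpha=2.0):
    """Sentence-level upweighted negative log-likelihood (a minimal sketch).

    `batch` holds, per sentence, the model's probability for each gold
    target token; sentences flagged as potentially idiomatic get weight
    alpha, all others weight 1. In a real training loop this weighting
    would be applied to per-sentence losses inside the framework.
    """
    total = 0.0
    for token_probs, is_idiomatic in zip(batch, idiom_flags):
        weight = alpha if is_idiomatic else 1.0
        sentence_nll = sum(-math.log(p) for p in token_probs)
        total += weight * sentence_nll
    return total

# Two sentences: the first (idiomatic) counts double toward the loss.
loss = upweighted_nll([[0.5, 0.5], [0.5]], [True, False], alpha=2.0)
print(loss)
```

With alpha = 1 this reduces exactly to the standard negative log-likelihood.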
kNN-MT augments a translation model with a retrieval component (Khandelwal et al., 2021). Given each sentence pair (x, y), we construct a datastore with keys based on hidden representations from the translation model, and values being the next word in the target sentence.
During generation, a probability distribution over next words can be computed based on the retrieved next words and the distance of their keys to the current context. A parameter λ controls the interpolation between the distribution over next words predicted by the base model and the distribution predicted by the retrieved k neighbours:

p(y_t | x, y_<t) = λ p_kNN(y_t | x, y_<t) + (1 - λ) p_MT(y_t | x, y_<t).

We also combine loss weighting with kNN-MT, where a model is trained with sentence upweighting and then interpolated with a datastore built from representations of the upweight-trained model.
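A simplified version of the interpolation step, operating on explicit (token, distance) pairs rather than faiss-retrieved hidden states, might look like:

```python
import math

def knn_interpolate(p_model, neighbors, lam, temperature):
    """Interpolate the base model's next-token distribution with a kNN
    distribution built from retrieved (target_token, distance) pairs.

    This is a sketch of the kNN-MT combination step; a real system
    retrieves neighbours via approximate search over hidden states.
    """
    # kNN distribution: softmax over negative distances, aggregated by token.
    weights = [math.exp(-d / temperature) for _, d in neighbors]
    z = sum(weights)
    p_knn = {}
    for (tok, _), w in zip(neighbors, weights):
        p_knn[tok] = p_knn.get(tok, 0.0) + w / z
    # p(y_t) = lam * p_kNN(y_t) + (1 - lam) * p_model(y_t)
    vocab = set(p_model) | set(p_knn)
    return {t: lam * p_knn.get(t, 0.0) + (1 - lam) * p_model.get(t, 0.0)
            for t in vocab}

# The model prefers the literal "kick", but both retrieved neighbours
# say "die", pulling the interpolated distribution toward the idiom.
p = knn_interpolate({"die": 0.2, "kick": 0.8},
                    [("die", 0.0), ("die", 0.0)],
                    lam=0.5, temperature=1.0)
print(p)  # {'die': 0.6, 'kick': 0.4} (up to dict ordering)
```

Lower temperatures sharpen the kNN distribution toward the nearest neighbours, while λ = 0 recovers the base model exactly.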
Intuitively, these methods make sense for idiom translation: we have previously seen that one problem with non-compositional phrases may simply be their rarity. Upweighting training examples that contain idioms may help with under-representation. Furthermore, retrieving similar examples may find occurrences of the same idiom that were translated correctly.

Experimental Settings
We run experiments on ∆LM-base, a transformer encoder-decoder model with 360M parameters, a larger version of which ranked first in the WMT21 multilingual translation task (Ma et al., 2021). We train one ∆LM model for each language pair. Each model was trained for 2 million steps, and the checkpoint with the best loss on the validation set was kept. Further details are in Appendix C. To decode, we used beam search with a beam size of 5.

Data
Models were trained on OpenSubtitles for each language pair. Data from the test sets were removed, and 10% of the remaining data was used as a validation set. There were 33.8M sentences in the fr-en train set, 22.0M in fi-en, and 1.6M in ja-en.

Evaluation
We use multiple automatic metrics to evaluate translation quality. However, due to the importance of accurate semantic evaluation, the authors (native English speakers, also fluent in French and Japanese) conduct a human evaluation inspired by MQM (Lommel et al., 2014). Only errors that would fall under the "terminology" and "accuracy" error types are considered, as we are focused on severe semantic errors. We give a score of 0 for severe errors and a score of 0.5 for major errors. A score of 1 is given otherwise. Exact evaluation standards are in Appendix D.

Automatic and Human Evaluation
In most cases, as reported in Figure 3, using a combination of sentence upweighting and kNN-MT led to the greatest increase in automatic metrics on all three test sets: up to 3.08 BLEU points on the idiomatic test set (fr), 2.69 BLEU points on the literal test set (fi), and 5.75 points on the random test set (fr). In all cases except ja-rand, using one or more of these methods improved over the baseline. Exact numerical results are in Appendix I. We evaluate the statistical significance of the results through a one-tailed permutation test (Graham et al., 2014); further details are in Appendix E, and exact results are in Appendix F. For Finnish, significance is achieved for all three test sets, and for French, for the idiomatic and random test sets. For Japanese, the values achieved are borderline but not significant.
As our focus is on mitigating semantic errors, we mostly focus on the results of human evaluation, which are summarized in Table 4. Here, we also find that using both sentence upweighting and kNN is the best condition in most cases, increasing accuracy by roughly 13% in French and Finnish, and 4.5% in Japanese, for idiomatic sentences. Encouragingly, this does not overly harm translation of literal sentences: accuracy on the literal set either increases slightly (by roughly 4% in French and Finnish) or decreases very slightly (by roughly 0.4% in Japanese). For the random set, the combination of sentence upweighting and kNN-MT improves accuracy by around 7%. However, in Japanese, performance on the random test set decreases by 4%. In all cases except ja-rand, one or more of these methods improves over the baseline.
We note that the Japanese model was trained on roughly 1/10th of the data of the French and Finnish models, so its translations are not as high-quality. This also leads to a much smaller datastore, which may explain the weaker performance on the random set.

Error analysis
We repeat the frequency analysis performed on commercial systems (§4.2) for ∆LM, and find that adding upweighting and kNN-MT generally improves translations at all frequency levels. These increases are not concentrated in low-frequency idioms, so more common idioms continue to be translated better. A representative example (for French) is in Figure 4; a complete set of plots is in Appendix H. We examine the rate of severe and major errors made in the base model and the upweight+knn model in Table 5. In French and Finnish, the rate of critical errors decreased greatly, particularly in the idiomatic and random test sets; this is true to a lesser extent in Japanese. Major errors also decreased, to a lesser extent. The only test set where errors increase is again the ja-rand test set. We note that it is possible for the rate of major errors to be higher in the upweight+knn model because some severe errors transitioned to major errors. One question is why the error rate on out-of-distribution sentences drops for French and Finnish. In fi-rand, the severe error rate more than halves (0.1317 → 0.0603), and in fr-rand, it nearly halves (0.1624 → 0.09407). However, it is unclear why this should be the case. We examined sentences where the original translation was incorrect but the upweight+knn translation was correct, and found that they tended to contain named entities. For instance, for the sentence "La chirurgie à coeur ouvert au Nigeria, c'est un gros problème. (Open heart surgery in Nigeria - big trouble.)", the base model incorrectly produced the translation "Open-heart surgery in Forbes, that's a big problem.", while the upweight+knn model translated correctly. In some cases, words with multiple possible translations (e.g. spectre: ghost, spectrum) became correctly translated. "Mais regardez le nombre de lignes noires dans ce spectre. (But look at the number of black lines in that spectrum.)" was originally translated incorrectly as "But look at the number of black lines in that ghost".

Related Work
Recent work has raised the issue of idiom handling in MT (Baziotis et al., 2022; Dankers et al., 2022a,b). There is historical recognition of the problem, including of multi-word expressions (Sag et al., 2002; Calzolari et al., 2002; Zaninello and Birch, 2020). This has historically motivated example-based machine translation (Nagao, 1984). Similar motivations underlie the use of kNN-MT. However, neural models may already be capable of translating idiomatic phrases if they appear often enough in training data.
Other works focus on data augmentation and creating new data resources (Ho et al., 2014; Fadaee et al., 2018; Agrawal et al., 2018; Haagsma et al., 2020). A related task is the detection of conventionalized metaphors (Levin et al., 2014). Automatic identification of idiomatic phrases, as well as data augmentation, are promising avenues to improve performance in lower-resource languages.
Instance weighting has been explored previously in the MT literature, but mostly in the context of domain adaptation, rather than to improve translations of rare or non-compositional phrases within the same domain (Foster et al., 2010; Wang et al., 2017).
Idiomatic phrases are a prototypical case of phrases that need to be memorized (Haviv et al., 2022). Many also occur infrequently in training data, which may make it difficult for transformer-based models to translate them (Kandpal et al., 2022). This can be mitigated, as we have shown in this paper. However, more work is needed to effectively learn idioms and other infrequent linguistic elements with few repetitions.

Conclusion
We highlight the challenge idiomatic expressions pose to machine translation systems and provide simple solutions to improve performance. Through synthetic experiments, we identify a threshold at which transformer-based models correctly default to idiomatic translations. We develop a dataset of sentences containing idiomatic expressions in French, Finnish, and Japanese, and introduce two techniques, upweighting training loss on potentially idiomatic sentences and augmenting models with kNN-MT, which enhance the idiomatic translation accuracy of a strong model while offering potential benefits for non-idiomatic sentences.
Future research could extend these techniques to additional languages, and explore their effectiveness in dealing with other long-tail phenomena. We hope that this work contributes toward increasing the intelligibility of translations containing idioms or set phrases. Ultimately, for machine translation to be useful for everyone without causing misunderstandings, "last mile" problems involving cultural knowledge, long-tail phenomena, and complex semantic evaluation should be taken into account.

Limitations
Our research provides a first step toward capturing non-compositional expressions in machine translation. However, we do not conclusively solve the problem, as ideally a machine translation system should be able to learn any idiom or non-compositional phrase from a few examples.
First, our experiments were conducted on a select group of languages (Finnish, French, and Japanese), which do not fully capture the variety and complexity of languages worldwide. Given the diversity of language structures and idiomatic expressions, the generality of our findings to languages with drastically different grammatical structures or idiom usage patterns remains uncertain.
Next is our use of synthetic data. While synthetic data allowed us to control for certain variables, our setting is purposefully simplified, potentially limiting the ecological validity of our findings. Although our synthetic language was designed to mimic non-compositional translation issues, it may not encapsulate the full extent of such complexities in real-world languages. Namely, there is only one non-compositional pattern, and the remaining translations are one-to-one mappings.
Our research also depends on the quality and representativeness of the training and evaluation corpora. For instance, certain idioms may be overrepresented or underrepresented, which could affect the translation performance.
Lastly, our improvement methods, namely upweighting and kNN-MT, have inherent limitations. Upweighting could lead to overfitting on idiomatic expressions and may not be as effective when idioms occur infrequently in the data. On the other hand, kNN-MT might not yield significant improvements if the idiom or its correct translation rarely appears in the training data, limiting its utility in such scenarios.
Future work could address these limitations by expanding the linguistic scope of the study, exploring more complex methods or architectures, or investigating to what extent similar techniques can be applied to related issues in semantic preservation during machine translation.

A Synthetic Dataset Training Details
Three encoder/decoder transformer sizes were trained, differing only in the number of encoder/decoder layers. The small size had 3 encoder/decoder layers, the medium size had 8, and the large size had 16. For all models, the hidden dimension was 512, the embedding dimension was 512, and there were 16 attention heads.
In the experiments without informative context, the small transformer was trained for 10 epochs, the medium transformer for 20, and the large for 30. In experiments with context, this was changed to 15, 15, and 25 respectively. These values were based on early experimentation with loss plateaus on the validation set.
Sentences in the synthetic dataset were composed of tokens as described in subsection 4.1. Sentences were constrained to be 1-6 tokens in length.
Experiments with the synthetic dataset were implemented in PyTorch.

C OpenSubtitles Training Details
A separate ∆LM-base model was trained from the pretrained checkpoint for 2M steps on each language. The Adam optimizer was used (Kingma and Ba, 2017), with a learning rate of 1e-4, betas of (0.9, 0.98), and an inverse square root learning rate scheduler with 4000 warmup updates (with a warmup learning rate of 1e-7 and a minimum learning rate of 1e-9). The maximum number of tokens in a batch was set to 1024, with maximum source and target lengths of 512. Label smoothing of 0.1 was used, and the loss function was cross-entropy.
For kNN-MT, three datastores were built for approximate kNN search using the training set from OpenSubtitles. These datastores were built with the faiss library (Johnson et al., 2019). The Finnish datastore contained 248M vectors with 507k centroids, the French datastore 348M vectors with 713k centroids, and the Japanese datastore 17.8M vectors with 73k centroids. All vectors were stored in fp16 with a code size of 32. The vectors used as keys in the datastore correspond to the input to the last feedforward layer. Additionally, a hyperparameter search for each language was carried out on the validation set, over values of λ ∈ {0.2, 0.4, 0.6, 0.8}, temperature ∈ {0.1, 1, 10}, and number of retrieved neighbours ∈ {5, 10, 15, 20}. Hyperparameters were selected based on BLEU score on the validation set.
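The hyperparameter search over these grids reduces to an exhaustive sweep; in this sketch, `score_fn` is an assumed callable returning validation BLEU for one configuration (the real search would decode the full validation set each time):

```python
from itertools import product

def grid_search_knn(score_fn,
                    lams=(0.2, 0.4, 0.6, 0.8),
                    temps=(0.1, 1, 10),
                    ks=(5, 10, 15, 20)):
    """Exhaustive search over the kNN-MT hyperparameter grid listed above.

    Returns the (lambda, temperature, k) triple with the highest score
    from `score_fn`, along with that score.
    """
    best, best_score = None, float("-inf")
    for lam, temp, k in product(lams, temps, ks):
        score = score_fn(lam, temp, k)
        if score > best_score:
            best, best_score = (lam, temp, k), score
    return best, best_score
```

With the grids above this is only 4 x 3 x 4 = 48 configurations per language, so exhaustive search is affordable even though each evaluation requires decoding the validation set.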

D Standards for Human Evaluation
Evaluation standards were based on MQM standards, in particular the major and critical severity levels, which are defined below (Lommel et al., 2014):
• Major severity error: severity level of an error that seriously affects the understandability, reliability, or usability of the content for its intended purpose, or hinders the proper use of the product or service, due to a significant loss or change in meaning or because the error appears in a highly visible or important part of the content.
• Critical severity level: severity level of an error that renders the entire content unfit for purpose or poses the risk for serious physical, financial, or reputational harm.
We slightly adapted this for idioms: if it was possible to infer the meaning of a sentence through the existence of a similar English idiom, but the result would not be something generally said by English speakers, we assigned a major severity. Because ∆LM often makes errors with numbers and named entities, we assigned these errors a major rather than critical severity in most cases, although they would generally be critical severity errors in business documents. If part of a sentence was missing, we based the error severity on whether or not the missing portion was crucial to the intent of the sentence. Examples of sentences labelled with error severity are provided in Table 6.

E Statistical Significance Testing

For each language and test set, we examine the null hypothesis that the BLEU scores of the two systems actually have the same distribution, and that the difference occurred by chance.
For each source sentence, we shuffled the translations produced by the two systems with probability 0.5, and then recalculated BLEU scores. This was repeated 1000 times, and the number of times the shuffled difference was greater than or equal to the observed difference was recorded to find the p-value.
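This procedure can be sketched as follows; note that this simplified version permutes per-sentence score differences, whereas the paper shuffles system outputs and recomputes corpus-level BLEU:

```python
import random

def permutation_test(scores_a, scores_b, n_perm=1000, seed=0):
    """One-tailed paired permutation test on per-sentence scores (a sketch).

    Under the null hypothesis, each sentence's scores are exchangeable
    between systems, so we flip each paired difference with probability
    0.5 and count how often the permuted total meets or exceeds the
    observed total.
    """
    rng = random.Random(seed)
    observed = sum(a - b for a, b in zip(scores_a, scores_b))
    count = 0
    for _ in range(n_perm):
        shuffled = sum((a - b) if rng.random() < 0.5 else (b - a)
                       for a, b in zip(scores_a, scores_b))
        if shuffled >= observed:
            count += 1
    return count / n_perm

# System A consistently better: the observed gap is rarely matched by chance.
print(permutation_test([1.0] * 20, [0.0] * 20))
```

Using corpus-level BLEU rather than per-sentence sums changes the statistic being permuted but not the overall logic of the test.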

F Statistical Significance Results
Statistical significance results are shown in Table 7.

G Effect of Frequency on Commercial Translations
The effect of idiom frequency on commercial models' translations is in Figure 5 through Figure 9.

H Effect of Frequency on ∆LM
The effect of frequency on ∆LM for Finnish and Japanese is shown in Figure 10 and Figure 11.

I Automatic Metrics in Detail

Exact results for automatic metrics are shown in Table 8.

Figure 1 :
Figure 1: Accuracy of a transformer in translating a non-compositional phrase after training on datasets of different sizes, with different numbers of non-compositional patterns (only non-compositional translation accuracy is depicted). Results are averaged across 5 seeds, and standard deviation is shown.

Figure 3 :
Figure 3: Results of automatic metrics. In most cases, combining loss weighting with kNN-MT improves automatic metrics the most on all three test sets, including the out-of-distribution (Random) test set.

Table 1:
He has to dot all the i's, cross all the t's. He always has to look for the little beast. Examples of mistranslated sentences produced by commercial translation systems. Idioms and their corresponding translations are highlighted in red.

arXiv:2310.07081v1 [cs.CL] 10 Oct 2023

Table 3 :
Performance of commercial systems on idiomatic, literal, and random test sets. There is a clear degradation in performance on idiomatic sentences.

Table 5 :
Rate of major and severe errors in translations.


Table 6 :
Examples of categorized errors made by ∆LM.

Table 7 :
p-values obtained using approximate randomization on translations produced by the base model and the upweight+knn model.