The Devil is in the Details: On the Pitfalls of Vocabulary Selection in Neural Machine Translation

Vocabulary selection, or lexical shortlisting, is a well-known technique to improve latency of Neural Machine Translation models by constraining the set of allowed output words during inference. The chosen set is typically determined by separately trained alignment model parameters, independent of the source-sentence context at inference time. While vocabulary selection appears competitive with respect to automatic quality metrics in prior work, we show that it can fail to select the right set of output words, particularly for semantically non-compositional linguistic phenomena such as idiomatic expressions, leading to reduced translation quality as perceived by humans. Trading off latency for quality by increasing the size of the allowed set is often not an option in real-world scenarios. We propose a model of vocabulary selection, integrated into the neural translation model, that predicts the set of allowed output words from contextualized encoder representations. This restores translation quality of an unconstrained system, as measured by human evaluations on WMT newstest2020 and idiomatic expressions, at an inference latency competitive with alignment-based selection using aggressive thresholds, thereby removing the dependency on separately trained alignment models.


Introduction
Neural Machine Translation (NMT) has achieved great improvements in translation quality, largely thanks to the introduction of Transformer models (Vaswani et al., 2017). However, increasingly large models (Aharoni et al., 2019; Arivazhagan et al., 2019) lead to prohibitively slow inference when deployed in industrial settings. Especially for real-time applications, low latency is key. A number of inference optimization speedups have been proposed and are used in practice: reduced precision (Aji and Heafield, 2020), replacing self-attention with Average Attention Networks (AANs), Simpler Simple Recurrent Units (SSRUs) (Kim et al., 2019), or model pruning (Behnke and Heafield, 2020; Behnke et al., 2021).

Figure 1: Examples of subword-segmented idiomatic expressions (EN) and their German correspondences (DE) as well as an English gloss (GL) of the German expression. Alignment-based vocabulary selection: output tokens missing from the allowed set of top-k output tokens are marked in orange/bold (red/italic) for k=200 (k=1000).
Another technique that is very common in practice is vocabulary selection (Jean et al., 2015), which usually provides a good tradeoff between latency and automatic metric scores: the reduced inference cost is often preferred over a loss of ∼0.1 BLEU. Vocabulary selection is effective because latency is dominated by expensive, repeated decoder steps, where the final projection to the output vocabulary contributes a large portion of the time spent. Despite high parallelization on GPUs, vocabulary selection remains relevant for GPU inference with state-of-the-art models.
However, we show that standard methods of vocabulary selection based on alignment model dictionaries lead to quality degradations not sufficiently captured by automatic metrics such as BLEU. We demonstrate that this is particularly true for semantically non-compositional linguistic phenomena such as idiomatic expressions, and for aggressive vocabulary selection thresholds. For example, Figure 1 shows alignment-model based vocabulary selection failing to include tokens crucial for translating idiomatic expressions in the set of allowed output words. While less aggressive thresholds can reduce the observed quality issues, they also reduce the desired latency benefit. In this paper we propose a neural vocabulary selection model that is jointly trained with the translation model and achieves translation quality at the level of an unconstrained baseline with latency at the level of an aggressively thresholded alignment-based vocabulary selection model.
Our contributions are as follows:
• We demonstrate that alignment-based vocabulary selection is limited not by alignment model quality, but inherently by making target word predictions out of context (§2).
• We propose a Neural Vocabulary Selection (NVS) model based on contextualized deep encoder representations (§3).
• We show that alignment-based vocabulary selection leads to human-perceived translation quality drops not sufficiently captured by automatic metrics, and that our proposed model can match an unconstrained model's quality while keeping the latency benefits of vocabulary selection (§4).

Pitfalls of vocabulary selection
We first describe vocabulary selection and then analyze its shortcomings. Throughout the paper, we use the recall of unique target sentence tokens as a proxy for measuring vocabulary selection quality, i.e. the reachability of the optimal translation. We use the average vocabulary size in inference decoder steps across sentences as a proxy for translation latency since it directly impacts decoding speed (Kasai et al., 2020).

Vocabulary selection
Vocabulary selection (Jean et al., 2015), also known as lexical shortlisting or candidate selection, is a common technique for speeding up inference in sequence-to-sequence models, where the repeated computation of the softmax over the output vocabulary 𝒱 of size V incurs a high computational cost in the next-word prediction at inference time:

p(y_t | y_{1:t−1}, x; θ) = softmax(W h_t + b),  (1)

where W ∈ R^{V×d}, b ∈ R^V, and h_t ∈ R^d, with d denoting the hidden size of the network. Vocabulary selection chooses a subset 𝒱̂ ⊂ 𝒱, with V̂ ≪ V, to reduce the size of the matrix multiplication in Equation (1) such that

p(y_t | y_{1:t−1}, x; θ) = softmax(Ŵ h_t + b̂),  (2)

where Ŵ ∈ R^{V̂×d} and b̂ ∈ R^{V̂}. The subset 𝒱̂ is typically chosen as the union of the top-k target word translations for each source token, according to the word translation probabilities of a separately trained word alignment model (Jean et al., 2015; Shi and Knight, 2017). Decoding with vocabulary selection usually yields similar scores according to automatic metrics, such as BLEU (Papineni et al., 2002), compared to unrestricted decoding, but at reduced latency (L'Hostis et al., 2016; Mi et al., 2016; Sankaran et al., 2017; Junczys-Dowmunt et al., 2018). In the following, we show that despite its generally solid performance, vocabulary selection based on word alignment models negatively affects translation quality in ways not captured by standard automatic metrics. We use models trained on WMT20 (Barrault et al., 2020a) data for all evaluations in this section; see Section 4.1 for details.
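The mechanics of Equations (1) and (2) can be sketched as follows. This is a minimal NumPy illustration, not the actual toolkit implementation; the function names are ours:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    e = np.exp(x - x.max())
    return e / e.sum()

def decoder_step_full(h, W, b):
    """Full output projection, Equation (1): cost O(V * d) per decoder step."""
    return softmax(W @ h + b)

def decoder_step_shortlisted(h, W, b, shortlist):
    """Restricted projection, Equation (2): rows of W and b are gathered for
    the shortlist only, shrinking the matrix multiplication to O(|shortlist| * d)."""
    return softmax(W[shortlist] @ h + b[shortlist])
```

With V around 32k BPE types and a shortlist of a few hundred tokens, the per-step projection shrinks by roughly two orders of magnitude, which is where the latency savings come from.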

Alignment model quality
In practice, the chosen subset of allowed output words is often determined by an alignment model, such as fast_align (Dyer et al., 2013), which provides a trade-off between the speed of alignment model training and the quality of alignments (Jean et al., 2015; Junczys-Dowmunt et al., 2018). fast_align's reparametrization of IBM Model 2 (Brown et al., 1993) places a strong prior on alignments along the diagonal. We investigate whether more sophisticated alignment models can lead to better vocabulary selection, especially for language pairs with a high amount of reordering. To evaluate this, we compute the recall of translation model and reference tokens using GIZA++ (Och and Ney, 2003) and MaskAlign, as seen in Table 1. We extract top-k word translation tables (from fast_align, GIZA++, and MaskAlign) by force-aligning the training data. Overall, GIZA++ achieves the best recall, though it is only slightly better than fast_align. MaskAlign, a state-of-the-art neural alignment model, underperforms fast_align with respect to recall. While MaskAlign's performance may improve with careful tuning of its hyperparameters via gold alignments, we choose fast_align as a strong, simple baseline for vocabulary selection in the following.

Out-of-context word selection
Alignment-based vocabulary selection does not take source sentence context into account. A top-k list of translation candidates for a source word will likely cover multiple senses of common words, but may be too limited when a translation is highly dependent on the source context. Here we consider idiomatic expressions as a linguistic phenomenon that is highly context-dependent due to its semantically non-compositional nature. Table 2 compares the recall of tokens in the reference translation when querying the translation lexicon of the alignment model for two different top-k settings. Recall is computed as the percentage of unique tokens in the reference translation that appear in the top-k lexicon, or more generally, in the set of predicted tokens according to a vocabulary selection model. We evaluate two scopes for test sets of idiomatic expressions: the full source and target sentence vs. the source and target idiomatic multi-word expressions according to metadata. The Idioms test set is an internal set of 100 English idioms in context and their human translations. ITDS is the IdiomTranslationDS data released by Fadaee et al. (2018) with 1500 test sentences containing English and German idiomatic expressions for evaluation into and out of German, respectively. The results show that recall increases with k but is consistently lower for the idiomatic expressions than for full sentences. Clearly, the idiom translations contain tokens that are on average less common than the translations of "regular" inputs. As a consequence, increasing the output vocabulary is less effective for idiom translations, with recall lagging behind by up to 9.3%. This can directly affect translation quality because the NMT model will not be able to produce idiomatic translations given an overly restrictive output vocabulary. Table 3 shows a similar comparison, but here we evaluate full literal translations vs. full idiomatic translations on a data set of English proverbs from Wikiquote. For EN-DE, we extracted 94 triples of an English sentence and two references; for EN-RU we extracted 262 triples. Although in both cases recall can be improved by increasing k, it helps considerably less for idiomatic than for literal translations. Figure 1 shows examples of idiomatic expressions from the ITDS set and the output tokens belonging to an idiomatic translation that are missing from the respective lexicon used for vocabulary selection. While for some of the examples increasing the lexicon size solves the problem, for others the idiomatic translation still cannot be generated because of missing output tokens. These results demonstrate that there is room for improvement in vocabulary selection approaches when it comes to non-literal translations.
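The recall metric used throughout this section, and the union-of-top-k construction of the allowed set, can be sketched as follows. This is a toy illustration with hypothetical lexicon entries, not the paper's evaluation code:

```python
def topk_allowed_set(source_tokens, lexicon, k):
    """Alignment-based vocabulary selection: union of the top-k translation
    candidates of every source token, looked up in a precomputed lexicon
    (candidates assumed sorted by translation probability)."""
    allowed = set()
    for token in source_tokens:
        allowed.update(lexicon.get(token, [])[:k])
    return allowed

def selection_recall(reference_tokens, allowed_set):
    """Percentage of unique reference tokens contained in the allowed set,
    i.e. a proxy for the reachability of the reference translation."""
    ref = set(reference_tokens)
    return 100.0 * len(ref & set(allowed_set)) / len(ref)
```

An idiomatic target token that never ranks among the top-k candidates of any source token can never enter the allowed set, no matter how the decoder scores it.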

Domain mismatch in adaptation settings
Using a word alignment model to constrain the NMT output vocabulary means that this model should ideally also be adapted when adapting the NMT model to a new domain. Table 4 shows that adapting the word alignment model with relevant in-domain data (in this case, idiomatic expressions in context) yields strong recall improvements for vocabulary selection. Compared to increasing the per-source-word vocabulary as shown in Table 2, the improvement in recall for idiom tokens is larger, which highlights the importance of having a vocabulary selection model that matches the domain of the NMT model. This also corroborates prior findings that vocabulary selection can be harmful in domain-mismatched scenarios.
We argue that integrating vocabulary prediction into the NMT model avoids the need for mitigating domain mismatch because domain adaptation will update both parts of the model. This simplifies domain adaptation since it only needs to be done once for a single model and does not require adaptation or re-training of a separate alignment model.

Summary
We use target recall as a measure of selection model quality. We see that alignment model quality has only a limited impact on target token recall, with more recent models actually having lower recall overall. In domain adaptation scenarios, vocabulary selection limits translation quality if the selection model is not adapted. The main challenge for alignment-based vocabulary selection comes from its out-of-context selection of target tokens on a token-by-token basis, shown to reduce recall for translation of idiomatic, non-literal expressions. Increasing the size of the allowed set can compensate for this shortcoming at the cost of latency. This raises the question of whether context-sensitive selection of target tokens can achieve higher recall without increasing vocabulary size.

Neural Vocabulary Selection (NVS)
We incorporate vocabulary selection directly into the neural translation model, instead of relying on a separate statistical model based on token translation probabilities. This enables predictions based on contextualized representations of the full source sentence. It further simplifies the training procedure and domain adaptation, as we do not require a separate training procedure for an alignment model.
The goal of our approach is three-fold. We aim to (1) keep the general Transformer (Vaswani et al., 2017) translation model architecture, (2) incur only a minimal latency overhead that amortizes through cheaper decoder steps due to smaller output vocabularies, and (3) scale well to sentences of different lengths. Figure 2 shows the Neural Vocabulary Selection (NVS) model. We base the prediction of output tokens on the contextualized hidden representation produced by the Transformer encoder, H ∈ R^{d×t}, for t source tokens and a hidden size of d. The t source tokens comprise t − 1 input tokens and a special <EOS> token. To obtain the set of target tokens, we first project each source position to the target vocabulary size V, apply max-pooling across tokens (Shen et al., 2018), and finally use the sigmoid function, σ(·), to obtain

z = σ(maxpool(W H + b)),  (3)

where W ∈ R^{V×d}, b ∈ R^V, and z ∈ R^V. The max-pooling operation takes the per-dimension maximum across the source tokens, going from R^{V×t} to R^V. Each dimension of z indicates the probability of a given target token being present in the output given the source. To obtain the target bag-of-words (BOW), we select all tokens where z_i > λ, as indicated on the right-hand side of Figure 2, where λ is a free parameter that controls the size V̂ of the reduced vocabulary 𝒱̂. At inference time, the output projection and softmax at every decoder step are computed over the predicted BOW of size V̂ only. We achieve goal (1) by basing predictions on the encoder representation already used by the decoder. Goal (2) is accomplished by restricting NVS to a single layer and basing the prediction on the encoder output, where we can parallelize computation across source tokens; inference latency is dominated by non-parallelizable decoder steps (Kasai et al., 2020). By projecting to the target vocabulary per source token, each source token can "vote" on a set of target tokens.
The model automatically scales to longer sentences via the max-pooling operation, which acts as a union of per-token choices, fulfilling goal (3). Max-pooling does not tie the predictions across timesteps as they would be with mean-pooling, which would also depend on sentence length. Additionally, we factor in a sentence-level target token prediction based on the <EOS> token. The probability of a target word being present is represented by the source position with the highest evidence, backing off to a base probability of a given word via the bias vector b.
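The NVS forward pass amounts to a single projection, a max-pool over source positions, and a threshold. A minimal NumPy sketch (our own naming, with the encoder states stored as a (t, d) matrix rather than (d, t); not the Sockeye implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def nvs_select(H, W, b, lam):
    """H: (t, d) encoder states, W: (V, d), b: (V,).
    Returns the indices of the predicted target bag-of-words."""
    scores = H @ W.T + b           # (t, V): each source position "votes" on every target token
    pooled = scores.max(axis=0)    # max-pooling across source positions -> (V,)
    z = sigmoid(pooled)            # per-token presence probabilities
    return np.nonzero(z > lam)[0]  # select all tokens with z_i > lambda
```

Because the projection is shared across positions and computed in one matrix multiplication, the whole selection step parallelizes over the source sentence and adds negligible cost next to the decoder loop.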
To learn the V × d + V parameters of NVS, we use a binary cross-entropy loss with the binary ground-truth vector y ∈ {0, 1}^V, where each entry y_i indicates the presence or absence of target token i. We define the loss as

L_NVS = − Σ_{i=1}^{V} [ λ_p y_i log z_i + (1 − y_i) log(1 − z_i) ],

where λ_p is a weight for the positive class that counteracts the imbalance between the few tokens present in a target sentence and the many absent ones (see Appendix B).
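The weighted binary cross-entropy above can be sketched as follows, hedged as a plain NumPy version that clips probabilities instead of using the logit-space tricks a real implementation would prefer:

```python
import numpy as np

def nvs_loss(z, y, pos_weight):
    """Binary cross-entropy over the vocabulary with positive-class weight
    lambda_p; y is a {0, 1} vector marking the tokens present in the target."""
    eps = 1e-12                      # clip to avoid log(0)
    z = np.clip(z, eps, 1.0 - eps)
    return -np.sum(pos_weight * y * np.log(z) + (1.0 - y) * np.log(1.0 - z))
```

Raising `pos_weight` pushes the model towards recall: missing a present token is penalized far more heavily than admitting an absent one, which is the desired behavior when the BOW gates the decoder's reachable outputs.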

Setup
Our training setup is guided by best practices for efficient NMT to provide a strong low-latency baseline: a deep Transformer encoder with a lightweight recurrent unit in the shallow decoder (Kim et al., 2019), int8 quantization for CPU and half-precision GPU inference. We use the constrained data setting from WMT20 (Barrault et al., 2020b) with four language pairs (English-German, German-English, English-Russian, Russian-English) and apply corpus cleaning heuristics based on sentence length and language identification. We tokenize with sacremoses and byte-pair encode (Sennrich et al., 2016) the data with 32k merge operations.

Table 5: Experimental results for unconstrained decoding (baseline), alignment-based VS with different k, and Neural Vocabulary Selection with varying λ. BLEU and COMET: mean and std of three runs with different random seeds. Human evaluation: source-based direct assessment renormalized so that the unconstrained baseline is at 100%, with 95% CI of a paired t-test. We ran two sets of human evaluations comparing 4 systems each. CPU/GPU: p90 latency in ms with 95% CI based on 30 runs with batch size 1, shown as a percentage of the baseline.
All models are Transformers (Vaswani et al., 2017) trained with the Sockeye 2 toolkit (Domhan et al., 2020). We release the NVS code as part of the Sockeye toolkit. We use a 20-layer encoder and a 2-layer decoder with self-attention replaced by SSRUs (Kim et al., 2019).
NVS and NMT objectives are optimized jointly, but gradients of the NVS objective are blocked before the encoder. This allows us to compare the different vocabulary selection techniques on the same translation model, which is unaffected by the choice of vocabulary selection. All vocabulary selection methods operate at the BPE level. We use the translation dictionaries from fast_align for alignment-based vocabulary selection with k = 200 and k = 1000; smaller k would lead to stronger quality degradations at lower latency. GPU and CPU latency is evaluated at the single-sentence translation level to match real-time translation use cases where latency is critical. We evaluate translation quality using SacreBLEU (Post, 2018) and COMET (Rei et al., 2020). Furthermore, we conduct human evaluations with two annotators on the subsets of newstest2020 and ITDS test sentences where outputs differ between NVS λ = 0.9 (0.99) and align k = 200. Professional bilingual annotators rate outputs of four systems concurrently in absolute numbers with increments of 0.2 from 1 (worst) to 6 (best). Ratings are normalized so that the (unconstrained) baseline is at 100%. Complementary details on the training setup, vocabulary selection model size, and the human and latency evaluation setups can be found in Appendix A.

Table 5 shows results of different vocabulary selection models on newstest2020 and the ITDS idiom set, compared to an unconstrained baseline without vocabulary selection. Automatic evaluation metrics show only very small differences between models. For three out of four language pairs, the alignment model with k = 200 performs slightly worse than the unconstrained baseline (0.2-0.3 BLEU). This corroborates existing work showing that quality as measured by automatic metrics is not significantly affected by alignment-based vocabulary selection (Jean et al., 2015; Shi and Knight, 2017; Kim et al., 2019).

Results
However, human-perceived quality of alignment-based vocabulary selection with k = 200 is consistently lower than the baseline. COMET, found to correlate better with human judgements than BLEU (Kocmi et al., 2021), reflects this drop in only two out of the four language pairs, considering confidence intervals across random seeds. Increasing k to 1000 closes the quality gap with respect to human ratings, taking the confidence intervals into account. The same is true for vocabulary selection using NVS at both λ = 0.9 and λ = 0.99, where quality is also within the confidence intervals of the unconstrained baseline. However, NVS is consistently faster than the alignment-based model: for λ = 0.9 we see CPU latency improvements of 95 ms on average across language arcs, and increasing the threshold to λ = 0.99 reduces latency by 157 ms on average compared to k = 1000. The same trend holds for GPU latency, but with smaller differences. Figure 3 compares the NVS model against the alignment model according to the speed/quality tradeoff, reflected by average vocabulary size vs. reference token recall on newstest2020. NVS consistently outperforms the alignment model, especially for small average vocabulary sizes, where NVS achieves substantially higher recall. This demonstrates that the reduced vocabulary size, and therefore faster decoder steps, can amortize the cost of running the lightweight NVS model, which is fully parallelized across source tokens as part of the encoder.
To evaluate a domain adaptation setting, we fine-tune the NVS models on a set of 300 held-out sentences of idioms in sentence context for 10 epochs. For a fair comparison, we also include the same data for the alignment-based vocabulary selection. Figure 4 shows that NVS is Pareto-optimal relative to the alignment model, both with and without domain adaptation to a small internal training set of idiomatic expressions in context. This highlights the advantage of NVS, which is automatically updated during domain fine-tuning as it is part of a single model. See Appendix C for additional figures on the proverbs and ITDS test sets, where the same trend holds.

Analysis
Our proposed neural vocabulary selection model benefits from contextual target word prediction. We demonstrate this by comparing the predicted BOW when using the source sentence context versus predicting BOWs individually for each input word (which may consist of multiple subwords) and taking the union of the individual bags. For this analysis we use the NVS models adapted to a set of idiomatic expressions, to ensure that the unconstrained baseline models produce reasonable translations for the Idioms test set.

Table 6: Percentage of segments with all idiomatic reference tokens included in the BOW (All), or exclusively included in the contextual or non-contextual BOW (All excl), for NVS threshold λ.

Table 6 shows the percentage of segments for which all reference tokens are included in the contextual vs. the non-contextual BOW for an acceptance threshold of 0.9 and 0.99. Independent of the threshold, predicting the BOW using source context yields a significantly larger overlap with idiomatic reference tokens. We also measure the extent to which idiomatic reference tokens are included exclusively in the contextual or non-contextual BOW. For 32% of EN-DE segments, only the contextual BOW contains all idiomatic reference tokens; for non-contextual BOWs, this happens in only 1% of the segments (with λ=0.9). For EN-RU, the values are 38% versus 6%, respectively. This shows that the model makes extensive use of contextualized source representations in predicting the relevant output tokens for idiomatic expressions. Figure 5 shows a few illustrative examples where the idiomatic reference is only reachable with the contextual BOW prediction. Consider the last example, containing the English idiom "to wrap one's head around it". Even though the phrase is rather common in English, the German translation "verstehen" (to understand) would not be expected to rank high for any of the idiom source tokens. Evaluating the tokens in context, however, yields the correct prediction.

Related work
There are two dominant approaches to generate a restricted set of target word candidates (i) using an external model and (ii) using the NMT system itself.
In the first approach, a short-list of translation candidates is generated from word alignments (Jean et al., 2015; Kim et al., 2019), phrase tables, and the most common target words (Mi et al., 2016). L'Hostis et al. (2016) propose an additional method using support vector machines to predict target candidates from a sparse representation of the source sentence.
In the second approach, Sankaran et al. (2017) build an alignment probability table from the soft-attention layer from decoder to encoder. However, applying their method to multi-head attention in the Transformer is non-trivial, as attention may not capture word alignments across multiple attention layers. Shi and Knight (2017) use locality-sensitive hashing to shrink the target vocabulary during decoding, though their approach only reduces latency on CPUs, not GPUs. Other work reduces the softmax computation by first predicting a cluster of target words and then performing exact search (i.e., softmax) on that cluster, with the clustering trained jointly with the translation model.

Figure 5: Test inputs from the internal Idioms test set for which the highlighted tokens in the idiomatic reference are exclusively included in the contextual BOW (computed for the idiom-adapted NVS model with λ = 0.9).

Source | Idiomatic target
I thought I would be nervous, but I was cool as a cucumber. | die Ruhe selbst
He decides that it is better to face the music, opting to stay and confess. | sich den Dingen stellen
The Classic Car Show is held in conjunction with Old Settler's Day, rain or shine. | bei jedem Wetter
Tools, discipline, formal methods, process, and professionalism were touted as silver bullets. | Wunderwaffe
They said he was 'a little bit under the weather'. | sich nicht wohl fühlen
I still can't wrap my head around it. | verstehen
Closely related to our work is Weng et al. (2017), who predict all words in a target sentence from the initial hidden state of the decoder. Our NVS model differs from theirs in that we make a prediction for each source token and aggregate the results via max-pooling to scale with sentence length. Recent work illustrates the risk associated with reducing latency via vocabulary selection in domain-mismatched settings. Our work takes this a step further by providing a detailed analysis of the shortcomings of vocabulary selection and proposing a model to mitigate them.
Related to our findings on non-compositional expressions, Renduchintala et al. (2021) evaluate the effect of methods used to speed up decoding in Transformer models on gender bias and find minimal BLEU degradations but reduced gendered noun translation performance on a targeted test set.

Conclusions
Alignment-based vocabulary selection is a common method to heavily constrain the set of allowed output words in decoding for reduced latency with only minor BLEU degradations. We showed with human evaluations and a targeted qualitative analysis that such translations are perceivably worse. Even recent automatic metrics based on pre-trained neural networks, such as COMET, are only able to capture the observed quality degradations in two out of four language pairs. Human-perceived quality is negatively affected both for generic translations, represented by newstest2020, as well as for idiomatic translations. Increasing the vocabulary selection threshold can alleviate the quality issues at an increased single sentence translation latency. To preserve both translation latency and quality we proposed a neural vocabulary selection model that is directly integrated into the translation model. Such a joint model further simplifies the training pipeline, removing the dependency on a separate alignment model. Our model has higher reference token recall at similar vocabulary sizes, translating into higher quality at similar latency.

A Reproducibility Details
Data We use the constrained data setting from WMT20 (Barrault et al., 2020b) with four language pairs: English-German, German-English, English-Russian, and Russian-English. Noisy sentence pairs are removed based on heuristics: sentences with a length ratio > 1.5, > 70% token overlap, or > 100 BPE tokens, and those where the source or target language does not match according to LangID (Lui and Baldwin, 2012), are filtered.
Model We train pre-norm Transformer (Vaswani et al., 2017) models with an embedding dimension of 1024 and a hidden dimension of 4096.

Model | Size
align k=200 | 6,590,600
align k=1000 | 32,953,000
NVS | 33,776,825

Table 7 compares the memory consumption of the different vocabulary selection models in terms of the number of float values. We see that the NVS model requires a similar number of floating-point values as the alignment-based model at k = 1000. Note that this only represents the disk space requirement, as other intermediate outputs would be required at runtime for either vocabulary selection model.
Training The NMT objective uses label smoothing with constant 0.1, the NVS objective sets the positive class weight λ p to 100,000. Models train on 8 Nvidia Tesla V100 GPUs on AWS p3.16xlarge instances with an effective batch size of 50,000 target tokens accumulated over 40 batches. We train for 70k updates with the Adam (Kingma and Ba, 2015) optimizer, using an initial learning rate of 0.06325 and linear warmup over 4000 steps. Checkpoints are saved every 500 updates and we average the weights of the 8 best checkpoints according to validation perplexity.
Inference For GPU latency, we run in half-precision mode (FP16) on AWS g4dn.xlarge instances. CPU benchmarks are run with INT8-quantized models on AWS c5.2xlarge instances. We decode using beam search with a beam size of 5. Each test set is decoded 30 times on different hosts, and we report the mean p90 latency with its 95% confidence interval. Alignment-based vocabulary selection includes the top k most frequently aligned BPE tokens for each source token, based on a fast_align model trained on the same data as the translation model. NVS includes all tokens scored above the threshold λ. All vocabulary selection methods operate at the BPE level.
Evaluation Human evaluations and COMET/BLEU use full-precision (FP32) inference outputs. We decided to use FP32 for human evaluation as we wanted to evaluate the quality of the underlying model independently of whether it is used on CPU or GPU, and the output differences between FP16/FP32/INT8 are small. We report the mean and standard deviation of SacreBLEU (Post, 2018) and COMET (Rei et al., 2020) scores on detokenized outputs for three runs with different random seeds. For human evaluations, bilingual annotators see a source segment and the output of a set of 4 systems at once when assigning an absolute score to each output. The size of the evaluation set was 350 for EN-DE and EN-RU and 200 for DE-EN and RU-EN for newstest2020. We used the full sets of sentences differing between NVS λ = 0.9 and align k = 200 for the ITDS test set (309 for EN-DE and 273 for DE-EN).
Adaptation For domain adaptation, we fine-tune the NVS model for 10 epochs using a learning rate of 0.0001 and a batch size of 2048 target tokens. To adapt the alignment-based vocabulary selection model, we include the adaptation data as part of the training data for the alignment model. We upsample the adaptation data by a factor of 10 for a comparable setting with NVS fine-tuning.

B Positive class weight ablation
Based on preliminary experiments, we used a weight of 100k for the positive class (λ_p) in the experiments in §4. Here the positive class refers to tokens being present on the target side and the negative class to tokens being absent from the target side. In a machine translation setting, many more words are absent from the target side than are present, so the negative class dominates the positive class. This can be counteracted by using a large value for the positive weight λ_p. Instead of setting λ_p to a fixed weight, one can also define it as λ_p = x · n_n / n_p, with n_p the number of unique target words, n_n = V − n_p the number of remaining words, and x a factor to increase the bias towards recall. For x = 1, the positive and negative classes are weighted equally.

Table 8: Translation quality in terms of BLEU and COMET on newstest2020 with different weights for the positive class. auto x refers to setting the weight according to the ratio of the negative class to the positive class with a factor x.

Table 8 shows the result of different positive weights, including the automatic setting according to the ratio (auto). We see that not increasing the weight of the positive class results in large quality drops. For positive weights > 1000 the quality differences are small. The auto setting provides an alternative that is easier to set than finding a fixed positive weight.
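The automatic weight follows directly from the vocabulary size and the number of unique target tokens; a one-line sketch of the formula above (illustrative naming, not the training code):

```python
def auto_pos_weight(n_positive, vocab_size, x=1.0):
    """lambda_p = x * n_n / n_p: for x = 1 the positive class (tokens present
    in the target) and the negative class (tokens absent) are weighted equally."""
    n_negative = vocab_size - n_positive
    return x * n_negative / n_positive
```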
C Additional vocabulary size vs. recall plots

Figures 6 and 7 provide results for the proverbs and ITDS test sets, respectively. We see the same trend across all test sets: NVS offers higher recall at the same vocabulary size compared to alignment-based vocabulary selection. For the proverbs test set this is true both for the literal and the idiomatic translations.