Uncertainty Determines the Adequacy of the Mode and the Tractability of Decoding in Sequence-to-Sequence Models

In many natural language processing (NLP) tasks the same input (e.g. source sentence) can have multiple possible outputs (e.g. translations). To analyze how this ambiguity (also known as intrinsic uncertainty) shapes the distribution learned by neural sequence models we measure sentence-level uncertainty by computing the degree of overlap between references in multi-reference test sets from two different NLP tasks: machine translation (MT) and grammatical error correction (GEC). At both the sentence and the task level, intrinsic uncertainty has major implications for various aspects of search such as the inductive biases in beam search and the complexity of exact search. In particular, we show that well-known pathologies such as a high number of beam search errors, the inadequacy of the mode, and the drop in system performance with large beam sizes apply to tasks with a high level of ambiguity such as MT but not to less uncertain tasks such as GEC. Furthermore, we propose a novel exact n-best search algorithm for neural sequence models, and show that intrinsic uncertainty affects model uncertainty as the model tends to overly spread out the probability mass for uncertain tasks and sentences.


Introduction
With the advent of deep learning, many applications of machine learning have converged on a similar set of methods and models. For example, the Transformer (Vaswani et al., 2017) sequence-to-sequence architecture is ubiquitous in various fields of natural language processing (NLP) such as machine translation (MT), grammatical error correction (GEC), and speech recognition (Karita et al., 2019), and has also been applied successfully to other tasks such as computer vision (Dosovitskiy et al., 2021). Recent large pre-trained NLP models such as BERT (Devlin et al., 2019), GPT-3 (Brown et al., 2020), T5 (Raffel et al., 2020), RoBERTa, and XLNet are all based on the Transformer, with relatively minor changes to the architecture itself.
* Research done during internship at Google Research, now at Meta AI.
We show that despite this architectural uniformity the learned distribution over sequences has strikingly different characteristics for different NLP tasks. Inspired by Ott et al. (2018) we identify intrinsic uncertainty (the nature of some NLP tasks to allow multiple viable outputs for a given input) to be a major factor that shapes the search space of Transformer models and determines its tractability. In machine translation (MT), a task known to have high intrinsic uncertainty (Padó et al., 2009; Dreyer and Marcu, 2012; Ott et al., 2018), Transformer models suffer from a high number of beam search errors (Stahlberg and Byrne, 2019), an inadequacy of the mode (Eikema and Aziz, 2020), and translation performance degradation with large beam sizes (Koehn and Knowles, 2017), also known as the "beam search curse". In contrast, for the correction of writing errors in text (grammatical error correction, GEC) (Brockett et al., 2006), a task with a lower level of uncertainty (Bryant and Ng, 2015), none of these pathologies are evident. This pattern holds even at the sequence level: input sentences with high uncertainty tend to result in more search errors and a less tractable search space. To study the influence of uncertainty on sequences around the mode, we propose an exact n-best search algorithm for neural sequence models. We show that the probability mass covered by the n-best candidates differs markedly between certain and uncertain tasks and sentences, which shows that intrinsic uncertainty also affects the spread of probability mass and thus the model uncertainty. We confirm recent work showing that beam search has drawbacks as a decoding scheme for MT. Nevertheless, it is effective for GEC, a problem where modes are adequate, search errors are rare, and the n-best lists cover a large fraction of the probability mass.

Measuring Intrinsic Uncertainty
Intrinsic uncertainty refers to the inherent nature of some NLP tasks to allow for more than one feasible output for a given input. For example, intrinsic uncertainty in MT stems from the fact that there are often several semantically equivalent translations for the same source sentence, or that the translation into a highly inflected language is sometimes underspecified (Ott et al., 2018). Studies have shown that even for tasks like GEC, annotators do not always agree (Tetreault and Chodorow, 2008; Rozovskaya and Roth, 2010; Bryant and Ng, 2015), but the level of intrinsic uncertainty is arguably lower than for MT because there is a limited number of ways to correct an ungrammatical sentence. We propose a simple way to measure sentence-level output uncertainty by making use of multi-reference test sets. For an n-way annotated sentence with references y_1, ..., y_n we define the uncertainty u as the average relative edit distance between two references:
u := \frac{\frac{2}{n(n-1)} \sum_{i=1}^{n} \sum_{j=i+1}^{n} d_{\mathrm{edit}}(y_i, y_j)}{\frac{1}{n} \sum_{i=1}^{n} |y_i|} \qquad (1)
where the numerator is the average edit distance between references, the denominator is the average reference length, and d_edit(·, ·) denotes the Levenshtein distance. Fig. 1 presents this uncertainty score for one MT test set and two GEC test sets. MT-ende is the official WMT19 English-German test set (Barrault et al., 2019) paired with the additional human-annotated "newstest2019 AR" references provided by Freitag et al. (2020). GEC-conll14 uses the 10 references published by Bryant and Ng (2015) for the CoNLL-2014 shared task on GEC (Ng et al., 2014), and GEC-jfleg is a 4-reference GEC test set that represents "a broad range of language proficiency levels" (Napoles et al., 2017). Our uncertainty measure reflects our intuition that MT is a significantly more uncertain task than GEC. For both tasks the uncertainty increases with the sentence length as longer sentences typically have more feasible mappings than shorter ones. We use the edit distance rather than task-specific metrics like BLEU (Papineni et al., 2002) or BLEURT (Sellam et al., 2020) since those metrics are designed to be robust against uncertainty effects such as reordering or semantically equivalent references, precisely the kinds of effects we aim to capture with u. We follow Bryant and Ng (2015) by not using inter-annotator agreement statistics like Cohen's κ (Cohen, 1960) since they are more appropriate for the classification into single, well-defined categories.
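The uncertainty score can be illustrated with a short Python sketch. It assumes that "relative" means dividing the average pairwise Levenshtein distance by the average reference length, and that references are compared as whitespace-separated tokens; both choices, as well as the function names `levenshtein` and `uncertainty`, are assumptions made for illustration rather than details confirmed by the text.

```python
from itertools import combinations


def levenshtein(a, b):
    """Levenshtein distance between two sequences (lists of tokens or strings)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (x != y)))   # substitution or match
        prev = curr
    return prev[-1]


def uncertainty(references):
    """Average pairwise edit distance divided by the average reference length.

    Assumption: references are tokenized on whitespace.
    """
    refs = [r.split() for r in references]
    if len(refs) < 2:
        raise ValueError("need at least two references")
    pairs = list(combinations(refs, 2))
    avg_dist = sum(levenshtein(a, b) for a, b in pairs) / len(pairs)
    avg_len = sum(len(r) for r in refs) / len(refs)
    return avg_dist / avg_len if avg_len > 0 else 0.0
```

Under these assumptions, identical references give u = 0, and two three-token references that differ in one token give u = 1/3.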

Mode-seeking Search
Neural sequence-to-sequence models define a probability distribution P(y|x) over target sequences y given a source sequence x:
\log P(y|x) = \sum_{j=1}^{|y|} \log P(y_j | y_1^{j-1}, x). \qquad (2)
Sequences are typically computed over a subword (Sennrich et al., 2016; Kudo and Richardson, 2018) vocabulary V and end with a special end-of-sentence symbol </s>:
y \in \{ v \cdot \text{</s>} \mid v \in V^* \}, \qquad (3)
where V* is the Kleene closure over V, which includes the empty sequence ε. Since sequence models are usually trained to maximize the probability of the sequences in the training set, a common strategy to use such a model for inference is to search for the most likely output sequence y*, also known as the mode of the model distribution:
y^* = \arg\max_{y} P(y|x). \qquad (4)

N -best Search
In addition to our investigations into the mode we also examine the cumulative probability mass that is covered by the n best hypotheses. If a hypothesis set covers a large fraction of the entire probability mass it approximates the full model distribution well. Approximating the full model distribution is useful for various methods such as minimum risk training (Shen et al., 2016), reinforcement learning (Williams, 1992; Ranzato et al., 2015), and minimum Bayes risk decoding (Kumar and Byrne, 2004; Stahlberg et al., 2017; Eikema and Aziz, 2020). Ott et al. (2018) argued that the fraction of probability mass which is covered by a fixed number of candidates reflects the model uncertainty on the sequence level. We show that this model uncertainty is in line with our notion of intrinsic uncertainty that we measure with u (Sec. 2). To that end, we propose a generalization of the exact search algorithm of Stahlberg and Byrne (2019) that is able to find the n global best hypotheses rather than the single best one. Similarly to the single-best algorithm, we use the monotonicity of neural sequence model scores:
\forall j \in [2, |y|]: \log P(y_1^{j-1}|x) > \log P(y_1^{j}|x). \qquad (5)
Stahlberg and Byrne (2019) keep track of the best complete (i.e. ending with the end-of-sentence symbol </s>) hypothesis score during search, and use it to safely prune entire subspaces using Eq. 5. In contrast, we keep track of the n-th best complete hypothesis score by keeping the n best complete hypotheses in a priority queue. Our exact n-best search algorithm is listed in Algorithm 1. Note that we recover the DFS scheme of Stahlberg and Byrne (2019) with n = 1.
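The pruning idea described above can be sketched in Python as a depth-first search that keeps the n best complete hypotheses in a min-heap and prunes any prefix whose score is already no better than the n-th best complete score. This is a minimal sketch and not a reproduction of Algorithm 1: the toy model `toy_next_log_probs`, the `max_len` safety cap, and the choice to expand children in order of decreasing probability are all assumptions made for illustration.

```python
import heapq
import math

EOS = "</s>"


def toy_next_log_probs(prefix):
    """Hypothetical stand-in for a model: log P(token | prefix, x)."""
    p_eos = min(0.9, 0.3 + 0.2 * len(prefix))  # ending gets likelier with length
    rest = 1.0 - p_eos
    return {EOS: math.log(p_eos),
            "a": math.log(0.7 * rest),
            "b": math.log(0.3 * rest)}


def exact_nbest(next_log_probs, n, max_len=20):
    """Return (n best complete hypotheses, best first; number of explored states)."""
    heap = []      # min-heap of (score, hypothesis); heap[0] is the n-th best
    explored = 0

    def bound():
        # Pruning threshold: the n-th best complete score found so far.
        return heap[0][0] if len(heap) == n else -math.inf

    def dfs(prefix, score):
        nonlocal explored
        explored += 1  # one model call per explored state
        children = sorted(next_log_probs(prefix).items(), key=lambda kv: -kv[1])
        for token, lp in children:
            new_score = score + lp
            if new_score <= bound():
                break  # scores only decrease with length, so prune the rest
            if token == EOS:
                item = (new_score, prefix + (EOS,))
                if len(heap) < n:
                    heapq.heappush(heap, item)
                else:
                    heapq.heapreplace(heap, item)  # evict the old n-th best
            elif len(prefix) + 1 < max_len:  # assumed cap, not part of the method
                dfs(prefix + (token,), new_score)

    dfs((), 0.0)
    return sorted(heap, reverse=True), explored
```

With n = 1 the heap holds only the best complete hypothesis, which corresponds to the single-best case. In this toy model the mode is the empty output (probability 0.3), a small-scale instance of the inadequate modes discussed in the text.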

Experimental Setup
We trained four Transformer neural machine translation (NMT) models (Table 1). We selected these language pairs to experiment with different training set sizes (Table 2). The MT training sets were filtered using language ID and simple length-based heuristics, and split into subwords using joint 32K SentencePiece (Kudo and Richardson, 2018) models. For training our GEC model we used the hyper-parameters from Table 1 and followed the three-stage training recipe of Stahlberg and Kumar (2021) using the 32K SentencePiece model from Raffel et al. (2020). All our models were trained until convergence on the development set using the LAMB (You et al., 2020) optimizer in JAX (Bradbury et al., 2018). GEC performance is evaluated on the CoNLL-2014 test set (GEC-conll14) using the M2 scorer (Dahlmeier and Ng, 2012) and on the JFLEG test set (Napoles et al., 2017, GEC-jfleg) using GLEU (Napoles et al., 2015).

Results
In this work our focus is to analyze the impact of intrinsic uncertainty on search. Thus we keep our setup simple, reproducible, and computationally economical rather than aiming for new state-of-the-art results. Nevertheless, Tables 3 and 4 show that our baselines are not unreasonably far off from the best results in the literature given that the systems we compare with are often highly engineered and use many more parameters.
[...] (Holtzman et al., 2020) are by far the most common choices.
In this section we explore how uncertainty changes the mode and the ability of beam search to find it. A well-known pathology of NMT models is the "beam search curse" (Koehn and Knowles, 2017): Increasing the beam size improves the predictive log-probabilities of the hypotheses, but it leads to worse translation quality due to the NMT model error of preferring short translations. We replicate this result in Fig. 2: BLEU scores for MT initially improve over greedy search at smaller beam sizes but after reaching a peak at beam size of 4, we observe a dramatic drop in BLEU. The trajectory of the blue curves (GEC) is markedly different: the performance does not drop for large beams but saturates instead. The beam search curse affects tasks with high intrinsic uncertainty like MT but spares more certain tasks like GEC although both tasks use the same neural Transformer architecture.
To determine why the beam size affects NMT and GEC so differently we ran the exact decoding algorithm of Stahlberg and Byrne (2019) to find the global best hypotheses and counted search errors, i.e. the number of sentences in the test set for which beam search does not find the global best sequence. Our results confirm the findings of Stahlberg and Byrne (2019) that increasing the beam size leads to fewer NMT search errors (Fig. 3). Among our MT language pairs, English-German (MT-ende) suffers the most from the beam search curse and has the highest proportion of search errors in the test set. This is possibly because translation from English to German typically results in a longer sequence and thus more uncertainty. GEC differs significantly from NMT in the total number of search errors. For MT, even with a very large beam size of 500, beam search fails to find the mode for more than 20% of the sentences in every language pair. In contrast, for GEC we do not observe any search errors for beam sizes larger than 10. This suggests that task uncertainty determines the tractability of the search space and particularly the search for the mode.
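A search error in the sense used above can be reproduced on a toy example. The following Python sketch implements a plain beam search over a hypothetical three-symbol model; the model `toy_next_log_probs`, the `max_len` cap, and the early-stopping rule are assumptions made for illustration and do not describe the systems evaluated here.

```python
import math

EOS = "</s>"


def toy_next_log_probs(prefix):
    """Hypothetical stand-in for a model: log P(token | prefix, x)."""
    p_eos = min(0.9, 0.3 + 0.2 * len(prefix))
    rest = 1.0 - p_eos
    return {EOS: math.log(p_eos),
            "a": math.log(0.7 * rest),
            "b": math.log(0.3 * rest)}


def beam_search(next_log_probs, beam_size, max_len=20):
    """Return the best (score, hypothesis) found with the given beam size."""
    beams = [(0.0, ())]  # active (score, prefix) pairs
    finished = []        # complete hypotheses ending in EOS
    for _ in range(max_len):
        candidates = []
        for score, prefix in beams:
            for token, lp in next_log_probs(prefix).items():
                candidates.append((score + lp, prefix + (token,)))
        candidates.sort(reverse=True)
        beams = []
        for score, hyp in candidates[:beam_size]:
            (finished if hyp[-1] == EOS else beams).append((score, hyp))
        if not beams:
            break
        # Scores only decrease with length, so no active prefix can still win.
        if finished and max(finished)[0] >= beams[0][0]:
            break
    return max(finished) if finished else max(beams)
```

In this toy setting greedy search (`beam_size=1`) returns `a </s>` with probability 0.245, whereas `beam_size=2` finds the mode, the empty output with probability 0.3. Greedy search therefore commits a search error, yet its output is longer than the mode.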
Uncertainty also determines the computational costs of exact search. To abstract away from hardware and implementation details, we measure the time complexity of exact search by counting the number of explored states, i.e. the number of forward passes through the model, which is identical to the number of recursive calls of Algorithm 1. Fig. 4 plots the fraction of sentences in the test set for which the exact search explores a certain maximum number of states to terminate. For example, exact search returned the mode for around 50% of the MT sentences after exploring no more than 1000 states. With the same computational budget, however, it was able to find the mode for nearly 100% of the GEC sentences (blue curves). For some of the MT sentences, exact search needed to explore around 100K states, or even more in the case of Lithuanian-English (orange curve).
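The quantity plotted in such a curve can be computed with a few lines of Python. This is a sketch under the assumption that the number of explored states is available per sentence; the function names and the example numbers in the usage note are hypothetical.

```python
def fraction_within_budget(explored_states, budget):
    """Fraction of sentences whose exact search finished within `budget` states."""
    if not explored_states:
        raise ValueError("need at least one sentence")
    return sum(1 for s in explored_states if s <= budget) / len(explored_states)


def beam_search_states(beam_size, target_length):
    """Explored states of standard beam search: beam size times target length."""
    return beam_size * target_length
```

For instance, `fraction_within_budget([10, 500, 2000, 150000], 1000)` evaluates to 0.5, and a beam of size 4 over a 25-token target explores `beam_search_states(4, 25) == 100` states.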

Sentence-level uncertainty
In the previous paragraph we showed that MT, a task with high intrinsic uncertainty, suffers from more beam search errors and a less tractable search space than GEC, a task with relatively low intrinsic uncertainty. Figs. 5 and 6 demonstrate that this pattern is not only present at the task level but also at the sentence level. First, the bar charts show that there is a general trend towards more search errors and more explored states for longer sentences. Longer input sentences often result in higher entropy distributions (i.e. more uncertainty) since there are usually more ways to map a long sentence than a short one. We also see a pattern within each group, i.e. within a reference length interval, that shows that sentences with higher uncertainty u result in more search errors and a longer exact search runtime even when compared to other sentences with similar lengths. Table 5 lists the test set level correlation coefficients. (For comparison, the number of explored states in standard beam search is the beam size times the target sequence length.)
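A rank correlation between per-sentence uncertainty and, say, the number of explored states can be computed as in the following Python sketch. It implements Spearman's ρ as the Pearson correlation of average ranks (ties receive the mean of their positions); the function names are illustrative, and the sketch does not compute significance levels.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rank correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)
```

Any monotonically increasing relation yields ρ = 1 regardless of its shape, which is why a rank-based coefficient suits quantities such as explored-state counts that span several orders of magnitude.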

The Spread of Probability Mass
We argued in Sec. 4 that the ability to approximate the entire search space with a fixed set of candidates can be useful in training (Shen et al., 2016; Williams, 1992; Ranzato et al., 2015) and decoding (Kumar and Byrne, 2004; Eikema and Aziz, 2020), and proposed an exact n-best search algorithm. However, finding the exact n-best hypotheses is computationally much more expensive than finding the single-best hypothesis (mode). Therefore, to keep the runtime under control, we stopped n-best decoding after 1M explored states. Fig. 7 shows that the 1M threshold is not reached for n = 1 for any sentence: it was always possible to find and verify the mode. We can guarantee that the n = 100 best candidates returned by our algorithm are indeed the global best ones for around 90% of the MT-deen sentences (right end of the green curve in Fig. 7). The blue curves in Fig. 7 suggest that, as before, the GEC search space is much more tractable given that our exact n-best search algorithm was able to find the 100 global best hypotheses for all GEC sentences before reaching 1M explored states. Indeed, Fig. 8 shows that exact 100-best search terminated with fewer than 10K explored states for almost all GEC sentences while the pruning criterion in Eq. 5 is much less effective for the NMT search space (green curves in Fig. 8).
The cumulative probability mass of the set returned by exact n-best search is an upper bound for the cumulative probability mass of any hypothesis set with a cardinality of n. Despite the high number of search errors (Fig. 3), the probability mass covered by the n-best beam search hypotheses is very close to this upper bound. Fig. 9 shows that for n = 100 that difference is less than 0.001 for all setups except MT-fien. Since the difference in probability mass is negligible we ran our subsequent investigations of probability mass with beam search instead of exact search to save computational costs.
Table 5: Spearman's rank correlation coefficient ρ between the uncertainty u and the number of greedy search errors, the number of explored DFS states, and the 100-best cumulative probability mass. All correlations are significant with a p-value of less than 0.00001.
[...] (blue curves in Fig. 10). Fig. 11 provides even more insight: A beam size of 1000 covers 40% of the probability mass for nearly all sentences in the GEC test sets. Even more practical beam sizes of 10 cover more than half of the probability mass for around 75% of the GEC-conll14 sentences. The same plot looks very different for MT (Fig. 12): Covering half the probability mass is only possible for a tiny fraction of the MT sentences.
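The cumulative probability mass of an n-best list, and the fraction of sentences whose list covers a given share of the mass, can be computed as in the following Python sketch. It assumes each hypothesis comes with its sequence-level log-probability; the function names and the threshold parameter are illustrative.

```python
import math


def cumulative_mass(log_probs, n):
    """Probability mass covered by the n most likely hypotheses."""
    top = sorted(log_probs, reverse=True)[:n]
    if not top:
        return 0.0
    m = top[0]  # factor out the largest term for numerical stability
    return math.exp(m) * sum(math.exp(lp - m) for lp in top)


def fraction_covering(nbest_lists, n, threshold=0.5):
    """Fraction of sentences whose n-best list covers at least `threshold` mass."""
    if not nbest_lists:
        raise ValueError("need at least one sentence")
    hits = sum(1 for lps in nbest_lists if cumulative_mass(lps, n) >= threshold)
    return hits / len(nbest_lists)
```

Because the function selects the n largest log-probabilities before summing, it can be applied to the output of exact n-best search or of beam search alike.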
Sentence-level uncertainty
In Sec. 6.1 we reported that the effects caused by intrinsic uncertainty on the ability to find the mode are visible at both the task and the sentence level. Similarly, we can track down our observations about how uncertainty determines the probability mass of n-best lists at the sentence level. Fig. 13 shows that the cumulative probability mass in the n-best list decreases for longer sentences as the mappings of long sentences are more uncertain. Again, the trend within a group in Fig. 13 suggests that even among sentences with similar lengths, n-best lists for uncertain sentences (higher u) accumulate less probability mass. We make analogous observations for NMT (Fig. 14), although the total n-best probability mass is much smaller than for GEC.

Related Work
Ambiguity is one of the core challenges in MT, a fact that is supported (inter alia) by the long history of designing evaluation metrics that are robust against it (Papineni et al., 2002; Banerjee and Lavie, 2005; Sellam et al., 2020). In this work we examine the impact of ambiguity on the NMT search space, and show how it is related to various well-known issues of NMT models like the beam search curse (Koehn and Knowles, 2017), a pathology that has also been linked to the local normalization in sequence models (Sountsov and Sarawagi, 2016; Murray and Chiang, 2018) or poor model calibration (Kumar and Sarawagi, 2019).
Our work is heavily inspired by Ott et al. (2018) who analyzed different kinds of uncertainty in NMT. In particular, they found that NMT spreads out the probability mass over a large number of candidates, and connected the beam search curse with uncertainty. We confirm their results and extend their line of research along the following directions: First, we introduce a measure for uncertainty in multi-reference test sets, and show that the negative effects of uncertainty are visible even on the sentence level. Second, we propose an exact n-best search algorithm and demonstrate how it can be used to analyze the spread of probability mass. Third, we focus not only on MT but also on GEC.
Stahlberg and Byrne (2019) showed that beam search errors often obscure the length deficiency of the NMT modes, and reducing search errors by using large beams exposes this model error. In this work, we found that these mechanics are limited to NMT: GEC does not suffer from the beam search curse since search errors are rare and modes are not too short.
Eikema and Aziz (2020) suggested that picking a hypothesis based solely on probability is erratic because NMT spreads out the probability mass over a large set of hypotheses with similar probabilities. Therefore, alternative approaches that incorporate MT-specific metrics such as BLEU (Papineni et al., 2002) or BLEURT (Sellam et al., 2020) in addition to the probabilities have recently been in the focus of research, including minimum Bayes risk decoding (Eikema and Aziz, 2020, 2021; Müller and Sennrich, 2021), Monte-Carlo tree search (Leblond et al., 2021), and energy-based (Bhattacharyya et al., 2021) or discriminatively trained (Lee et al., 2021) rerankers. Our work on how uncertainty determines the spread of probability mass is relevant to those approaches.

Conclusion
We identified a major culprit behind various inference-related issues in sequence-to-sequence models such as the intractability of the search space, degenerate outputs from large-beam or exact search, and the large spread in probability mass over the output space. This factor is intrinsic uncertainty: the existence of multiple ways to correctly map an input sequence. We measured the intrinsic uncertainty of input sentences as the degree of agreement between multiple references and showed that ambiguous sentences typically result in a higher number of beam search errors and an exceedingly flat output distribution. We also found that known NMT pathologies such as the beam search curse or inadequate modes do not extend to less ambiguous tasks like GEC despite using the same neural architecture.