Searching for Search Errors in Neural Morphological Inflection

Neural sequence-to-sequence models are currently the predominant choice for language generation tasks. Yet, on word-level tasks, exact inference of these models reveals the empty string is often the global optimum. Prior works have speculated this phenomenon is a result of the inadequacy of neural models for language generation. However, in the case of morphological inflection, we find that the empty string is almost never the most probable solution under the model. Further, greedy search often finds the global optimum. These observations suggest that the poor calibration of many neural models may stem from characteristics of a specific subset of tasks rather than general ill-suitedness of such models for language generation.


Introduction
Neural sequence-to-sequence models are omnipresent in the field of natural language processing due to their impressive performance. They hold state of the art on a myriad of tasks, e.g., neural machine translation (NMT; Ott et al., 2018b) and abstractive summarization (AS; Lewis et al., 2019). Yet, an undesirable property of these models has been repeatedly observed in word-level tasks: When using beam search as the decoding strategy, increasing the beam width beyond a size of k = 5 often leads to a drop in the quality of solutions (Murray and Chiang, 2018; Yang et al., 2018; Cohen and Beck, 2019). Further, in the context of NMT, it has been shown that the empty string is frequently the most probable solution under the model (Stahlberg and Byrne, 2019). Some suggest this is a manifestation of the general inadequacy of neural models for language generation tasks (Koehn and Knowles, 2017; Kumar and Sarawagi, 2019; Holtzman et al., 2020; Stahlberg, 2020); in this work, we find evidence demonstrating otherwise.
Table 1:
        k = 1    k = 10   k = 100  k = 500
NMT     63.1%    46.1%    44.3%    6.4%
MI      0.8%     0.0%     0.0%     0.0%
Sequence-to-sequence transducers for character-level tasks often follow the architectures of their word-level counterparts (Faruqui et al., 2016; Lee et al., 2017), and have likewise achieved state-of-the-art performance on, e.g., morphological inflection generation (Wu et al., 2020) and grapheme-to-phoneme conversion (Yolchuyeva et al., 2019). Given prior findings, we might expect to see the same degenerate behavior in these models; however, we do not. We run a series of experiments on morphological inflection (MI) generators to explore whether neural transducers for this task are similarly poorly calibrated, i.e., whether their distributions are far from the true distribution p(y | x). We evaluate the performance of two character-level sequence-to-sequence transducers using different decoding strategies; our results, previewed in Tab. 1, show that evaluation metrics do not degrade with larger beam sizes as in NMT or AS. Additionally, only in extreme circumstances, e.g., low-resource settings with fewer than 100 training samples, is the empty string ever the global optimum under the model.
Our findings directly refute the claim that neural architectures are inherently inadequate for modeling language generation tasks. Instead, our results admit two potential causes of the degenerate behavior observed in tasks such as NMT and AS: (1) lack of a deterministic mapping between input and output and (2) a (perhaps irreparable) discrepancy between sample complexity and training resources. Our results alone are not sufficient to accept or reject either hypothesis, and thus we leave these as future research directions.

Neural Transducers
Sequence-to-sequence transduction is the transformation of an input sequence into an output sequence. Tasks involving this type of transformation are often framed probabilistically, i.e., we model the probability of mapping one sequence to another. On many tasks of this nature, neural sequence-to-sequence models (Sutskever et al., 2014; Bahdanau et al., 2015) hold state of the art.
Formally, a neural sequence-to-sequence model defines a probability distribution p_θ(y | x) parameterized by a neural network with a set of learned weights θ for an input sequence x = x_1, x_2, ... and output sequence y = y_1, y_2, .... Morphological inflection and NMT are two such tasks, wherein our outputs are both strings. Neural sequence-to-sequence models are typically locally normalized, i.e., p_θ factorizes as follows:
p_\theta(\mathbf{y} \mid \mathbf{x}) = \prod_{t=1}^{|\mathbf{y}|} p_\theta(y_t \mid \mathbf{x}, \mathbf{y}_{<t})    (1)
Given a vocabulary V, each conditional p_θ(y_t | x, y_{<t}) is a distribution over V ∪ {EOS} and y_0 := BOS. We consider p_θ(y | x) to be well-calibrated if its probability estimates are representative of the true likelihood that a solution y is correct.
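As a minimal sketch of this local normalization, the snippet below scores a string by summing the log-probability of each symbol given its prefix, followed by the EOS term. The conditional table `toy_conditional` is a made-up distribution for illustration only (it ignores the input x) and is not one of the models studied here.

```python
import math

EOS = "<eos>"

def toy_conditional(prefix):
    """Hypothetical next-symbol distribution over V ∪ {EOS}, with V = {a, b}."""
    if len(prefix) == 0:
        return {"a": 0.6, "b": 0.3, EOS: 0.1}
    return {"a": 0.2, "b": 0.2, EOS: 0.6}

def sequence_log_prob(y, conditional=toy_conditional):
    """Sum log p(y_t | y_<t) over the symbols of y, then add the EOS term."""
    total = 0.0
    prefix = ()
    for symbol in tuple(y) + (EOS,):
        total += math.log(conditional(prefix)[symbol])
        prefix += (symbol,)
    return total

empty_score = sequence_log_prob("")   # log p(EOS | BOS)
a_score = sequence_log_prob("a")      # log p(a | BOS) + log p(EOS | a)
```

Under this toy table, the empty string receives probability 0.1 and the string "a" receives 0.6 · 0.6 = 0.36, so the empty string is not the most probable output.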
Morphological Inflection. In the task of morphological inflection, x is an encoding of the lemma concatenated with a flattened morphosyntactic description (MSD) and y is the target inflection. As a concrete example, consider inflecting the German word Bruder into the genitive plural, as shown in Tab. 2. Then, x is the string B r u d e r GEN PL and y is the string B r ü d e r. As this demonstrates, morphological inflection generation is, by its nature, modeled at the character level (Faruqui et al., 2016; Wu and Cotterell, 2019), i.e., our target vocabulary V is a set of characters in the language. Note that y ∈ V*, but x ∉ V* due to the additional encoding of the MSD. This stands in contrast to NMT, which is typically performed on a (sub)word level, making the vocabulary size orders of magnitude larger.
Another important differentiating factor of morphological inflection generation in comparison to many other generation tasks in NLP is the one-to-one mapping between source and target. In contrast, there are almost always many correct ways to translate a sentence into another language or to summarize a large piece of text; this characteristic manifests itself in training data where a single phrase has instances of different mappings, making tasks such as translation and summarization inherently ambiguous.
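The source and target encoding for the Bruder example can be sketched as follows. The helper names `encode_source` and `encode_target` are hypothetical and only illustrate the description above; the actual pre-processing used in the experiments may differ in detail.

```python
def encode_source(lemma, msd_tags):
    """Lemma characters followed by the flattened MSD tags."""
    return list(lemma) + list(msd_tags)

def encode_target(inflection):
    """The target is modeled purely at the character level."""
    return list(inflection)

x = encode_source("Bruder", ["GEN", "PL"])
y = encode_target("Brüder")

# Character vocabulary V for this single example (an illustrative simplification).
V = set("Bruder") | set("Brüder")
```

Here every symbol of `y` is a character in `V`, whereas `x` also contains the tags GEN and PL, which is why x is not an element of V*.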

Decoding
In the case of probabilistic models, the decoding problem is the search for the most probable sequence among valid sequences V* under the model p_θ:
\mathbf{y}^\star = \operatorname*{argmax}_{\mathbf{y} \in V^*} \log p_\theta(\mathbf{y} \mid \mathbf{x})    (2)
This problem is also known as maximum-a-posteriori (MAP) inference. Decoding is often performed with a heuristic search method such as greedy or beam search (Reddy, 1977), since performing exact search can be computationally expensive, if not impossible. While for a deterministic task, greedy search is optimal under a Bayes optimal model, most text generation tasks benefit from using beam search. However, text quality almost invariably decreases for beam sizes larger than k = 5. This phenomenon is sometimes referred to as the beam search curse, and has been investigated in detail by a number of scholarly works (Koehn and Knowles, 2017; Murray and Chiang, 2018; Yang et al., 2018; Stahlberg and Byrne, 2019; Cohen and Beck, 2019; Eikema and Aziz, 2020).
Table 3: Prediction accuracy (averaged across languages) by decoding strategy for Transformer and HMM. We include breakdown for low-resource and high-resource trained models. k indicates beam width.
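A minimal sketch of beam search over a locally normalized model follows; greedy search is the special case k = 1. The distribution `toy_log_probs` is a made-up table, chosen so that greedy search commits a search error, and `max_len` is an assumed safety cap rather than a parameter from the experiments.

```python
import math

EOS = "<eos>"

def toy_log_probs(prefix):
    """Hypothetical next-symbol log-probabilities over {a, b, EOS}."""
    if prefix == ():
        probs = {"a": 0.5, "b": 0.4, EOS: 0.1}
    elif prefix == ("a",):
        probs = {"a": 0.3, "b": 0.3, EOS: 0.4}
    elif prefix == ("b",):
        probs = {"a": 0.05, "b": 0.05, EOS: 0.9}
    else:
        probs = {"a": 0.1, "b": 0.1, EOS: 0.8}
    return {s: math.log(p) for s, p in probs.items()}

def beam_search(log_probs, k, max_len=10):
    """Keep the k best expansions per step; return the best finished (score, string)."""
    beam = [(0.0, ())]
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, prefix in beam:
            for symbol, lp in log_probs(prefix).items():
                candidates.append((score + lp, prefix + (symbol,)))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = []
        for score, seq in candidates[:k]:
            if seq[-1] == EOS:
                finished.append((score, seq[:-1]))  # strip EOS from the output
            else:
                beam.append((score, seq))
        if not beam:
            break
    return max(finished, key=lambda c: c[0]) if finished else None

greedy = beam_search(toy_log_probs, k=1)
wider = beam_search(toy_log_probs, k=2)
```

In this toy case, greedy search returns "a" (probability 0.2) because a is the locally best first symbol, while k = 2 recovers "b" (probability 0.36), which is the more probable string.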
Exact decoding can be seen as the case of beam search where the beam size is effectively stretched to infinity. By considering the complete search space, it finds the globally best solution under the model p_θ. While, as previously mentioned, exact search can be computationally expensive, we can employ efficient search strategies due to some properties of p_θ. Specifically, from Eq. (1), we can see that the scoring function for sequences y is monotonically decreasing in t, since each additional symbol multiplies the score by a probability of at most 1. We can therefore find the provably optimal solution with Dijkstra's algorithm (Dijkstra, 1959), which terminates and returns the global optimum the first time it encounters an EOS. Additionally, to prevent a large memory footprint, we can lower-bound the search using any complete hypothesis, e.g., the empty string or a solution found by beam search (Stahlberg and Byrne, 2019; Meister et al., 2020). That is, we can stop exploring a partial solution as soon as its score falls below the score of such a hypothesis. Although exact search is an exponential-time method in this setting, we see that, in practice, it terminates quickly due to the peakiness of p_θ (see App. A). While the effects of exact decoding and beam search decoding with large beam widths have been explored for a number of word-level tasks (Stahlberg and Byrne, 2019; Cohen and Beck, 2019; Eikema and Aziz, 2020), to the best of our knowledge, they have not yet been explored for any character-level sequence-to-sequence tasks.
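The best-first procedure described above can be sketched with a priority queue. This is an illustrative implementation under a made-up distribution (`toy_log_probs`), not the SGNMT code used in the experiments; the `lower_bound` argument stands for the score of any complete hypothesis, such as one found by beam search.

```python
import heapq
import math

EOS = "<eos>"

def toy_log_probs(prefix):
    """Hypothetical next-symbol log-probabilities over {a, b, EOS}."""
    if prefix == ():
        probs = {"a": 0.5, "b": 0.4, EOS: 0.1}
    elif prefix == ("a",):
        probs = {"a": 0.3, "b": 0.3, EOS: 0.4}
    elif prefix == ("b",):
        probs = {"a": 0.05, "b": 0.05, EOS: 0.9}
    else:
        probs = {"a": 0.1, "b": 0.1, EOS: 0.8}
    return {s: math.log(p) for s, p in probs.items()}

def exact_search(log_probs, lower_bound=-math.inf):
    """Always expand the highest-scoring prefix; the first popped EOS is optimal."""
    heap = [(0.0, ())]  # entries are (negated score, prefix); heapq is a min-heap
    while heap:
        neg_score, prefix = heapq.heappop(heap)
        if prefix and prefix[-1] == EOS:
            return -neg_score, prefix[:-1]
        for symbol, lp in log_probs(prefix).items():
            score = -neg_score + lp
            # Scores never increase with length, so anything already below
            # the bound cannot beat the known complete hypothesis.
            if score >= lower_bound:
                heapq.heappush(heap, (-score, prefix + (symbol,)))
    return None

best = exact_search(toy_log_probs)
bounded = exact_search(toy_log_probs, lower_bound=math.log(0.2))
```

Because log-probabilities are non-positive, extending a prefix can only lower its score, which is the condition that makes the first complete hypothesis popped from the queue the global optimum. The bound changes only how many prefixes are stored, not the result.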

Experiments
We run a series of experiments using different decoding strategies to generate predictions from morphological inflection generators. We report results for two near-state-of-the-art models: a multilingual Transformer (Wu et al., 2020) and a (neuralized) hidden Markov model (HMM; Wu and Cotterell, 2019). For reproducibility, we mimic their proposed architectures and exactly follow their data pre-processing steps, training strategies and hyperparameter settings.
Data. We use the data provided by the SIGMORPHON 2020 shared task (Vylomova et al., 2020), which features lemmas, inflections, and corresponding MSDs in the UniMorph schema (Kirov et al., 2018) in 90 languages in total. The set of languages is typologically diverse (spanning 18 language families) and contains both high- and low-resource examples, providing a spectrum over which we can evaluate model performance. The full dataset statistics can be found on the task homepage. When reporting results, we consider languages with < 1000 and ≥ 10000 training samples as low- and high-resource, respectively.
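The resource grouping can be written as a small helper. The function name and the "mid" label for languages between the two thresholds are assumptions made for this sketch; the text only defines the low- and high-resource groups.

```python
def resource_group(num_train_samples):
    """Bucket a language by training-set size using the thresholds above."""
    if num_train_samples < 1000:
        return "low"
    if num_train_samples >= 10000:
        return "high"
    return "mid"  # hypothetical label: not reported as a separate group

zarma_group = resource_group(56)  # Zarma has 56 training samples
```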
Decoding Strategies. We decode morphological inflection generators using exact search and beam search for a range of beam widths. We use the SGNMT library for decoding (Stahlberg et al., 2017), to which we add an implementation of Dijkstra's algorithm.

Results
Tab. 3 shows that the accuracy of predictions from neural MI generators generally does not decrease when larger beam sizes are used for decoding; this observation holds for both model architectures. While it may be expected that models for low-resource languages generally perform worse than those for high-resource ones, this disparity is only prominent for HMMs, where the difference between high- and low-resource accuracy is ≈ 24% vs. ≈ 10% for the Transformers. Notably, for the HMM, the global optimum under the model is the empty string far more often for low-resource languages than it is for high-resource ones (see Tab. 5). We can explicitly see the inverse relationship between the log-probability of the empty string and resource size in Fig. 1. In general, across models for all 90 languages, the global optimum is rarely the empty string (Tab. 5). Indeed, under the Transformer-based transducer, the empty string was never the global optimum. This is in contrast to the findings of Stahlberg and Byrne (2019), who found for word-level NMT that the empty string was the optimal translation in more than 50% of cases, even under state-of-the-art models. Rather, the average log-probability of the empty string (which is quite low) and that of the chosen inflection lie far apart (Tab. 4).

Discussion
Our findings admit two potential hypotheses for poor calibration of neural models in certain language generation tasks, a phenomenon we do not observe in morphological inflection. First, the tasks in which we observe this property are ones that lack a deterministic mapping, i.e., tasks for which there may be more than one correct solution for any given input. As a consequence, probability mass may be spread over an arbitrarily large number of hypotheses (Ott et al., 2018a; Eikema and Aziz, 2020). In contrast, the task of morphological inflection has a near-deterministic mapping. We observe this empirically in Tab. 4, which shows that the probability of the global optimum on average covers most of the available probability mass, a phenomenon also observed by Peters and Martins (2019). Further, as shown in Tab. 6, the dearth of search errors even when using greedy search suggests there are rarely competing solutions under the model. We posit it is the lack of ambiguity in morphological inflection that allows for the well-calibrated models we observe.
Table 5: Percentage of cases in which the empty string is the global optimum under the model.
                 HMM        Transformer
Overall          2.03%      0%
Low-resource     8.65%      0%
High-resource    0.0002%    0%
Second, our experiments contrasting high- and low-resource settings indicate insufficient training data may be the main cause of the poor calibration in sequence-to-sequence models for language generation tasks. We observe that models for MI trained on fewer data typically place more probability mass on the empty string. As an extreme example, we consider the case of the Zarma language, whose training set consists of only 56 samples. Under the HMM, the average log-probabilities of the generated inflection and the empty string are very close (−8.58 and −8.77, respectively). Furthermore, on the test set, the global optimum of the HMM model for Zarma is the empty string 81.25% of the time.
From this example, we can conjecture that lack of sufficient training data may manifest itself as the (relatively) high probability of the empty string or the (relatively) low probability of the optimum. We can extrapolate to models for NMT and other word-level tasks, for which we frequently see the above phenomenon. Specifically, our experiments suggest that when neural language generators frequently place high probability on the empty string, there may be a discrepancy between the available training resources and the number of samples needed to successfully learn the target function. While this at first seems an easy problem to fix, we expect the number of resources needed in tasks such as NMT and AS is much larger than that for MI if not due to the size of the output space alone; perhaps so large that they are essentially unattainable. Under this explanation, for certain tasks, there may not be a straightforward fix to the degenerate behavior observed in some neural language generators.

Conclusion
In this work, we investigate whether the poor calibration often seen in sequence-to-sequence models for word-level tasks also occurs in models for morphological inflection. We find that character-level models for morphological inflection are generally well-calibrated, i.e., the probability of the globally best solution is almost invariably much higher than that of the empty string. This suggests the degenerate behavior observed in neural models for certain word-level tasks is not due to the inherent incompatibility of neural models for language generation. Rather, we find evidence that poor calibration may be linked to specific characteristics of a subset of these tasks, and suggest directions for future exploration of this phenomenon.
Table 7: Average time (s) for inflection generation by decoding strategy. Breakdown by resource group is included.