A Framework for Bidirectional Decoding: Case Study in Morphological Inflection

Transformer-based encoder-decoder models that generate outputs in a left-to-right fashion have become standard for sequence-to-sequence tasks. In this paper, we propose a framework for decoding that produces sequences from the "outside-in": at each step, the model chooses to generate a token on the left, generate a token on the right, or join the left and right sequences. We argue that this is more principled than prior bidirectional decoders. Our proposal supports a variety of model architectures and includes several training methods, such as a dynamic programming algorithm that marginalizes out the latent ordering variable. Our model sets a new state of the art (SOTA) on the 2022 and 2023 shared tasks, beating the next best systems by over 4.7 and 2.7 points in average accuracy, respectively. The model performs particularly well on long sequences, can implicitly learn the split point of words composed of a stem and affix, and improves most over the baseline on datasets that have fewer unique lemmas (but more examples per lemma).


Introduction
Transformer-based encoder-decoder architectures (Bahdanau et al., 2014; Vaswani et al., 2017) that decode sequences from left to right have become dominant for sequence-to-sequence tasks. While this approach is straightforward and intuitive, prior research has shown that models suffer from this arbitrary constraint. For example, models that decode left-to-right are often more likely to miss tokens near the end of the sequence, while right-to-left models are more prone to making mistakes near the beginning (Zhang et al., 2019; Zhou et al., 2019a). This is a result of the "snowballing" effect, whereby the model's conditioning on its own incorrect predictions can lead future predictions to be incorrect (Bengio et al., 2015; Liu et al., 2016).
We explore this issue for the task of morphological inflection, where the goal is to learn a mapping from a word's lexeme (e.g. the lemma walk) to a particular form (e.g. walked) specified by a set of morphosyntactic tags (e.g. V;V.PTCP;PST). This has been the focus of recent shared tasks (Cotterell et al., 2016, 2017, 2018; McCarthy et al., 2019; Vylomova et al., 2020; Pimentel et al., 2021; Kodner et al., 2022; Goldman et al., 2023). Most approaches use neural encoder-decoder architectures, e.g. recurrent neural networks (RNNs) (Aharoni and Goldberg, 2017; Wu and Cotterell, 2019) or transformers (Wu et al., 2021). To our knowledge, Canby et al. (2020) is the only model that uses bidirectional decoding for inflection; it decodes the sequence in both directions simultaneously and returns the one with higher probability.
In this paper, we propose a novel framework for bidirectional decoding that supports a variety of model architectures. Unlike previous work (§2), at each step the model chooses to generate a token on the left, generate a token on the right, or join the left and right sequences.
This proposal is appealing for several reasons. As a general framework, this approach supports a wide variety of model architectures that may be task-specific. Further, it generalizes L2R and R2L decoders, as the model can choose to generate sequences in a purely unidirectional fashion. Finally, the model is able to decide which generation order is best for each sequence, and can even produce parts of a sequence from each direction. This is particularly appropriate for a task like inflection, where many words are naturally split into stem and affix. For example, when producing the form walked, the model may choose to generate the stem walk from the left and the suffix ed from the right.
We explore several methods for training models under this framework, and find that they are highly effective on the 2023 SIGMORPHON shared task on inflection (Goldman et al., 2023). Our method improves by over 4 points in average accuracy over a typical L2R model, and one of our loss functions is particularly adept at learning split points for words with a clear affix. We also set SOTA on both the 2022 and 2023 shared tasks (Kodner et al., 2022), which have very different data distributions.

Prior Bidirectional Decoders
Various bidirectional decoding approaches have been proposed for tasks such as machine translation and abstractive summarization. Some use a form of regularization to encourage the outputs from both directions to agree (Liu et al., 2016; Zhang et al., 2019; Shan et al., 2019); others first decode the entire sequence in the R2L direction and then condition on that sequence when decoding in the L2R direction (Zhang et al., 2018; Al-Sabahi et al., 2018). Still other methods use synchronous decoding, in which the model decodes in both directions at the same time, either meeting in the center (Zhou et al., 2019b; Imamura and Sumita, 2020) or proceeding until each direction's hypothesis is complete (Zhou et al., 2019a; Xu and Yvon, 2021). Lawrence et al. (2019) allow the model to look into the future by filling placeholder tokens at each timestep.

A Bidirectional Decoding Framework
The following sections present a general framework for training and decoding bidirectional models that is agnostic to the model architecture, subject to the constraints discussed in §3.3.

Probability Factorization
For unidirectional models, the probability of an L2R sequence $\overrightarrow{y} = y_1 \cdots y_n$ or an R2L sequence $\overleftarrow{y} = y_n \cdots y_1$ given an input $x$ is defined as

$$P(\overrightarrow{y} \mid x) = \prod_{i=1}^{n} P(\overrightarrow{y}_i \mid \overrightarrow{y}_{<i}, x) \qquad (1)$$

$$P(\overleftarrow{y} \mid x) = \prod_{j=1}^{n} P(\overleftarrow{y}_j \mid \overleftarrow{y}_{<j}, x) \qquad (2)$$

where $\overrightarrow{y}_i = y_i$ or $\overleftarrow{y}_j = y_{n-j+1}$ is the $i$th or $j$th token in a particular direction. Generation begins with a start-of-sentence token; at each step a token is chosen based on those preceding it, and the process halts once an end-of-sentence token is predicted. In contrast, our bidirectional scheme starts with an empty prefix $ and suffix #. At each timestep, the model chooses to generate the next token of either the prefix or the suffix, and then whether or not to join the prefix and suffix. If a join is predicted, generation is complete.
We define an ordering $o = o^{(1)} \cdots o^{(n)}$ as a sequence of left and right decisions: that is, $o^{(t)} \in \{L, R\}$. We use $y^{(t)}$ to refer to the token generated at time $t$ under a particular ordering, and $\overrightarrow{y}^{(\le t)}$ and $\overleftarrow{y}^{(\le t)}$ to refer to the prefix and suffix generated up to (and including) time $t$. For example, the word walked might be derived under the ordering $o = (L, L, R, L, L, L)$:

$, # → $w, # → $wa, # → $wa, d# → $wal, d# → $walk, d# → $walke, d# → (join) walked

Dropping the dependence on $x$ for notational convenience, we define the joint probability of output sequence $y$ and ordering $o$ as

$$P(y, o) = \prod_{t=1}^{n} P(o^{(t)} \mid \overrightarrow{y}^{(<t)}, \overleftarrow{y}^{(<t)}) \; P(y^{(t)} \mid \overrightarrow{y}^{(<t)}, \overleftarrow{y}^{(<t)}, o^{(t)}) \; Q^{(t)} \qquad (3)$$

where $Q^{(t)}$ is the probability of joining (or not joining) the prefix and suffix:

$$Q^{(t)} = \begin{cases} P(\text{join} \mid \overrightarrow{y}^{(\le t)}, \overleftarrow{y}^{(\le t)}) & \text{if } t = n \\ 1 - P(\text{join} \mid \overrightarrow{y}^{(\le t)}, \overleftarrow{y}^{(\le t)}) & \text{otherwise.} \end{cases}$$
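To make the factorization concrete, the following sketch scores a single (sequence, ordering) pair under Equation 3. The callables p_order, p_token, and p_join are placeholders for the model's local distributions; their names and signatures are our own, not the paper's.

```python
import math

def joint_log_prob(y, order, p_order, p_token, p_join):
    """Return log P(y, o | x) for output y (a list of tokens) and an ordering
    o in {'L','R'}^|y|; the input x is assumed baked into the callables."""
    prefix, suffix = [], []
    logp = 0.0
    for t, side in enumerate(order):
        logp += math.log(p_order(side, prefix, suffix))
        if side == 'L':
            tok = y[len(prefix)]                  # next unplaced token from the left
            logp += math.log(p_token(tok, prefix, suffix, side))
            prefix.append(tok)
        else:
            tok = y[len(y) - 1 - len(suffix)]     # next unplaced token from the right
            logp += math.log(p_token(tok, prefix, suffix, side))
            suffix.insert(0, tok)
        pj = p_join(prefix, suffix)               # Q^(t): join only at the last step
        logp += math.log(pj if t == len(y) - 1 else 1.0 - pj)
    return logp
```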

Likelihood and MAP Inference
To compute the likelihood of a particular sequence $y$, we must marginalize over all orderings: $P(y \mid x) = \sum_o P(y, o \mid x)$. Since we cannot enumerate all $2^{|y|}$ orderings, we have developed an exact $O(|y|^2)$ dynamic programming algorithm, reminiscent of the forward algorithm for HMMs.
To simplify notation, let $P_L(\overrightarrow{y}_i \mid \overrightarrow{y}_{<i}, \overleftarrow{y}_{<j})$ (or $P_R(\overleftarrow{y}_j \mid \overrightarrow{y}_{<i}, \overleftarrow{y}_{<j})$) be the probability of generating the $i$th token from the left (or the $j$th token from the right), conditioned on $\overrightarrow{y}_{<i}$ and $\overleftarrow{y}_{<j}$, the prefix and suffix generated thus far. Let $Q_{ij}$ be the join probability for $\overrightarrow{y}_{\le i}$ and $\overleftarrow{y}_{\le j}$:

$$Q_{ij} = \begin{cases} P(\text{join} \mid \overrightarrow{y}_{\le i}, \overleftarrow{y}_{\le j}) & \text{if } i + j = |y| \\ 1 - P(\text{join} \mid \overrightarrow{y}_{\le i}, \overleftarrow{y}_{\le j}) & \text{otherwise.} \end{cases} \qquad (4)$$

Finally, denote the joint probability of a prefix $\overrightarrow{y}_{\le i}$ and suffix $\overleftarrow{y}_{\le j}$ by $f[i, j]$. We set the probability of an empty prefix and suffix (the base case) to 1:

$$f[0, 0] = 1.$$

The probability of a non-empty prefix $\overrightarrow{y}_{\le i}$ and empty suffix $\epsilon$ can be computed by multiplying $f[i-1, 0]$ (the probability of prefix $\overrightarrow{y}_{<i}$ and empty suffix $\epsilon$) by $P_L(\overrightarrow{y}_i \mid \overrightarrow{y}_{<i}, \epsilon)$ (the probability of generating $\overrightarrow{y}_i$) and the join probability $Q_{i0}$:

$$f[i, 0] = f[i-1, 0] \cdot P_L(\overrightarrow{y}_i \mid \overrightarrow{y}_{<i}, \epsilon) \cdot Q_{i0}.$$

Analogously, we define

$$f[0, j] = f[0, j-1] \cdot P_R(\overleftarrow{y}_j \mid \epsilon, \overleftarrow{y}_{<j}) \cdot Q_{0j}.$$

Finally, $f[i, j]$ represents the case where both prefix $\overrightarrow{y}_{\le i}$ and suffix $\overleftarrow{y}_{\le j}$ are non-empty. This prefix-suffix pair can be produced either by appending $\overrightarrow{y}_i$ to the prefix $\overrightarrow{y}_{<i}$ and leaving the suffix unchanged, or by appending $\overleftarrow{y}_j$ to the suffix $\overleftarrow{y}_{<j}$ and leaving the prefix unchanged. The sum of the probabilities of these cases gives the recurrence:

$$f[i, j] = \Big( f[i-1, j] \cdot P_L(\overrightarrow{y}_i \mid \overrightarrow{y}_{<i}, \overleftarrow{y}_{\le j}) + f[i, j-1] \cdot P_R(\overleftarrow{y}_j \mid \overrightarrow{y}_{\le i}, \overleftarrow{y}_{<j}) \Big) \cdot Q_{ij}.$$

After filling out the dynamic programming table $f$, the marginal probability $P(y)$ can be computed by summing all entries $f[i, j]$ where $i + j = |y|$:

$$P(y) = \sum_{i + j = |y|} f[i, j].$$

If all local probabilities can be calculated in constant time, the runtime of this algorithm is $O(|y|^2)$.
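A compact sketch of this forward pass, in log space for numerical stability, is given below. The callables p_left, p_right, and p_join are placeholders for the model's local probabilities, indexed by prefix length i and suffix length j.

```python
import math

def logaddexp(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == float('-inf'):
        return b
    if b == float('-inf'):
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def marginal_log_prob(y, p_left, p_right, p_join):
    """Forward DP for log P(y | x) over O(|y|^2) cells. p_left(i, j) stands
    for P_L(y_i | prefix of length i-1, suffix of length j), p_right(i, j)
    is the analogous R2L probability, and p_join(i, j) is the raw join
    probability for the pair (i, j)."""
    n = len(y)
    NEG_INF = float('-inf')
    f = [[NEG_INF] * (n + 1) for _ in range(n + 1)]
    f[0][0] = 0.0                              # empty prefix and suffix
    for i in range(n + 1):
        for j in range(n + 1 - i):             # only cells with i + j <= n are valid
            if i == 0 and j == 0:
                continue
            acc = NEG_INF
            if i > 0:
                acc = logaddexp(acc, f[i - 1][j] + math.log(p_left(i, j)))
            if j > 0:
                acc = logaddexp(acc, f[i][j - 1] + math.log(p_right(i, j)))
            q = p_join(i, j)                   # Q_ij: join exactly when i + j = n
            f[i][j] = acc + math.log(q if i + j == n else 1.0 - q)
    total = NEG_INF
    for i in range(n + 1):                     # sum over all split points i + j = n
        total = logaddexp(total, f[i][n - i])
    return total
```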
As an aside, the MAP probability, i.e. the probability of the best ordering for a given sequence, can be calculated by replacing each sum with a max:

$$\max_o P(y, o) = \max_{i + j = |y|} f_{\max}[i, j], \quad f_{\max}[i, j] = \max\big( f_{\max}[i-1, j] \cdot P_L(\cdot), \; f_{\max}[i, j-1] \cdot P_R(\cdot) \big) \cdot Q_{ij}.$$

The best ordering itself can be found with a backtracking procedure similar to Viterbi decoding for HMMs.
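A sketch of this max-product variant with backtracking, using the same placeholder interface as the forward pass above:

```python
import math

def best_ordering(y, p_left, p_right, p_join):
    """Replace each sum in f with a max and keep backpointers; backtracking
    from the best final cell recovers the MAP ordering as 'L'/'R' decisions."""
    n = len(y)
    NEG_INF = float('-inf')
    f = [[NEG_INF] * (n + 1) for _ in range(n + 1)]
    back = [[None] * (n + 1) for _ in range(n + 1)]
    f[0][0] = 0.0
    for i in range(n + 1):
        for j in range(n + 1 - i):
            if i == 0 and j == 0:
                continue
            cands = []
            if i > 0:
                cands.append((f[i - 1][j] + math.log(p_left(i, j)), 'L'))
            if j > 0:
                cands.append((f[i][j - 1] + math.log(p_right(i, j)), 'R'))
            score, move = max(cands)
            q = p_join(i, j)
            f[i][j] = score + math.log(q if i + j == n else 1.0 - q)
            back[i][j] = move
    i = max(range(n + 1), key=lambda k: f[k][n - k])   # best split point
    j = n - i
    moves = []
    while i > 0 or j > 0:                              # backtrack to (0, 0)
        moves.append(back[i][j])
        if back[i][j] == 'L':
            i -= 1
        else:
            j -= 1
    return list(reversed(moves))
```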

Why does dynamic programming work?
Dynamic programming (DP) only works for this problem if the local probabilities (i.e. the token, join, and order probabilities) used to compute $f[i, j]$ depend only on the prefix and suffix corresponding to that cell, not on the particular ordering that produced them. This is similar to how the Viterbi algorithm relies on the fact that HMM emission probabilities depend only on the hidden state and not on the path taken.
To satisfy this requirement, the model's architecture must be chosen carefully. Any model that simply takes a prefix and suffix as input and returns the corresponding local probabilities suffices. However, one must be careful when designing a model whose hidden representations are shared or reused across timesteps. This is particularly problematic if hidden states computed from both the prefix and suffix are reused: the internal representations would then differ depending on the order in which the prefix and suffix were generated, so a DP cell would have to account for all possible paths to that cell, breaking the polynomial nature of DP.

Training
We propose two different loss functions to train a bidirectional model. Based on our probability factorization, we must learn the token, join, and order probabilities at each timestep.
Our first loss function $\mathcal{L}_{xH}(\theta)$ trains each of these probabilities separately using cross-entropy loss. However, since the ordering is a latent variable, it cannot be trained with explicit supervision. Hence, we fix the order probability to 0.5 at each timestep, making all orderings equiprobable.
We then define $S$ to contain the indices of all valid prefix-suffix pairs of a given sequence $y$:

$$S = \{ (i, j) : i, j \ge 0 \text{ and } i + j \le |y| \}.$$

Finally, we define a simple loss $\mathcal{L}_{xH}(\theta)$ that averages the cross-entropy loss for the token probabilities (based on the next token in $\overrightarrow{y}$ or $\overleftarrow{y}$) and the join probabilities (based on whether the given prefix and suffix complete $y$):

$$\mathcal{L}_{xH}(\theta) = -\frac{1}{|S|} \sum_{(i,j) \in S} \Big[ \log P_L(\overrightarrow{y}_{i+1} \mid \overrightarrow{y}_{\le i}, \overleftarrow{y}_{\le j}) + \log P_R(\overleftarrow{y}_{j+1} \mid \overrightarrow{y}_{\le i}, \overleftarrow{y}_{\le j}) + \log Q_{ij} \Big],$$

where $Q_{ij}$ is defined as in Equation 4 and the token terms are included only when the corresponding next token exists (i.e. when $i + j < |y|$). Due to the size of $S$, this loss takes $O(|y|^2)$ time to train. Given that a typical unidirectional model takes $O(|y|)$ time to train, we also propose an $O(|y|)$ approach that samples from $S$; this is presented in Appendix F.
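A sketch of this loss in PyTorch follows. It assumes a model interface that returns per-direction next-token logits and a scalar join logit for a given (prefix, suffix) pair of token ids; this interface is our assumption, not the paper's or fairseq's.

```python
import torch
import torch.nn.functional as F

def xh_loss(model, x, y):
    """Average token and join cross-entropy over all valid prefix-suffix
    pairs S = {(i, j) : i + j <= |y|} of a token-id list y. The order
    probability is fixed at 0.5, so it contributes no trainable term."""
    n = len(y)
    terms = []
    for i in range(n + 1):
        for j in range(n + 1 - i):
            left_logits, right_logits, join_logit = model(x, y[:i], y[n - j:])
            target = torch.tensor(1.0 if i + j == n else 0.0)   # complete?
            terms.append(F.binary_cross_entropy_with_logits(join_logit, target))
            if i + j < n:                      # a next token exists on each side
                terms.append(F.cross_entropy(left_logits.unsqueeze(0),
                                             torch.tensor([y[i]])))
                terms.append(F.cross_entropy(right_logits.unsqueeze(0),
                                             torch.tensor([y[n - j - 1]])))
    return torch.stack(terms).mean()
```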
An alternative is to train with Maximum Marginal Likelihood (MML) (Guu et al., 2017; Min et al., 2019), which learns the order probabilities via marginalization. This is more principled because it directly optimizes $P(y \mid x)$, the quantity of interest. The loss is given by $\mathcal{L}_{MML}(\theta) = -\log P(y \mid x; \theta)$, which is calculated with the dynamic programming algorithm described in §3.2. Learning the order probabilities enables the model to assign higher probability mass to orderings it prefers and to ignore paths it finds unhelpful.
This loss also requires $O(|y|^2)$ time to train.

Decoding
The goal of decoding is to find $\hat{y} = \arg\max_y P(y \mid x)$. Unfortunately, it is not computationally feasible to use the likelihood algorithm of §3.2 to find the best sequence, even with a heuristic like beam search. Instead, we use beam search to heuristically identify the sequence $\hat{y}$ and ordering $\hat{o}$ that maximize the joint probability given by Equation 3:

$$\hat{y}, \hat{o} = \arg\max_{y, o} P(y, o \mid x).$$
Each hypothesis is a prefix-suffix pair. We start with a single hypothesis: an empty prefix and suffix, represented by start- and end-of-sentence tokens. At each timestep, every hypothesis is expanded by considering the distribution over possible actions: adding a token on the left, adding a token on the right, or joining. The $k$ best continuations are kept based on their joint probabilities. Generation stops once all hypotheses are complete (i.e. their prefix and suffix have been joined).
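The following is a minimal sketch of this search loop. The method model.actions(x, prefix, suffix) is a placeholder that yields (action, token, log_prob) triples over all legal continuations; it and its return convention are illustrative assumptions.

```python
import heapq

def beam_search(x, model, k=5, max_steps=64):
    """Bidirectional beam search over (prefix, suffix) hypotheses: expansions
    add a token on either side or join, and the k best continuations are kept
    by joint log probability."""
    beams = [(0.0, [], [], False)]             # (logp, prefix, suffix, done)
    for _ in range(max_steps):
        if all(done for _, _, _, done in beams):
            break
        cands = []
        for logp, pre, suf, done in beams:
            if done:                            # completed hypotheses carry over
                cands.append((logp, pre, suf, True))
                continue
            for action, tok, lp in model.actions(x, pre, suf):
                if action == 'L':
                    cands.append((logp + lp, pre + [tok], suf, False))
                elif action == 'R':
                    cands.append((logp + lp, pre, [tok] + suf, False))
                else:                           # 'JOIN' completes the word
                    cands.append((logp + lp, pre, suf, True))
        beams = heapq.nlargest(k, cands, key=lambda c: c[0])
    best = max(beams, key=lambda c: c[0])
    return best[1] + best[2], best[0]           # sequence and its log probability
```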

Model Architecture
Our architecture (Figure 1) is based on the character-level transformer (Wu et al., 2021), which has proven useful for morphological inflection. First, the input sequence $x$ is encoded with a typical Transformer encoder; for the inflection task, the input consists of the lemma (tokenized by character) concatenated with a separator token and the set of tags.
Given a prefix $\overrightarrow{y}_{\le i}$ and suffix $\overleftarrow{y}_{\le j}$ (as well as the encoder output), the decoder must produce each direction's token probabilities, the join probability, and the order probability. We construct the input to the decoder by concatenating the prefix and suffix tokens with four special classification tokens $c_J$, $c_O$, $c_{L2R}$, and $c_{R2L}$, which serve a purpose similar to the CLS embedding in BERT (Devlin et al., 2019). This input is fed to a Transformer decoder, and the hidden vectors at the four classification positions are each passed through their own linear layer and softmax, giving the join probability, the order probability, and the L2R and R2L token distributions, respectively. Since this architecture does have cross-attention between the prefix and suffix, the decoder hidden states for each prefix-suffix pair must be recomputed at each timestep to allow for DP (see §3.3).
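As a concrete illustration, the sketch below assembles the decoder input for one (prefix, suffix) pair. The exact ordering of the special tokens is our assumption (the extraction lost the paper's precise layout); what matters is that the four classification positions exist and feed separate output heads.

```python
def build_decoder_input(prefix_ids, suffix_ids, vocab):
    """Concatenate the classification tokens with the prefix and suffix for
    one decoder forward pass. The hidden states at the c_J, c_O, c_L2R, and
    c_R2L positions feed separate linear+softmax heads for the join, order,
    L2R-token, and R2L-token distributions, respectively. Token layout here
    is illustrative."""
    return ([vocab['c_J'], vocab['c_O']]
            + [vocab['$']] + prefix_ids + [vocab['c_L2R']]
            + [vocab['c_R2L']] + suffix_ids + [vocab['#']])
```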

Experimental Setup
Datasets. We experiment with inflection datasets for all 27 languages (spanning 9 families) from the SIGMORPHON 2023 shared task (Goldman et al., 2023). Each language has 10,000 training examples and 1,000 validation and test examples, and no lemma occurs in more than one of these partitions. We also show results on the 20 "large" languages from the SIGMORPHON 2022 shared task (Kodner et al., 2022), which has a very different sampling of examples in the train and test sets. A list of all languages can be found in Appendix A.
Tokenization. Both the lemma and the output form are split by character; the tags are split by semicolon. For the 2023 shared task, where the tags are "layered" (Guriel et al., 2022), we also treat each open and closed parenthesis as a token. Appendix B describes the treatment of unknown characters.

Model hyperparameters. Our models are implemented in fairseq (Ott et al., 2019). We experiment with small, medium, and large model sizes (ranging from ∼240k to ∼7.3M parameters). For each language, we select a model size based on the L2R and R2L unidirectional accuracies; this procedure is detailed in Appendix A.
The only additional parameters in our bidirectional model come from the embeddings of the four classification tokens (described in §4); hence, our unidirectional and bidirectional models have roughly the same number of parameters.
Training. We use a batch size of 800, the Adam optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.98$), dropout of 0.3, and an inverse square root scheduler with initial learning rate 1e−07. Training is halted if validation accuracy does not improve for 7,500 steps. All validation accuracies are reported in Appendix A.
Inference. Decoding maximizes the joint probability $P(y, o \mid x)$ using the beam search algorithm of §3.5 with width 5. In some experiments, we rerank the 5 best candidates according to their marginal probability $P(y \mid x)$, which can be calculated with dynamic programming (§3.2).

Models. We experiment with the following models (see Appendices D and F for more variants):

• L2R & R2L: Standard unidirectional transformer baselines, trained with the losses given in Equations 1 and 2.
• BL2: A naive "bidirectional" baseline that returns either the best L2R or the best R2L hypothesis, based on which has higher probability.

• xH & MML: Our bidirectional transformer (§4) trained under the cross-entropy or MML loss of §3.4 and decoded under $P(y, o \mid x)$.
• xH-Rerank & MML-Rerank: Variants that rerank the 5 candidates returned by the beam search of the xH and MML models according to their marginal probability $P(y \mid x)$.

• BL2-xH & BL2-MML: These methods select the best L2R or R2L candidate, based on which has higher marginal probability under the xH or MML model.
Empirical Results

Comparison of Methods
Accuracies averaged over languages are shown in Table 1; results by language are in Appendix D.
Baselines. BL2, which selects the higher-probability hypothesis among the L2R and R2L outputs, improves by more than 2.3 points in average accuracy over the best unidirectional model. This simple scheme serves as a stronger baseline against which to compare our fully bidirectional models.
xH & MML. Our bidirectional xH model is clearly more effective than all baselines, showing a statistically significant degradation in accuracy on only 3 languages. The MML method is far less effective, beating L2R and R2L but not BL2. MML may suffer from a discrepancy between training and inference, since inference optimizes the joint probability while training optimizes the likelihood.
xH- & MML-Rerank. Reranking according to marginal probability generally improves both bidirectional models. xH-Rerank is the best method overall, beating BL2 by over 1.75 points in average accuracy. MML-Rerank is better than either unidirectional model but still underperforms BL2.

BL2-xH & BL2-MML. Selecting the best L2R or R2L hypothesis based on its marginal probability under xH or MML is very effective. Both of these methods improve over BL2, which chooses between the same options based on unidirectional probability. BL2-xH stands out in having no statistically significant degradation on any language.
Comparison with Prior SOTA. Goldman et al. (2023) present the results of seven other systems submitted to the task; of these, five are from other universities and two are baselines provided by the organizers. The best of these systems is the neural baseline (a unidirectional transformer), which achieves an average accuracy of 81.6 points. Our best system, xH-Rerank, reaches an accuracy of 84.38 points, an improvement of over 2.7 points.

Improvement by Language
Table 1 shows that the best methods are xH-Rerank (by average accuracy) and BL2-xH (which improves upon BL2 on the most languages). Figure 3 illustrates this by showing the difference in accuracy between each of these methods and the best baseline BL2.
The plots show that the accuracy difference with BL2 has a wider range for xH-Rerank (−2.6% to 8.7%) than for BL2-xH (−0.5% to 5.8%). This is because xH-Rerank can generate new hypotheses, whereas BL2-xH merely discriminates between the same two hypotheses as BL2.

Length of Output Forms
Figure 2 shows accuracies by output form length for BL2 and our best method xH-Rerank. xH-Rerank outperforms the baseline at every length (except 10), and especially excels on longer outputs (≥ 16 characters). This may be due to the bidirectional model's decreased risk of "snowballing": it can delay the prediction of an uncertain token by generating on the opposite side first, a property unidirectional models lack.

How does generation order compare with the morphology of a word?
In this section we consider only forms that can be classified morphologically as prefix-only (e.g. will|walk) or suffix-only (e.g. walk|ed), because these words have an obvious split point. Ideally, the bidirectional model will exhibit the desired split point by decoding the left and right sides of the form from their respective directions.
We first classify all inflected forms in the test set as suffix-only, prefix-only, or neither. We do this by aligning each lemma-form pair using Levenshtein distance and taking the longest common substring of length at least 3 to be the stem. If the inflected form has only an affix attached to the stem, it is classified as prefix-only or suffix-only; otherwise, it is considered neither. Finally, Figure 5 shows the percentage of words with a clear affix on which each bidirectional model produces the correct analysis. A correct analysis occurs when the model joins the left and right sequences at the correct split point and returns the correct word.
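Before turning to the results, here is a sketch of the classification step above. We use difflib's longest common substring as the stem, whereas the paper aligns with Levenshtein distance; the function name and threshold parameter are our own.

```python
from difflib import SequenceMatcher

def classify_form(lemma, form, min_stem=3):
    """Label an inflected form prefix-only, suffix-only, or neither, taking
    the longest common substring of lemma and form as the stem (an
    approximation of the paper's Levenshtein-based alignment)."""
    match = SequenceMatcher(None, lemma, form).find_longest_match(
        0, len(lemma), 0, len(form))
    if match.size < min_stem:
        return 'neither'
    before = form[:match.b]                 # material left of the stem
    after = form[match.b + match.size:]     # material right of the stem
    if before and not after:
        return 'prefix-only'
    if after and not before:
        return 'suffix-only'
    return 'neither'
```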
It is immediately obvious that the MML models tend to exhibit the correct analysis, while the xH models generally produce the wrong analysis. This makes sense because MML learns the latent ordering variable, unlike cross-entropy. Despite MML's success at learning this morphology, it tends to have lower accuracy than xH; we explore this by breaking down accuracy by word type in Figure 6.
Learning the ordering appears harmful when there is no obvious affix: compared with BL2, MML barely drops in accuracy on prefix- and suffix-only forms but degrades greatly when there is no clear split. The xH model, which does not learn ordering, improves in all categories.
We conclude that MML models better reflect the stem-affix split than cross-entropy models but have lower accuracy.Improving the performance of MML models while maintaining their linguistic awareness is a promising direction for future work.

Ablation Study: Does bidirectional decoding help?
In this section, we analyze to what extent the bidirectional models' improvement is due to their ability to produce tokens from both sides and meet at any position. To this end, we force our trained xH and MML models to decode in a fully L2R or R2L manner by setting the log probabilities of tokens in the opposite direction to −∞ at inference time.
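A minimal sketch of this masking, applied to the logits of the decoder's output heads (the function name and interface are ours):

```python
import torch

NEG_INF = float('-inf')

def force_direction(left_logits, right_logits, join_logit, direction):
    """Force unidirectional decoding at inference time: to decode L2R, all
    R2L token log probabilities are set to -inf (and vice versa), so beam
    search can only extend one side or join."""
    if direction == 'L2R':
        right_logits = torch.full_like(right_logits, NEG_INF)
    else:
        left_logits = torch.full_like(left_logits, NEG_INF)
    return left_logits, right_logits, join_logit
```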
The results are shown in Table 4. The bidirectional models perform poorly when not permitted to decode from both sides. This is particularly detrimental for the MML model, which is expected, as the marginalized training loss enables the model to assign low probabilities to some orderings. Clearly, our MML model does not favor unidirectional orderings.
The xH model, on the other hand, does not suffer as much from unidirectional decoding. Since it was trained to treat all orderings equally, we would expect it to do reasonably well under any given ordering. Nonetheless, it still drops by about 7 points for L2R decoding and about 13 points for R2L decoding. This shows that the full bidirectional generation procedure is crucial to the success of this model.

Results on 2022 Shared Task
We also train our bidirectional cross-entropy model on the 2022 SIGMORPHON inflection task (Kodner et al., 2022), which, unlike the 2023 data, does have lemmas that occur in both the train and test sets. The results are shown in Table 2. All of our methods (including the baselines) outperform the best submitted system (Yang et al., 2022) on the 2022 data; our best method, BL2-xH, improves by over 4.7 points in average accuracy. However, only BL2-xH outperforms the baseline BL2 (barely), in stark contrast to the 2023 task, where all cross-entropy-based methods beat the baseline considerably. To make the comparison between the years fairer, we evaluate the 2022 models only on test lemmas that did not occur in training. Again, only BL2-xH outperforms the baseline, this time by a wider margin; xH and xH-Rerank still underperform.
We posit that this discrepancy is likely due to the considerably different properties of the 2022 and 2023 datasets, shown in Table 3. The 2023 languages have far fewer unique lemmas and many more forms per lemma. Hence, our bidirectional model seems to improve much more over the baseline when there are fewer but more "complete" paradigms.
This investigation shows that the performance of inflection models depends substantially on the data sampling, which is not always controlled for. Kodner et al. (2023) makes progress on this matter, but does not explicitly examine paradigm "completeness", which should be a focus in future studies.

Table 4: Ablation study on the 2023 dataset. Macro-averaged accuracies for bidirectional models decoded using the method of §3.5 (Bidi), or when forced to decode in an L2R or R2L manner. Bidi-2 indicates the outcome when selecting between the two forced unidirectional decodings based on which has higher probability. The unidirectional rows (Uni) give the accuracies of the standard unidirectional transformers and BL2.

Conclusion
We have proposed a novel framework for bidirectional decoding that allows a model to choose the generation order for each sequence, a major difference from previous work. Further, our method enables an efficient dynamic programming algorithm for training, which arises from an independence assumption that can be built into our transformer-based architecture. We also present a simple beam-search algorithm for decoding, whose outputs can optionally be reranked using the likelihood calculation. Our model beats SOTA on both the 2022 and 2023 shared tasks without resorting to data augmentation. Further investigations show that our model is especially effective on longer output words and can implicitly learn the morpheme boundaries of output sequences.
There are several avenues for future research. One open question is the extent to which data augmentation can improve accuracy. We also leave open the opportunity to explore our bidirectional framework on other sequence tasks, such as machine translation, grapheme-to-phoneme conversion, and named-entity transliteration. Various other architectures could also be investigated, such as the bidirectional attention mechanism of Zhou et al. (2019b) or non-transformer-based approaches. Finally, given the effectiveness of MML reranking, it could be worthwhile to explore efficient approaches to decoding with marginal probability.

Limitations
We acknowledge several limitations of our work. For one, we only demonstrate experiments on the inflection task, which is fairly straightforward in some ways: there is typically only one output for a given input (unlike translation, for example), and a large part of the output is copied from the input. It would be informative to test the efficacy of our bidirectional framework on more diverse generation tasks, such as translation or question answering.
From a practical standpoint, the most serious limitation is that, in order to use dynamic programming, the model cannot be trained with a causal mask: all hidden states must be recomputed at each timestep. Further, our xH and MML schemes are quadratic in the sequence length. Together, these two properties make the training time of our bidirectional method $O(|y|^4)$ rather than $O(|y|^2)$ (as for the standard transformer). Alleviating these constraints would enable a wider variety of experiments on tasks with longer sequences.

A Datasets, Hyperparameter Tuning, & Validation Accuracies
The languages in the SIGMORPHON 2022 and 2023 datasets are listed in Tables 7 and 8. We experiment with small, medium, and large model sizes for each language; their configurations and approximate parameter counts are given in the appendix. For each language, we train L2R and R2L models (with random initialization) for each hyperparameter size (a total of 6 models per language), and select a size based on the average of the L2R and R2L validation accuracies. The model sizes chosen for each language, along with each language's validation accuracies, are reported in Tables 13 and 14.
Note that the number of parameters varies slightly among languages due to different vocabulary sizes (i.e. the number of unique characters in the training set), and the bidirectional models have a small number of extra parameters due to the additional classification tokens described in §4.

B Handling Unknown Characters
If an unknown character is encountered in a lemma at test time, a special UNK character is used; however, this character is not explicitly trained. If an UNK character is predicted by the model, we replace it with the first (leftmost) unknown character in the lemma; if no such character exists, the prediction is ignored.
We adopt a special scheme for Japanese, which has a very large number of unknown characters. All characters that occur fewer than 100 times in the training set are considered "unknown". If a lemma has $n$ unknown tokens, these are replaced with $\mathrm{UNK}_1, \dots, \mathrm{UNK}_n$; the corresponding tokens in the inflected form are replaced as well. In this way, the model can learn to copy rare or unknown characters to their appropriate locations in the output. At test time, each predicted unknown token is replaced with its corresponding character in the lemma.
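A sketch of this indexed-UNK substitution follows; the function name and the cap on distinct UNK tokens are our own choices.

```python
def index_unknowns(lemma, form, vocab, max_unks=8):
    """Replace characters outside the (frequency-thresholded) vocabulary
    with UNK_1 ... UNK_n, in order of first appearance in the lemma, and
    apply the same substitution to the target form. The returned mapping
    can be inverted to restore predicted UNK tokens at test time."""
    mapping = {}
    for ch in lemma:
        if ch not in vocab and ch not in mapping and len(mapping) < max_unks:
            mapping[ch] = f'UNK_{len(mapping) + 1}'
    substitute = lambda s: [mapping.get(ch, ch) for ch in s]
    return substitute(lemma), substitute(form), mapping
```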

C Tempering the Order Distribution at Train Time
Initial empirical results showed that training with the MML loss caused the model to quickly reach a "degenerate" state in which every sequence was decoded in the same direction. To encourage the model to explore different orderings early in training, we temper the order probabilities over a warmup period: the temperature decays from an initial value $\tau_0$ to 1 over a period of $W$ steps, where a parameter $a$ controls how fast the shift occurs and $n$ denotes the training step. This temperature is applied to the softmax over order probabilities for the first $W$ steps of training.
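The exact schedule was lost in extraction; the sketch below is one plausible polynomial decay consistent with the description (temperature falling from tau0 to 1 over W steps, with a controlling the speed of the shift), not the paper's actual formula.

```python
def order_temperature(n, tau0=2.0, W=4000, a=2.0):
    """Assumed tempering schedule for the order-probability softmax:
    returns tau0 at step n = 0, decays polynomially (exponent a), and
    reaches 1 at step W, after which tempering is disabled."""
    if n >= W:
        return 1.0
    return 1.0 + (tau0 - 1.0) * ((1.0 - n / W) ** a)
```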

D All Results
The accuracies for all languages in our study are shown in Table 9 (2023 data) and Table 10 (2022 data). These tables also display L2R-Rerank (which reranks the 5 candidates from the L2R model's beam search under the cross-entropy or MML model), R2L-Rerank, and (L2R+R2L)-Rerank (which reranks the 10 candidates returned by the L2R and R2L beam searches under the cross-entropy or MML model).

E Oracle Scores
Table 11 shows the oracle score for each method; this gives an upper bound for choosing among a set of hypotheses. We see that both xH-Rerank and BL2-xH approach their respective bounds: the average accuracy of xH-Rerank is within 1 point of its oracle score, and the average accuracy of BL2-xH is within 2 points of its oracle score.

F Cross-entropy with Random Path (xH-Rand)
The cross-entropy loss presented in §3.4 requires enumerating all $O(|y|^2)$ prefix-suffix pairs. Here, we propose an $O(|y|)$ variant in which the join loss is averaged over a random set of prefix-suffix pairs for each word. Specifically, the set $S$ is defined such that there is exactly one $(i, j)$ pair with $i + j = k$ for each $1 \le k \le |y|$. Otherwise, this loss $\mathcal{L}_{xH\text{-Rand}}(\theta)$ is the same as the cross-entropy loss of §3.4. Since this loss has an $O(|y|)$ runtime, it has the same complexity as a standard unidirectional loss (assuming all local probabilities take constant time to compute). Table 12 compares the accuracies of this model with the other bidirectional variants discussed in §6. Reranking xH-Rand is slightly better than not reranking, and it performs well: its average accuracy is almost 1 percentage point higher than BL2, and it improves on 15/27 languages. xH-Rand is better than MML but not as good as xH. Nonetheless, its faster runtime and competitive performance make it a useful method.
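A minimal sketch of the pair sampling described above (the function name is ours):

```python
import random

def sample_pairs(n):
    """For each total length 1 <= k <= n, draw a single (i, j) with
    i + j = k uniformly at random, so the loss touches O(|y|) prefix-suffix
    pairs per word instead of all O(|y|^2)."""
    pairs = []
    for k in range(1, n + 1):
        i = random.randint(0, k)    # 0 <= i <= k, so j = k - i >= 0
        pairs.append((i, k - i))
    return pairs
```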

G.1 Accuracy by Length
Figure 2 in §7.1 compares the accuracy of our bidirectional method xH-Rerank with that of the baseline BL2 by the length of the output form. Figure 7 shows a similar comparison between BL2-xH (our other best method) and BL2; consistent with the analysis of §6.2, there is less of a difference between these methods, but BL2-xH equals or outperforms BL2 at all lengths.
Figure 8 shows the distribution of output form length across all languages.

G.2 Accuracy by Part-of-Speech
Figures 10 and 9 compare the accuracies of xH-Rerank and BL2-xH (our best bidirectional methods) with the accuracy of BL2 by part of speech. We see that xH-Rerank maintains or improves accuracy over BL2 in all categories except V.MSDR (masdars), and BL2-xH maintains or improves accuracy in all categories except V.MSDR and V.PTCP (participles). These categories make up a small fraction of the data, as can be seen in Figure 11, which shows the distribution of part-of-speech categories across all languages.

G.3 What orderings does each method prefer?
In this section, we investigate the ordering preferences of each method: does a model prefer to decode words entirely in the L2R or R2L direction, or partially in each direction? These results are shown for each language in Figure 12.
Both the xH and MML methods have a strong tendency to decode words partially in each direction; however, the MML models clearly decode a higher proportion of words from both directions than their xH counterparts. Among words decoded entirely in one direction, the xH model shows a slight preference for R2L generations, though most languages have words decoded in both directions. For the MML model, on the other hand, no language shows a preference for R2L generations over L2R generations; in fact, R2L generations are extremely rare for the MML models.

G.4 Empirical Inference Times
Given that our bidirectional model must recompute previous hidden states at each timestep during inference (see §4), we compare the empirical slowdown in decoding of our bidirectional models relative to unidirectional models. The average number of seconds taken to decode 50 examples is shown in Table 6.
Recomputing hidden states at each step slows down inference by a factor of about 3. In practice, however, we barely notice the difference on this task, as the test sets have only 1,000 examples each. Given the strong outperformance of the bidirectional methods over the unidirectional baselines (and even over the naive bidirectional baseline BL2), we consider this a worthwhile tradeoff between time and performance.

Figure 1: Architecture of the bidirectional decoding model. Depicts the token inputs for the verb walked at timestep $t = 3$ with $\overrightarrow{y}_{\le 2} = \$wa$ and $\overleftarrow{y}_{\le 1} = d\#$. All inputs are surrounded by a rectangle.

Figure 2: Accuracies of xH-Rerank and BL2 by Word Length. Average accuracies of the BL2 and xH-Rerank models over all languages, grouped by length (number of characters) of the output form.

Figure 3: Accuracy Improvement by Language. Difference in accuracy between our best models (xH-Rerank and BL2-xH) and our best baseline BL2.

Figure 4: Morphology of words in the test set. Percentage of forms that are suffix-only, prefix-only, or neither in the test set for each language.

Figure 5: Analysis for prefix- and suffix-only words. Percentage of forms for each training method that (1) are correct and whose ordering agrees with the form's morphology; (2) are correct but whose ordering does not agree with the form's morphology; and (3) are incorrect.

Figure 6: Accuracy of models by word type. Accuracy on words that are suffix- or prefix-only, or neither.

Figure 4 shows the percentage of words that are prefix-only, suffix-only, or neither for each language. Most languages favor suffix-only inflections, although Swahili strongly prefers prefixes, and several other languages have a high proportion of words without a clear affix.

Figure 7: Accuracies of BL2-xH and BL2 by Word Length. Average accuracies of the BL2 and BL2-xH models over all languages, grouped by length (number of characters) of the output form.

Figure 8: Number of Test Examples by Length. Number of test examples across all languages by the number of characters in the (correct) output form.

Figure 9: Accuracies of BL2-xH and BL2 by Part of Speech. Accuracies of the BL2 and BL2-xH models averaged over all languages, grouped by part of speech.

Figure 10: Accuracies of xH-Rerank and BL2 by Part of Speech. Accuracies of the BL2 and xH-Rerank models averaged over all languages, grouped by part of speech.

Figure 11: Number of Test Examples by Part of Speech. Number of test examples across all languages by part of speech.

Figure 12: Ordering choices. Percentage of examples for each language and training method that are decoded fully L2R, fully R2L, or partially in each direction.

Table 1: Accuracies of Methods. Accuracy averaged over all languages in the SIGMORPHON 2023 shared task, and the number of languages whose accuracy equals or exceeds (≥) that of the best baseline BL2. The entry Goldman et al. (2023) shows the accuracy of the next best system submitted to the shared task. Also shown are the number of languages with a statistically significant improvement (>) or degradation (<), or no statistically significant change (=), in accuracy compared with BL2, using a paired-permutation test (Zmigrod et al., 2022) with α = 0.05. The best entry in each column is bold. See Table 9 in Appendix D for results by language.

Table 2: Comparison of 2022 and 2023 results.

Table 6: Inference times. Average time to perform inference on 50 test examples, averaged over all languages in the 2023 dataset, on 4 NVIDIA V100 GPUs.

Table 7: 2023 Dataset Information. Information on each language in the 2023 dataset, including language family and genus, baseline accuracies on the test set, and the model size chosen (based on validation accuracies).

Table 9: All Accuracies (2023 data). A number is starred (*) if it shows a statistically significant difference from the best baseline BL2 using a paired permutation test (Zmigrod et al., 2022); a number is colored green if it improves over BL2 (regardless of significance); and a number is bold if it is the best for the language.

Table 10: All Accuracies (2022 data). A number is colored green if it improves over BL2, and a number is bold if it is the best for the language.

Table 11: Oracle Accuracies (2023 data). Accuracies of each method if an oracle were used to select among the hypotheses returned by beam search. In the case of BL10, the oracle chooses among the 10 candidates returned by the L2R and R2L beam searches; in the case of BL2, the oracle chooses between the best L2R and the best R2L hypothesis.

Table 12: Random Cross-Entropy Accuracies (2023 data). A number is colored green if it improves over BL2, and a number is bold if it is the best for the language.

Table 13: Validation Accuracies (2023 data). Validation accuracies for each language in the 2023 dataset. Validation accuracies of the unidirectional models are used for hyperparameter selection. The bidirectional validation accuracies (xH, xH-Rand, MML) are reported for the chosen model size for each language.