Enhancing the Transformer Decoder with Transition-based Syntax

Notwithstanding recent advances, syntactic generalization remains a challenge for text decoders. While some studies showed gains from incorporating source-side symbolic syntactic and semantic structure into text generation Transformers, very little work addressed the decoding of such structure. We propose a general approach for tree decoding using a transition-based approach. Examining the challenging test case of incorporating Universal Dependencies syntax into machine translation, we present substantial improvements on test sets that focus on syntactic generalization, while presenting improved or comparable performance on standard MT benchmarks. Further qualitative analysis addresses cases where syntactic generalization in the vanilla Transformer decoder is inadequate and demonstrates the advantages afforded by integrating syntactic information.


Introduction
In parallel to the impressive achievements of large neural networks in a variety of NLP fields, more and more work emphasizes the importance of the inductive biases models possess and the types of generalizations they make (Welleck et al., 2021;Csordás et al., 2021;Ontanón et al., 2021). Syntactic generalization has been repeatedly identified as a problem in text generation (Linzen and Baroni, 2020;Hu et al., 2020), an issue that we address here. Importantly, language models may fail, sometimes unexpectedly, on constructions that can be reliably parsed using standard syntactic parsers. In this work, we propose a method for incorporating syntax into the decoder to assist in mitigating these challenges, focusing on NMT as a test case.
The use of (mostly syntactic) structure in machine translation dates back to the early days of the field (Lopez, 2008). While focus has shifted to string-to-string methods since the introduction of neural methods, considerable work has shown gains from integrating linguistic structure into NMT and text generation technologies. We briefly survey such methods in §7.
Incorporating target-side syntax has been less frequently addressed than source-side syntax, possibly due to the additional conceptual and technical complexity it entails, as it requires jointly generating the translation and its syntactic structure. In addition to linearizing the structure into a string, which allows easily incorporating source and target structure (Aharoni and Goldberg, 2017b), several works generated the nodes of the syntactic tree using RNNs (Gū et al., 2018). Others have shown gains from multi-task training of a decoder with a syntactic parser (Eriguchi et al., 2016). However, we are not aware of any Transformer-based architecture to support the integration of target-side structure in the form of a tree or a graph. Addressing this gap, we propose a flexible architecture for integrating graphs into a Transformer decoder.
Our approach is based on predicting the output tree as a sequence of transitions ( §3), following the transition-based tradition in parsing (Nivre, 2003, and much subsequent work). The method (presented in §4) is based on generating the structure incrementally, as a sequence of transitions, as is customary in transition-based parsers. However, unlike standard linearization approaches, our proposed decoder re-encodes the intermediate graph (and not only the generated tokens), thus allowing the decoder to take advantage of the hitherto produced structure in its further predictions.
In §2, we discuss the possibilities offered by such decoders, which auto-regress not only on their previous outputs, but also on (symbolic) structures defined by those outputs. Indeed, a decoder thus built can condition both on information it did not predict (e.g., external knowledge bases) and on information predicted later on. We introduce bidirectional attention into the decoder, which allows token representations to encode subsequently predicted tokens. This is similar to the bidirectional attention in the encoder, where any token can attend to any other token, and not only to preceding ones.
Our architecture is flexible, supporting decoding not only into trees, but into any graph structure for which a transition system exists. We test two architectures for incorporating the syntactic graph. One inputs the graph into a Graph Convolutional Network (GCN; Kipf and Welling, 2016), and another dedicates an attention head to point at the syntactic parent of each token, which does not yield any increase in the number of parameters.
We assess in §6 the impact of the proposed architecture on syntactically challenging translation cases (Choshen and Abend, 2019) and in general. We experiment with a 4-layer model in three target languages, and a 6-layer model on En-De; due to the high computational cost, we experiment with the larger model on a single language pair only. We find that on the syntactic challenge sets proposed by Choshen and Abend (2019), the proposed decoder achieves substantial improvements over the vanilla decoder, which do not diminish (and even slightly grow) when increasing the size of the model. In addition, evaluating on the standard MT benchmarks, we find that the syntactic decoders outperform the vanilla Transformer for the smaller model size on all examined language pairs: on the English-German (En-De) and German-English (De-En) challenge sets and on the En-De, De-En and English-Russian (En-Ru) test sets, and obtain results comparable to the vanilla when experimenting with a larger model on En-De. Finally, we analyse the different modifications in isolation, finding that the ablated versions' performance resides between the full model and the vanilla decoder.

Decoding Approach
Example 1: "Jo@@ hn put the coals out", with the UD edges root → put, nsubj(put → John), det(coals → the), obj(put → coals) and compound:prt(put → out). Target-side structure reduces the ambiguity of "put". De source: "John löschte die Kohlen" (lit. John put-out the coals).
Disambiguating and connecting distant words is a known challenge in NMT (Avramidis et al., 2020). In Example 1, to disambiguate "put" as having the sense "extinguish" rather than "lay", "out" must be considered. To achieve this from the autoregressed output, the decoder's representation may need to be re-computed after predicting "out". We note that while source-side information can potentially be used to disambiguate "put", it may still be beneficial to enhance the auto-regressive decoder with disambiguating information. Current implementations impose an architectural bias: a decoded token's representation may not attend to future tokens. Transformer models mask attention in the following manner (we did not find any alternative implementations): token embeddings attend only to previously generated tokens, even when the following tokens are already known. This practice "ensures that the predictions for position i can depend only on the known outputs at positions less than i" (Vaswani et al., 2017).
We propose to allow attending to any known token (Fig. 1), as done on the encoder side. Due to its conceptual resemblance to bidirectional RNNs, we name this the Bidirectional Transformer, or biTran.
Formally, let o_1 . . . o_n be the hitherto predicted sequence and d the maximum sentence length. Attention is softmax(L + M), where L ∈ R^{d×d} are the logits and M ∈ R^{d×d} is a mask. Hence, M(i, j) = −∞ masks token j from representation i.
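As a concrete illustration, the vanilla causal mask and the biTran mask can be sketched as follows (a minimal pure-Python sketch; the function names and the use of a float infinity are our own, not taken from the paper's NEMATUS-based implementation):

```python
import math

NEG_INF = float("-inf")  # stands for the -infinity entries of M

def decoder_mask(n, d, bidirectional):
    """Additive attention mask M in R^{d x d} over a prefix of n known
    tokens; M[i][j] = -inf hides token j from representation i."""
    M = [[NEG_INF] * d for _ in range(d)]
    for i in range(d):
        # biTran: attend to every known token; vanilla: only to j <= i.
        limit = n if bidirectional else min(i + 1, n)
        for j in range(limit):
            M[i][j] = 0.0
    return M

def masked_softmax_row(logits, mask_row):
    """softmax(L + M) for a single attention row."""
    scores = [l + m for l, m in zip(logits, mask_row)]
    mx = max(scores)
    if mx == NEG_INF:  # fully masked row (position not yet generated)
        return [0.0] * len(scores)
    exps = [math.exp(s - mx) if s != NEG_INF else 0.0 for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```

With n = 3 known tokens, the representation at position 0 may attend to positions 1 and 2 under the bidirectional mask, but only to itself under the vanilla one.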
This change does not introduce any new parameters or hyperparameters, but still increases the expressivity of the model. We note, however, that this modification does prevent some commonly implemented speed-ups relying on unidirectionality (e.g., in NEMATUS). Apart from the technical contribution, we emphasize that this and the following approaches take advantage of attention-based models being stateless. Transformers can, therefore, be viewed as conditional language models, namely as models for producing a distribution for the next word, given the generated prefix and source sentence. Viewing them as such opens possibilities that were not native to RNNs, such as predicting only partial outputs and conditioning on per-token or non-autoregressed context (see App. A).

Transition-based Structure Generation
We turn to describe how we represent structure within the proposed decoder.
We generate the target-side structure with a transition-based approach, motivated by the practical strength of such methods, as well as their sequential nature, which fits neural decoders well. We therefore augment the vocabulary with transitions. Our work is inspired by RNNG (Dyer et al., 2016), a conceptually similar architecture that was developed for RNNs. At each step, the input to the decoder includes the tokens and the parse graph generated thus far. As edges and their tokens are not generated simultaneously (but rather by different transitions; see below), we rely on bidirectional attention to update the past embeddings when a new edge connects previously generated tokens. In this section, we present the syntactic transitions, and in the next (§4), the ways we incorporate the generated structure back into the model.
In this work, we represent syntax through Universal Dependencies (UD; Nivre et al., 2016), but note that other syntactic and semantic formalisms that have transition-based parsers (Hershcovich et al., 2018;Stanojević and Steedman, 2020;Oepen et al., 2020) fit the framework as well. We select UD due to its support for over 100 languages and its status as the de facto standard for syntactic representation.
We base our transition system on arc-standard (Nivre, 2003), which can produce any projective tree. Both systems contain transitions connecting two words by a labeled edge. However, we replace SHIFT, which reads the next word, with SUBWORD_t, which generates a new sub-word t. Sub-words are generated successively until a full word is formed. To avoid suboptimal representation of transition tokens, we add the edges going through them to the graph (e.g., the edge LEFT-ARC:det −det→ the).
We denote with f the transition functions updating a word stack Σ and the labeled graph G. If a, b are the top and second words in Σ respectively, and x a transition, then f(x; Σ) is defined as follows: SUBWORD_t generates the sub-word t, pushing a new word onto Σ when t starts one; LEFT-ARC:l adds the edge a −l→ b and pops b from Σ; RIGHT-ARC:l adds the edge b −l→ a and pops a. For brevity, we denote an edge from/to every sub-word of a as an edge from/to a. Overall, the transition sequence creating the graph in Example 1 is: Jo@@ hn put LEFT-ARC:nsubj the coals LEFT-ARC:det RIGHT-ARC:obj out RIGHT-ARC:compound:prt (more details in App. B).
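The transition function can be sketched as follows (a hypothetical Python sketch of the arc-standard variant described above; the sub-word bookkeeping is simplified relative to the actual implementation):

```python
def apply_transitions(seq):
    """Run a transition sequence; return the labeled edges of the graph.

    SUBWORD transitions appear as plain sub-word tokens (a trailing '@@'
    is the BPE continuation marker); LEFT-ARC:l adds a -l-> b and pops b,
    RIGHT-ARC:l adds b -l-> a and pops a, for a, b the top and second
    words on the stack.
    """
    stack, edges = [], []
    word_open = False  # is the top word still awaiting sub-words?
    for x in seq:
        if x.startswith("LEFT-ARC:"):
            label = x.split(":", 1)[1]
            a, b = stack[-1], stack[-2]
            edges.append((a, label, b))  # head a, dependent b
            del stack[-2]
        elif x.startswith("RIGHT-ARC:"):
            label = x.split(":", 1)[1]
            a, b = stack[-1], stack[-2]
            edges.append((b, label, a))
            stack.pop()
        else:  # SUBWORD_t: generate the sub-word t
            piece, cont = (x[:-2], True) if x.endswith("@@") else (x, False)
            if word_open:
                stack[-1] += piece  # extend the unfinished word
            else:
                stack.append(piece)  # start a new word
            word_open = cont
    return edges
```

Running this on the transition sequence for Example 1 yields the four edges nsubj, det, obj and compound:prt, headed as in the example.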

Regressing on Generated Structure
As discussed in §2, the state-less nature of the Transformer allows re-encoding not only the previous predictions, but any information that can be computed based on them. So far, we proposed to autoregress on the syntactic structure, token by token. However, as f is deterministic, learning to emulate it is pointless. Instead, we can autoregress on the generated graph itself, G = f(o_1 . . . o_n), as well as on the decoder outputs o_1 . . . o_n.
Our approach is modular and works with any graph encoding method. We experiment with two prominent methods for source-side graph encoding.
GCN Encoder. Graph Convolutional Networks (GCN; Kipf and Welling, 2016) are a type of graph neural network. GCNs were used successfully by previous work to encode source-side syntactic and semantic structure for NMT (Bastings et al., 2017;Marcheggiani et al., 2018). The GCN layers are stacked immediately above the embedding layer.
The GCN contains weights per edge type and label, as well as gates that allow placing less emphasis on the syntactic cue if the network so chooses. Gating is assumed to help against noisy structure, which machine output is expected to be. See the ablation experiments in §6.3 for an assessment of the impact of gating.
Following Kipf and Welling (2016), we introduce 3 edge types: Self, from a token to itself; Left, to the parent tokens; and Right, from the parents.
A GCN layer over input layer h, a node v and a graph G with node representations of size d, with activation ρ, edge directions dir, labels lab, and a function N mapping a node in G to its neighbors is

h'_v = ρ( Σ_{u ∈ N(v)} g_{u,v} (W_{dir(u,v)} h_u + b_{lab(u,v)}) )

where g_{u,v} is the applied gate:

g_{u,v} = σ( h_u · ŵ_{dir(u,v)} + b̂_{lab(u,v)} )

where σ is the logistic sigmoid function and W, b, ŵ, b̂ are the learned parameters for the GCN.
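In code, a single gated, labeled GCN layer amounts to the following (a hypothetical pure-Python sketch of the equations above, with tanh as the activation ρ; the paper's TensorFlow implementation is sparse and memory-optimized, so details differ):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, x):
    """Matrix-vector product over plain lists."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def gcn_layer(h, edges, W, b, w_gate, b_gate):
    """h'_v = tanh(sum_{u in N(v)} g_{u,v} (W_dir h_u + b_lab)),
    with the gate g_{u,v} = sigmoid(h_u . w_gate_dir + b_gate_lab).

    h:     list of n node vectors of size d.
    edges: (u, v, dir, lab) tuples, dir in {"self", "left", "right"};
           self-loops must be listed explicitly.
    W, w_gate are indexed by direction; b, b_gate by label id.
    """
    n, d = len(h), len(h[0])
    out = [[0.0] * d for _ in range(n)]
    for u, v, direction, label in edges:
        g = sigmoid(sum(a * c for a, c in zip(h[u], w_gate[direction]))
                    + b_gate[label])
        msg = matvec(W[direction], h[u])
        for k in range(d):
            out[v][k] += g * (msg[k] + b[label][k])
    return [[math.tanh(x) for x in row] for row in out]
```

Note that the label only enters through the bias terms b and b_gate, which is the point made about labels in the ablation discussion.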
Attending to Parent Token. The second re-encoding method we test, PARENT, dedicates an attention head only to the parent(s) of the given token. Commonly, the parent is given by an external parser (Hao et al., 2019) or learned locally in each layer, to focus the attention (Strubell et al., 2018). Unlike such approaches, we define the parents by the self-generated graph. To allow ignoring the parent when preferable, or when no parent was generated, we also allow attending to the current token. To recap, for a token o_i, we mask all but o_i and its parents. PARENT differs from GCN considerably. On the one hand, PARENT requires minimal architectural changes and no additional hyperparameters. It also affects different network parts: some attention heads, rather than an additional embedding. On the other hand, only GCN represents the labels and the whole graph, in particular the children. By considering both architectures, we show that graph methods for the encoder (Bastings et al., 2017) may be easily adapted to the decoder, demonstrating the flexibility of the proposed framework.
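The PARENT masking rule is simple to state in code (an illustrative sketch of our own; in the model, this boolean pattern determines the additive mask of one decoder attention head):

```python
def parent_head_mask(n, parents):
    """Boolean attention pattern for the PARENT head: token i may attend
    only to itself and to its parent(s) in the self-generated graph.

    `parents` maps a token index to the indices of its parent tokens;
    tokens without a generated parent fall back to self-attention only.
    """
    allowed = [[False] * n for _ in range(n)]
    for i in range(n):
        allowed[i][i] = True  # always allow the current token
        for p in parents.get(i, []):
            allowed[i][p] = True
    return allowed
```

For Example 1, once RIGHT-ARC:obj has connected "coals" to "put", the PARENT head at the "coals" position attends only to "put" and to itself.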

Experimental Setup
Metrics. We report both BLEU (Papineni et al., 2002) and chrF+ (Popovic, 2017) and note that chrF+ has been deemed more reliable for current technology (Ma et al., 2019).
Unable to identify a preexisting implementation, we implemented labeled sparse GCNs with gating in TensorFlow. The implementation mostly focused on memory considerations, and was optimized for runtime where possible. More on implementation details, filtering and preprocessing in App. B.
Language Pairs. We experiment on 3 language pairs with 3 target languages: English (De-En), German (En-De) and Russian (En-Ru). We use the WMT16 data (Bojar et al., 2016) for En-De, and either the clean News commentary or the full noisy WMT20 data (Barrault et al., 2020) for En-Ru.
Test sets. Newstest 2012 served as a development set. To measure the overall system performance we used newstest 2013-15.
To test syntactic generalization, we used the challenge sets of Choshen and Abend (2019). These are sub-sets of the books and newstest corpora in En↔De, automatically filtered by a syntactic parser to contain lexical long-distance dependencies, i.e., sentences where two or more non-consecutive words correspond to a single word. E.g., "put ... out" in Example 1 corresponds to the German "löschte" (see also Example 2). Previous work has shown such phenomena to be challenging for present-day NMT systems.
Improving the automatic measures on one such challenge set indicates better performance on a specific phenomenon, while better overall challenge set performance implies better handling of lexical long-distance dependencies.
The various challenge set settings are represented as a 3-tuple (dir, p, dom), corresponding to the direction, inspected phenomenon and domain. Direction can be either "source" or "target", indicating whether the long-distance dependency is in the source or the target (i.e., in the reference). A more effective representation of the target-side syntax should improve target challenges and potentially also the source side's.

Example 2:
Source: der Gruppe, an die sich der Plan richtet
Gloss: the group to which himself the plan aims
Ref.: the group to whom the plan is aimed
PARENT: the group to which the plan is aimed
Vanilla: the group aimed at the plan

Results
We compare the syntactic generalization abilities of the different decoders in §6.1, and continue by examining their overall performance ( §6.2). We then assess the contribution of the components of the system through ablation experiments ( §6.3) and evaluate the effects of noisy training data ( §6.4).

Syntactic Generalization
We evaluate the syntactic generalization abilities of the models using the syntactic challenge sets. Results (Table 1) show that the medium PARENT (GCN) improves over the Vanilla in 18 (20) of the 20 target challenge settings and 19 (19) of the 20 source challenges. The large model improves in 18/20 of the challenges, and the gains seem similar or even larger. The latter result suggests that simply using larger models is unlikely to close these gaps in syntactic generalization (see also App. E). With the large models, PARENT performs comparably to the vanilla (Table 2b), despite the superior results it obtains on syntactic generalization.

Ablation Experiments
In order to better understand the contribution of different parts of the architecture and to compare them, we consider ablated versions (see Table 2 and App. E). Differences are small but consistent. In one, Linearized, we train the vanilla Transformer over the transitions, linearized to a string, without encoding the graph through GCN or attention. This is reminiscent of the approach taken by Aharoni and Goldberg (2017b), albeit with a different form of linearization. Results place Linearized in a clear position: consistently better than the structure-unaware models, but not as good as the structure-aware ones.
We turn to experiment with ablated versions of the GCN decoder. Unlabeled ignores the labels and relies only on the graph structure, while Ungated also removes the gate g. Gating was hypothesized to be important to avoid over-reliance on erroneous edges (Bastings et al., 2017; Hao et al., 2019). As our graphs are generated by the network, rather than fed into it by an external parser, this is a good place to test this hypothesis.
Comparing GCN with and without labels, we find their contribution to be limited. Despite some improvement in overall BLEU, as often as not, Unlabeled is better on the challenges. We advise caution, however, in interpreting these results, as they do not necessarily indicate that syntactic labels are redundant. There are two technical points to consider. First, the labels' role in GCNs is small: they contribute many parameters, while only affecting a bias term. Presumably, this is an inefficient use that should be addressed in future work. Second, the labels are also incorporated through the transitions, and hence have token embeddings, which could compensate for the disregard of labels.
Unlike labels, gating appears to be crucial. The Ungated scores are lower than the Unlabeled variant's in 34/40 challenges. This might indirectly support the hypothesis that gating aids with erroneous parses. It also hints that introducing similar mechanisms into PARENT may be beneficial.
Even BiTran provides a small (up to .28 BLEU, .42 chrF+) but consistent improvement. Indeed, it outperforms the vanilla on average and in 10/12 scores in each pair. We observe a similar trend in the challenge sets. In conclusion, bidirectionality in itself is somewhat beneficial, both in general and specifically for aggregating the syntactically correct context tokens.
As a next step, we compare the GCN ablations to PARENT. Like unlabeled GCNs, PARENT does not rely on the labels and provides a different way to incorporate the graph structure, which is still shown to be successful. We note that while labels are not incorporated, they appear as transition inputs and can be attended to. Comparing the two architectures, PARENT shows significant gains over Unlabeled GCN. Despite being easier to implement and much lighter in terms of memory, time and hyperparameters, PARENT generally outperforms Unlabeled GCN both in overall performance and on specific challenges. PARENT is slightly better than the unablated GCN on En-De and slightly worse on De-En; compared to the GCN variant, it is better on 3 of the 5 De-En phenomena and one of the En-De phenomena.

Noise Robustness
Preliminary experiments indicate that syntactic architectures may be more sensitive to noisy training data than the vanilla Transformer, possibly amplifying parser errors. To test this, we trained on the full WMT data for En-Ru, which is mostly crawled data. Results show that the improvement in chrF+ is smaller, 1 point instead of 1.5-2.5 in other settings, and BLEU scores are somewhat worse (see App. §E.1). It seems then that overall, the inclusion of noisy data diminishes the relative improvement.
An alternative explanation to these results may be that our methods contribute less in the presence of more training data. Our positive results on En-De and De-En, that use relatively large amounts of data (4.5M sentence pairs), show that if this is indeed the case, saturation is slow.

Qualitative Analysis
To complement the automatic challenges, we compile a set of 99 simple subject-verb-object sentences where the German object and subject can swap locations without affecting the meaning. We created three sets of sentences, where the case marking for the subject and object may or may not be ambiguous. For example, Das Pferd bringt der Vater and Der Vater bringt das Pferd both translate to the father brings the horse. Such examples are of particular interest to us here, as the case of the first noun phrase is ambiguous ("Das Pferd" could be either a subject or an object) and is only disambiguated by the case marking of the second one. These cases require some understanding of the syntax to translate correctly. See App. §C.
A native-speaking German annotator, fluent in English, then evaluated the outputs of the medium-size PARENT and Vanilla models on these sentences. The ambiguous examples were found to be challenging for both systems, especially those with ambiguous case markings. However, overall, PARENT is more robust to the changes in order. Interestingly, both models (PARENT more consistently) translate some sentences into passive voice, keeping both the (changed) order and the meaning.

Related Work
While there are indications that Transformers implicitly learn some syntactic structure when trained as language models or as NMT (e.g., Jawahar et al., 2019; Manning et al., 2020), it is not at all clear whether such information replaces the utility of incorporating syntactic structure. Indeed, a considerable body of work suggests the contrary.
Much previous work tested RNN-based and attention-based systems for their ability to make structural generalizations (Welleck et al., 2021;Csordás et al., 2021;Ontanón et al., 2021). Syntactic generalizations seem to pose a particularly difficult challenge (Ravfogel et al., 2019;McCoy et al., 2019). Moreover, while NMT often succeeds in translating inter-dependent linearly distant words, their performance is unstable: the same systems may well fail on other "obvious" cases of the same phenomena (Belinkov and Bisk, 2017; Choshen and Abend, 2019). This evidence provides motivation for efforts such as ours, to incorporate linguistic knowledge into the architecture.
Syntactic structure was used to improve various tasks, including code generation (Chakraborty et al., 2018), question answering (Bogin et al., 2020), automatic proof generation (Gontier et al., 2020), language modelling (Wilcox et al., 2020) and grammatical error correction (Harer et al., 2019). Such approaches, however, are task specific. E.g., the latter makes strong conditional independence assumptions, and is less suitable for MT, where the source and target syntax may diverge considerably.
In NMT, Aharoni and Goldberg (2017a) proposed to replace the source and target tokens with a linearized constituency graph, and a similar approach was later taken using CCG parses. Other work proposed a tree-based attention mechanism to encode source syntax, or incorporated the first layers of a parser in addition to the source-side token embeddings. Relatedly, previous work showed gains from using syntactic information for preprocessing (Ponti et al., 2018; Zhou et al., 2019a).
Much fewer works focused on structure-based decoding. Eriguchi et al. (2017), building on Dyer et al. (2016), train a decoder in a multi-task setting of translation and parsing. Notably, unlike in the method we propose, their generated translation is not constrained by the parse during decoding. A few works proposed alternating between two connected RNNs, one translating and one creating a linearized graph, using either a tree-based RNN or transition-based parsing. Gū et al. (2018) both parse and generate, using a recursive RNN representation.
Other work changed RNNs (Tai et al., 2015) or Transformers to include structural inductive biases, but without explicit syntactic information. One line of work suggested an unsupervised way to train Transformers that learn tree-like structures, following the intuition that such representations are more similar to syntax. Shiv and Quirk (2019) encoded tree-structured data in the positional embeddings.

Discussion
The work we presented is motivated from several angles. First, we note that Transformers are trained in the same way as former sequence-to-sequence models (e.g., RNNs), and to many, they are just a better architecture for the same task. Instead, our work emphasizes the possibility of conditional training using Transformers; namely, Transformers should be able to predict the third token given the first two, even without previously predicting them. Although generally not implemented this way, Transformers are already conditional networks, and allow for flexibility not found in RNNs.
The finding that MT quality changes between the beginnings and ends of predicted sentences, both in RNNs and in Transformers (Zhou et al., 2019b), further motivates conditional translation. This is often explained by lack of context and disregard for the future tokens. Such future context is used by humans (Xia et al., 2017) and can potentially improve NMT (Tu et al., 2016; Mi et al., 2016). Moreover, as the encoded input is constant throughout the prediction, the varying performance is likely due to the decoder. Attending to all predictions from lower layers, as we propose here, aims to provide more of this required information. Finally, previous work investigated the reasons why incorporating source syntax helps RNNs (Shi et al., 2018) and Transformers (Pham et al., 2019; Sachan et al., 2020). These works show evidence that similar gains can be obtained when incorporating either syntactic trees or non-syntactic, syntactically uninformative ones. A hypothesis followed that graph-like architectures are helpful, but that syntactic information is redundant. While GCN creates such an architecture, linearized syntax, and arguably PARENT and to some extent the GCN label component, do not. Still, they allow gains over the vanilla decoder, which challenges this hypothesis.

Conclusion
We presented a flexible method for constructing decoders capable of outputting trees and graphs. We show that the improved decoder achieves notable gains in syntactic generalization, and in some settings improves overall performance as well. Our proposal is based on two main modifications to the standard Transformer decoder: (1) autoregression on structure; (2) bidirectional attention in the decoder, which allows recomputing token embeddings in light of newly decoded tokens. Testing two variants of the decoder, we find that both show superior syntactic generalization abilities over the vanilla Transformer, and that the gap does not diminish with model size. The method is flexible enough to allow decoding into a wide variety of graph and tree structures.
Our work opens many avenues for future work. One direction would be to focus on conditional networks, training with (intentionally) noisy prefixes, randomly masking "predicted" spans during training (as done in masked language models; Devlin et al., 2019), and data augmentation through hard words or phrases rather than full sentences. Another direction might enhance bidirectionality by allowing "regretting" and changing past predictions. Finally, the work opens possibilities for better incorporating structure into language generators, for incorporating semantic structure, and for enforcing meaning preservation (thus targeting hallucinations; Wang and Sennrich, 2020), by incorporating source and target structure together.

References

Roee Aharoni and Yoav Goldberg. 2017b. Towards string-to-tree neural machine translation. In ACL.

A From sequence-to-sequence to conditional

Attention-based models are characterized by being state-less. They can, therefore, be viewed as conditional language models, namely as models for producing a distribution for the next word, given the generated prefix and source sentence. It is possible to re-encode other information (not only the decoded output) into the decoder at each step, or to predict only tokens of interest, rather than the complete sequence. It is also possible to change the source sentence partially or completely (e.g., adding noise to increase robustness), to condition on additional information (§4) and to adjust this information during prediction (e.g., to force predicted word characteristics). Nevertheless, the standard practice is to only re-encode past predictions. Unlike RNNs, attention-based models do not inherently rely on past predictions in terms of inputs, weights and gradients. The only connection to past predictions is mediated through their re-encoding back into the decoder.
RNNs receive past states as inputs. Backpropagation through time sees the current network as connected to the previous networks supplying the state input. Thus, the gradients take into account past predictions as well.
In contrast, Transformers have gradients over representation of past words only if they are fed into the network. Unlike backpropagation through time, the preceding tokens can be changed, or even omitted (e.g., in a limited window size scenario). Specifically, in our case, preceding tokens may have different representations at each generation step.
To sum, the representation is updated to provide good representation for the current step, but it is not calculated over the actual network of the previous step. It is often the case, though, that the previous decoded words are auto-regressed and hence updated.
This architecture, therefore, allows more flexibility than RNNs. Still, Transformers are often thought of as an extension to RNNs, i.e., as sequence-to-sequence models. For that reason, it is rare to find changes to the training schedule that incorporate more knowledge, change "past" information, or translate only parts of a sentence with a network. With such methods, for example, one can dynamically force features of the next prediction (by changing the input) or augment learning by teaching the network only over hard cases. Such an approach may choose augmented data in the regular way, but stop the prediction at the part of the sentence one wishes the network to learn, or even teach it several alternatives with the same prefix.

B Experimental Setup
The code is adapted from the NEMATUS code repository and will be released upon publication. All hyperparameters are either taken from the original suggestions or optimized for the vanilla Transformer and used as is for our suggested models.
Networks are all trained with batch size 128, embedding size 256, 4 decoder and encoder blocks, 8 attention heads (one of which may be a parent head; §4), 90K steps (where empirically some saturation is reached, making this a relatively fair comparison (Popel and Bojar, 2018)), learning rate 1e-4, 4K warm-up steps, and the Adam optimizer (Kingma and Ba, 2015) with betas of 0.9 and 0.999 for the first and second moments and an epsilon of 1e-8. We use the standard (structure-unaware) Transformer encoder in all our experiments. Each model was trained on 4 NVIDIA Tesla M60 or RTX 2080Ti GPUs for approximately a week (2 for the GCN architecture); large models were trained on an RTX 6000.
Preprocessing includes truecasing, tokenization as implemented by Moses (Koehn et al., 2007) and byte pair encoding (Sennrich et al., 2016) without tying. Empty source or target sentences were dropped. In training, the maximum target sentence length is 40 non-transition tokens (BPE).
We used UDPipe models trained on UD 2.0 for English and German, and on UD 2.5 (the SynTagRus treebank) for Russian.
In unreported trials, we found that whenever noisy, crawled data is used, filtering is crucial for even the baselines to show reasonable results. On full En-Ru (see §6.2), we filter out unexpected languages with langID (Lui and Baldwin, 2012) and improbable alignments (p < −180) with FastAlign (Dyer et al., 2013). Overall, about half the sentences were filtered by these measures or by length.
After filtering, En-De contains 4,066,323 training sentences (4,468,840 before filtering); En-Ru contains 19,557,568 (37,948,456 before). The English challenge sets contain, for books and news respectively, 1,188 and 11 reflexive, 3,953 and 17 particle, and 191 and 8 preposition-stranding sentences; the German sets contain 2,628 and 261 reflexive and 7,584 and 232 particle sentences. WMT dev and test sets always contain about 3K sentences.
We use two automatic metrics: detokenized BLEU (Papineni et al., 2002), as implemented in Moses, as the standard measure, and chrF+ (Popovic, 2017), as it was shown to correlate better with human judgments while remaining simple and interpretable (Ma et al., 2019). We obtain the chrF+ score by running chrF++.py with 1 word n-gram and a beta of 3, as in WMT19 (Ma et al., 2019). Both metrics rely on n-gram overlap between the hypothesis and the reference: BLEU focuses on word precision, while chrF+ balances precision and recall and uses character as well as word n-grams.
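To make the precision/recall balance concrete, the following is a simplified chrF+-style scorer: character n-grams up to order 6 plus word unigrams, combined with an F-beta where beta = 3 weighs recall more heavily. This is a sketch of the idea in Popovic (2017), not a reimplementation of the official chrF++.py (which differs in averaging and smoothing details).

```python
from collections import Counter

def _ngrams(seq, n):
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))

def chrf_plus(hyp, ref, char_order=6, beta=3.0):
    """Simplified chrF+: char n-grams (whitespace removed) plus word
    unigrams, macro-averaged, combined via F-beta with beta = 3."""
    units = [(list(hyp.replace(" ", "")), list(ref.replace(" ", "")), n)
             for n in range(1, char_order + 1)]
    units.append((hyp.split(), ref.split(), 1))  # the "+ 1 word" part
    precisions, recalls = [], []
    for h, r, n in units:
        hc, rc = _ngrams(h, n), _ngrams(r, n)
        overlap = sum((hc & rc).values())
        precisions.append(overlap / max(sum(hc.values()), 1))
        recalls.append(overlap / max(sum(rc.values()), 1))
    P = sum(precisions) / len(precisions)
    R = sum(recalls) / len(recalls)
    if P + R == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * P * R / (b2 * P + R)
```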
Transitions. We made two practical choices when creating the transition graph. First, we deleted the root edge, as the root is not a word in the translation.
Second, we train only on projective parses.
This choice reduces noise due to the low reliability of current non-projective parsers (Fernández-González and Gómez-Rodríguez, 2018), while losing few training sentences. We note, however, that this choice is not without risks: it may be less suitable for languages in which non-projective sentences are common.
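Checking projectivity amounts to checking that no two dependency arcs cross. A minimal sketch (not the paper's actual filtering code):

```python
def is_projective(heads):
    """heads[i] is the 1-based head of token i+1, with 0 for the root.
    A tree is projective iff no two arcs cross, i.e., there is no pair
    of arcs with spans (lo1, hi1), (lo2, hi2) such that
    lo1 < lo2 < hi1 < hi2."""
    arcs = [(head, dep) for dep, head in enumerate(heads, start=1)]
    spans = [tuple(sorted(arc)) for arc in arcs]
    for lo1, hi1 in spans:
        for lo2, hi2 in spans:
            if lo1 < lo2 < hi1 < hi2:  # partially overlapping spans
                return False
    return True
```

Sentences for which this check fails would be dropped from training under the choice described above.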
The transitions serve as part of the NMT vocabulary. There are 45 labels and two directions of connection, summing up to 90 new tokens. This hardly affects the standard vocabulary size, which usually consists of tens of thousands of tokens (Ding et al., 2019). We treat both token and transition predictions in the same way, and do not rescale their scores as done in Stanojević and Steedman (2020). If anything, the need to memorize more should hurt performance, so any performance increase comes despite enlarging the vocabulary, not because of it. It is possible to split the tokens into separate direction and label tokens (47 symbols in total), but this comes at the cost of longer sentences, which increase training time and memory consumption. We did not experiment with other methods for encoding the transitions (e.g., embedding labels and edges separately).
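The 45-label, two-direction construction above can be sketched as follows. The token surface form ("@L:nsubj") and the short label list are illustrative assumptions, not the paper's actual encoding; the full inventory would contain all 45 dependency labels.

```python
# Illustrative subset of the 45 dependency labels (full set omitted).
UD_LABELS = ["nsubj", "obj", "iobj", "det", "amod"]

def transition_vocab(labels):
    """Combine each label with two arc directions (L/R), yielding
    2 * len(labels) transition tokens to append to the BPE vocabulary.
    With all 45 labels this gives the 90 new tokens described above."""
    return [f"@{direction}:{label}"
            for label in labels
            for direction in ("L", "R")]
```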

C Mixup challenge
We follow the findings of Bisazza et al. (2021) that Transformers are able to learn free-order languages given case marking. Given those findings, we ask whether Transformers are indeed robust to swapping argument order where case marking exists. The results per condition are:

Marked argument  Vanilla  PARENT
Object           6        6
Subject          5        8
Both             10       13
To do so, we take lists of nouns and verbs from which to create simple sentences. We then create three types of sentences, validated by an in-house annotator, a native German speaker, to be correct and to convey the same meaning under both orders: sentences with both arguments case-marked, such as den Ball bringt der Hund (the dog brings the ball); sentences with only the subject marked, such as das Pferd drängt der Hund (the dog urges the horse); and sentences with only the object marked.
We then switch the object and subject and calculate how often the translation is correct in terms of argument placement, disregarding other errors such as the choice of verb in English.
Interestingly, as seen in the results (§E), both networks perform poorly on this task, although the syntactic variant is better.

D Results with the Large Models
We include the full results for the two larger models, PARENT and Vanilla. While overall results are comparable, PARENT consistently performs better on the challenge sets, often by large margins.