Dependency parsing with structure-preserving embeddings

Modern neural approaches to dependency parsing are trained to predict a tree structure by jointly learning a contextual representation for the tokens in a sentence and a head-dependent scoring function. While this strategy results in high performance, it is difficult to interpret these representations in relation to the geometry of the underlying tree structure. Our work seeks instead to learn interpretable representations by training a parser to explicitly preserve structural properties of a tree. We do so by casting dependency parsing as a tree embedding problem, incorporating geometric properties of dependency trees in the form of training losses within a graph-based parser. We provide a thorough evaluation of these geometric losses, showing that a majority of them yield strong tree distance preservation as well as parsing performance on par with a competitive graph-based parser (Qi et al., 2018). Finally, we show where parsing errors lie in terms of tree relationships in order to guide future work.


Introduction
Dependency grammars are syntactic formalisms that represent the syntactic structure of a sentence as asymmetric binary grammatical relations among words (Tesnière, 1959; Hudson, 1984; Melcuk, 2003). An example dependency structure is given in Figure 1. Formally, a dependency structure is defined as a directed graph where words are vertices and relations are labelled directed edges (the arcs) between a child (the dependent) and its parent (the head). In practice, dependency structures considered for syntactic analysis are trees. Dependency trees have long been used to improve the performance of many NLP applications, including machine translation (Ding and Palmer, 2004; Menezes et al., 2010; Bastings et al., 2017), relation extraction (Kambhatla, 2004; Bunescu and Mooney, 2005; Miwa and Bansal, 2016), and semantic role labeling (Hacioglu, 2004; Marcheggiani and Titov, 2017; He et al., 2018).
In order to assign the correct dependency tree to a sentence, dependency parsers are trained to correctly identify head-dependent relations between pairs of words. Modern neural approaches do so by jointly learning contextual feature representations for the tokens in a sentence, as well as a parsing decision function. This is the case for recent graph-based parsers (Zhang et al., 2017; Dozat et al., 2017; Mohammadshahi and Henderson, 2020, inter alia), where an encoder feature extractor is complemented by a score function predicting the likelihood of a word being the head of another. However, whereas this joint learning strategy results in state-of-the-art performance, the representations learned by these parsers are opaque.
As a first step towards learning interpretable parser representations, here we take a different approach: in addition to learning to parse, we seek to learn representations from which tree distances between words in dependency trees can be recovered. This stems from one simple observation: previous approaches do not take into account the geometry of the tree they try to model. That is, parsers are unaware of the structural properties of the tree (e.g., distance between nodes, depth from root), and as such are not trained to explicitly preserve these properties. In this respect, our approach is aligned with recent work looking at these geometric properties in the context of probing BERT (Devlin et al., 2019): Hewitt and Manning (2019) have shown that it is possible to recover approximate syntactic trees from BERT embeddings by a linear transformation trained to minimize the difference between predicted and ground-truth tree distances. Given these results we then ask: is it possible to extend this idea to directly train tree-aware dependency parsers? We argue that using the geometric tree structure to embed the dependency trees enhances the interpretability of the learned representations. In this paper, we show that this is indeed possible by casting dependency parsing as a tree embedding problem. Specifically, we view a dependency tree as a finite metric space, and compute head-dependent scores for all word pairs within a sentence as follows: Given a sentence s = (w_0, . . . , w_{n_s}), where w_0 is a special ROOT token, we compute a geometric tree embedding φ : {w_0, . . . , w_{n_s}} → R^m that maps tokens w_i to m-dimensional vectors. Geometric properties of an ideal isometric tree embedding are used to define the functional form of the head-dependent (edge) scores ψ(w_i, w_j). Concretely, our approach predicts dependency trees only from pairwise embedded node distances, which completely specify the score function.
We consider this a step in the direction of interpretable end-to-end dependency parsing. We start with a straightforward application of Hewitt and Manning (2019) and consider a mean absolute error (MAE) loss that encourages the embedding φ to approximate an isometric embedding of the ground-truth tree T_s into (R^m, d_R).
In this paper, we use the squared Euclidean distance semi-metric (d_2^2) as in Hewitt and Manning (2019), and also consider the distance obtained from the ℓ1-norm (d_1). We show formally that any isometric d_2^2 embedding can be simply rotated to form an isometric d_1 embedding.
We learn the tree embedding φ and the edge score function ψ through end-to-end training, by incorporating geometric properties of dependency trees (in terms of distance and depth) in the form of geometric losses within a graph-based parser. As our base parser, we use a simplified version of the biaffine parser of Qi et al. (2018). This setup allows us to directly compare the performance of our losses and score functions to the biaffine score function used in several state-of-the-art graph parsers. We propose three additional losses for training a dependency parser, expressed explicitly through tree distances: a maximum likelihood estimation loss, a margin-based loss, and one based on cross-entropy.
Finally, we explore whether adding a soft global constraint on the isometry of the learned trees helps with parsing performance; to this end, we combine our novel loss functions with the MAE loss.
We evaluate our approach on 16 languages from different families. We complement the unlabeled accuracy of head-dependent attachment scores (UAS) with Spearman's ρ correlation between predicted and true distances (DSpr) to directly assess geometric properties of the output trees. We also provide labeled attachment scores (LAS) for completeness. Through extensive experimentation, we make the following observations:
• All of our novel tree-distance-based losses outperform the MAE loss of Hewitt and Manning (2019).
• All losses using the d_1 metric provide better distance preservation properties and dependency parsing performance than using the d_2^2 semi-metric.
• Five of the six loss combinations (using d_1) show both strong distance preservation properties and parsing performance, indicating that distance preservation can be obtained without trading off parsing performance. Only the maximum likelihood estimation loss (on its own) has poor distance preservation; however, we find that the combination of this loss with the MAE loss greatly improves distance preservation, while achieving similar or better parsing performance.
• We show that the majority of parsing errors are local in tree distance, with by far the most frequent incorrect head assignments being either true sisters or grandparents.
Our results, which move in the direction of accurate dependency parsers that closely preserve tree distances, are encouraging.

Background: Metric Tree Embeddings
Given a tree T = (V, E), the tree distance d_T(u, v) is the length of the shortest path between any u, v ∈ V. In this paper we consider tree embeddings, φ : V → R^m, which map nodes v ∈ V to points φ(v) ∈ R^m such that the mapping φ approximately preserves tree distance. That is, for all pairs u, v ∈ V, we have d_R(φ(u), φ(v)) ≈ d_T(u, v), where d_R is a distance measure on R^m. In this section, we discuss the choice of d_R and illustrate distortion-free (i.e., isometric) embeddings φ : V → R^m. These distortion-free embeddings motivate the formulation of losses that we use for training suitable embeddings, as discussed in § 3.
To choose the distance measure d_R in the embedding space, we note that for a sufficiently large dimension m: i) any tree can be embedded isometrically into ℓ1; ii) any metric space (including trees) can be embedded into ℓ∞; and iii) for ℓp spaces, with 1 < p < ∞, trees can only be embedded with distortion (Linial et al., 1995). The power transform of the Euclidean distance, d_2(x, y)^c with c ≥ 2, allows for isometric embedding of trees (Reif et al., 2019). However, the squared Euclidean distance d_2(x, y)^2 does not satisfy the triangle inequality and is therefore only a semi-metric. Nevertheless, both (R^m, d_1) and (R^m, d_2^2) are natural choices for embedding spaces for trees and, in this paper, we restrict our attention to these.
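The triangle-inequality failure for d_2^2 can be checked with a two-line computation; a minimal sketch in Python:

```python
# Three collinear points with successive pairs at squared-Euclidean
# distance 1: the outer pair is at distance 4 > 1 + 1, so the triangle
# inequality fails and d_2^2 is only a semi-metric.
def d22(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

p, q, r = (0.0,), (1.0,), (2.0,)
assert d22(p, q) == 1.0 and d22(q, r) == 1.0
assert d22(p, r) == 4.0
assert d22(p, r) > d22(p, q) + d22(q, r)  # triangle inequality violated
```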
(Recall that a metric space is a 2-tuple (X, d_X) consisting of a set of elements X and a metric d_X : X × X → [0, ∞) quantifying a notion of distance between any pair of elements of X. To see that d_2^2 is only a semi-metric, consider three points on a line with successive pairs separated by a d_2^2 distance of 1; the outer two are then separated by a d_2^2 distance of 2^2 = 4, which is larger than the sum of the distances between the successive pairs.)

We follow Reif et al. (2019) to explicitly construct squared Euclidean embeddings. Specifically, all distortion-free embeddings into (R^m, d_2^2) can be simply expressed in terms of the edge displacement vectors {z_i}_{i=1}^{|E|}, where z_i ∈ R^m is the displacement between the embedded endpoints of edge e_i ∈ E (i.e., z_i := φ(c) − φ(p), where c, p ∈ V are a pair of child and parent nodes). For an isometric embedding, it turns out we require these z_i's to be orthonormal, that is,

Z^T Z = 1, (1)

where Z ∈ R^{m×|E|} is the matrix having z_i as the i-th column and 1 denotes the |E| × |E| identity matrix. In addition, it is useful to define ρ(u, v) to denote the shortest path between two vertices u, v ∈ V. And, finally, define the indicator vector b(u, v) ∈ Z^{|E|} such that b_i(u, v) = 1 when edge e_i is on the shortest path between u and v, and b_i(u, v) = 0 otherwise. With this notation we have the following theorem:

Theorem 2.1 (Pythagorean Embeddings). Given a rooted tree (V, E, r), where r ∈ V denotes the root, then for any m ≥ |E| there exists an embedding φ : V → R^m such that

d_2^2(φ(u), φ(v)) = d_T(u, v) for all u, v ∈ V, (2)

φ(v) = φ(r) + Z b(r, v), (3)

for some matrix Z ∈ R^{m×|E|} with orthonormal columns (i.e., Eqn. (1) is satisfied).
See Appendices B–D for an example that demonstrates this construction, along with proofs.
The edge-on-path indicator vectors b(r, v) provide some additional intuition about these squared Euclidean embeddings. Specifically, for v, w ∈ V we have

g(v) := Z^T (φ(v) − φ(r)) = b(r, v), (4)

||g(v) − g(w)||_1 = ||b(r, v) − b(r, w)||_1 (5)
= d_T(v, w). (6)

Here (5) follows from (1), (3) and (4). Therefore g is an isometric ℓ1 embedding which expresses tree distance in terms of the ℓ1 norm of the difference between two path vectors b. One interesting consequence of Theorem 2.1, along with Eqn.'s (4) and (6), is:

Corollary 2.1 (ℓ1 Embeddings). Given any isometric embedding φ : V → R^m using the squared Euclidean distance, the embedding g(v) = Z^T φ(v) is an isometric embedding of the same tree (V, E, r) into (R^{|E|}, d_1). Here Z is as described in (3).
That is, any distance-preserving tree embedding using the squared Euclidean norm can simply be rotated to an isometric ℓ1 tree embedding. Note the converse of Cor. 2.1 is not true. For example, three equally spaced points on a line form an ℓ1 embedding that cannot be linearly transformed to a d_2^2 embedding. Indeed, it is shown in (Aksoy et al., 2020) that a finite metric tree can be embedded into (R^m, d_1) if and only if it has at most 2m leaves. Thus, for a fixed dimensional embedding space R^m, the metric d_1 allows for more trees to be isometrically embedded than d_2^2. In this paper we use an embedding dimension m that is larger than the number of edges, and thus isometric tree embeddings are feasible using both d_2^2 and d_1. The learning problem considered in this paper is: given a sentence s, we seek to embed each of the sentence's tokens into (R^m, d_R) such that the embedded tree is nearly isometric to the dependency parse tree for s. We evaluate using both d_1 and d_2^2 for d_R.
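Theorem 2.1 and Corollary 2.1 can be checked concretely on a small tree. The sketch below (plain Python; the toy tree and helper names are illustrative, not from the paper) embeds a five-node tree using standard basis vectors as the edge displacements z_i, so Z has orthonormal columns and the same coordinates serve as both the d_2^2 embedding and, via Z^T φ(v) = b(r, v), the ℓ1 embedding:

```python
# Embed a small rooted tree via Theorem 2.1: each edge e_i gets the
# standard basis vector u_i as its displacement z_i (orthonormal), and
# phi(v) = phi(r) + sum of z_i over edges on the root-to-v path.
parent = {"a": "r", "b": "r", "c": "a", "d": "a"}  # toy tree, root "r"
edges = list(parent)                               # one edge per non-root node

def path_to_root(v):
    out = []
    while v != "r":
        out.append(v)  # identify each edge by its child endpoint
        v = parent[v]
    return out

m = len(edges)
basis = {e: [1.0 if j == i else 0.0 for j in range(m)]
         for i, e in enumerate(edges)}

def phi(v):
    vec = [0.0] * m
    for e in path_to_root(v):
        vec = [a + b for a, b in zip(vec, basis[e])]
    return vec

def d_tree(u, v):
    pu, pv = path_to_root(u), path_to_root(v)
    return len(pu) + len(pv) - 2 * len(set(pu) & set(pv))

nodes = ["r", "a", "b", "c", "d"]
for u in nodes:
    for v in nodes:
        d22 = sum((a - b) ** 2 for a, b in zip(phi(u), phi(v)))
        d1 = sum(abs(a - b) for a, b in zip(phi(u), phi(v)))
        assert d22 == d_tree(u, v)  # Pythagorean (d_2^2) isometry
        assert d1 == d_tree(u, v)   # here Z is the identity, so g = phi is the l1 embedding
```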

Geometric Losses for Tree Embeddings
Given a sentence s = (w_0, . . . , w_{n_s}), where w_0 is a special ROOT token inserted at the beginning of every sentence, dependency parsing seeks to recover the correct dependency tree T_s = (V_s, E_s).
For simplicity, we label the tree nodes in V_s with the tokens themselves, so V_s = {w_0, . . . , w_{n_s}}. A geometric tree embedding maps tokens w_i within a sentence s to embedded points v_i = φ(i, s) ∈ R^m (for brevity, we drop the dependence on the whole sentence s and simply write φ(w_i)). In § 2, we described the exact geometry of the isometric embeddings using the d_2^2 semi-metric. For d_1 isometric embeddings we showed that one sub-class consists of simple rotations of isometric d_2^2 embeddings, but other forms exist. This section examines the use of auxiliary losses on the embedding φ that encourage approximately isometric embeddings. We expect an approximation of this geometry to hold in d_2^2 when the losses are sufficiently small. Since we do not have a similar proof of necessity, as in Appendix D, for d_1 embeddings, the local geometry is more open. Given such an embedding, we then follow first-order graph-based dependency parsers (McDonald et al., 2005), which compute a pairwise score ψ(v_i, v_j) indicating how likely it is for w_j to be the head of w_i. These scores provide edge weights on a fully connected embedded graph on φ(V_s) × φ(V_s). Having trained a network to compute suitable embeddings v_i = φ(w_i) and edge weights ψ(v_i, v_j), parsing then amounts to finding the maximum spanning tree in this weighted graph that is rooted at v_0; a detailed description of the parser architecture is provided in § 4.
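Before turning to the losses, it is worth seeing why near-isometry makes decoding easy: if embedded distances approximate tree distances, the true edges are exactly the closest pairs, and even a plain undirected minimum spanning tree over the distance matrix recovers the tree skeleton. The sketch below illustrates this with Prim's algorithm on a toy chain; this is only an illustration, not the directed maximum-spanning-tree decoding the parser itself uses (§ 4):

```python
# If the embedding is (near-)isometric, true tree edges are the word
# pairs at embedded distance ~1, so Prim's algorithm on the pairwise
# distance matrix recovers the undirected tree skeleton.
def prim_mst(dist):
    """dist: symmetric n x n list of lists; returns a set of frozenset edges."""
    n = len(dist)
    in_tree, tree_edges = {0}, set()
    while len(in_tree) < n:
        # cheapest edge leaving the current tree
        i, j = min(((i, j) for i in in_tree for j in range(n) if j not in in_tree),
                   key=lambda ij: dist[ij[0]][ij[1]])
        tree_edges.add(frozenset((i, j)))
        in_tree.add(j)
    return tree_edges

# Toy near-isometric 1-D embedding of the chain 0 - 1 - 2 - 3.
points = [0.0, 1.02, 1.97, 3.01]
dist = [[abs(a - b) for b in points] for a in points]
assert prim_mst(dist) == {frozenset((0, 1)), frozenset((1, 2)), frozenset((2, 3))}
```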
Given this general approach, a natural choice for an auxiliary loss on φ is to consider the mean absolute error (MAE) in the distances between the embeddings of any two nodes:

L_MAE(s) = (1 / (n_s + 1)^2) Σ_{i,j} | d_R(v_i, v_j) − d_T(w_i, w_j) |, (7)

where v_i = φ(w_i) (we drop the subscript s for brevity). This MAE loss treats the distance errors for all pairs of nodes as equally important, and we therefore refer to it as a global loss. Note the loss in (7) is zero only when the embedding φ is an isometric tree embedding with respect to d_R.

We next consider the edge scoring function ψ, whose role is to assign costs to proposed head-dependent pairs (w_i, w_j). The ground-truth head-dependent relation (which holds when w_j is the head of w_i in T) can be defined purely in terms of tree distances:

w_j is the head of w_i ⟺ d_T(w_i, w_j) = 1 and Δ_T(w_i, w_j) = 1, (8)

where Δ_T(w_i, w_j) := d_T(w_0, w_i) − d_T(w_0, w_j) is the depth difference. Each node w_i then has a unique head w_j defined by (8), except for the ROOT w_0, which has none. Given that the embedding φ provides a near-isometric tree, we define an edge scoring function in the embedding space, namely ψ(v_i, v_j), by rewriting (8) in the following probabilistic form:

d_φ,ij := d_R(v_i, v_j), (9)
Δ_φ,ij := d_R(v_0, v_i) − d_R(v_0, v_j), (10)
φ_ij := |d_φ,ij − 1| + |Δ_φ,ij − 1|, (11)
ψ(v_i, v_j) := −φ_ij, (12)
p_φ(w_j | w_i) := exp(ψ(v_i, v_j)/τ) / Σ_{k≠i} exp(ψ(v_i, v_k)/τ), (13)

where τ is a temperature parameter. Here we refer to φ_ij as the "head-dependent cost" for the pair w_i and w_j.
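The scoring pipeline can be sketched as follows, assuming a head-dependent cost that combines a distance term and a depth-difference term (|d − 1| + |Δ − 1|) as described above; the helper names and exact functional form are our illustrative reading, not the paper's verbatim implementation:

```python
import math

# Sketch of the head-dependent cost and probability: phi_ij combines a
# distance term |d(v_i, v_j) - 1| and a depth term |delta_ij - 1|,
# where delta_ij = d(v_0, v_i) - d(v_0, v_j) is the depth difference
# relative to ROOT. Assumed reading of the cost; details may differ.
def d1(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def head_cost(v, i, j):
    dist = d1(v[i], v[j])
    depth_diff = d1(v[0], v[i]) - d1(v[0], v[j])
    return abs(dist - 1.0) + abs(depth_diff - 1.0)

def p_head(v, i, tau=1.0):
    """Softmax over candidate heads j != i for dependent i."""
    costs = {j: head_cost(v, i, j) for j in range(len(v)) if j != i}
    z = sum(math.exp(-c / tau) for c in costs.values())
    return {j: math.exp(-c / tau) / z for j, c in costs.items()}

# Isometric embedding of the chain ROOT -> w1 -> w2: the true head of
# w2 is w1, whose cost is 0, so it receives the highest probability.
v = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
assert head_cost(v, 2, 1) == 0.0
probs = p_head(v, i=2)
assert max(probs, key=probs.get) == 1
```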
A natural loss on both the embedding φ : V → R^m and the edge cost ψ is the negative log-likelihood of the ground-truth head-dependent pairs:

L_MLE(s) = −(1/n_s) Σ_{i=1}^{n_s} log p_φ(head(w_i) | w_i), (14)

where p_φ(w_j | w_i) is defined in (13) and head(w_i) denotes the ground-truth head of w_i.
As an alternative to the MLE loss we consider a margin-based approach where the task is to minimize φ_ij for the true head j, subject to φ_ik ≥ φ_ij + α for all k ∈ {0, . . . , n}\{i, j}, where α > 0 is a margin. We explore such a margin-based approach using the soft triplet loss (Sohn, 2016):

L_α(s) = (1/n_s) Σ_{i=1}^{n_s} log ( 1 + Σ_{k∉{i,j}} exp(φ_ij + α − φ_ik) ), (15)

where j indexes the true head of w_i. As a third alternative for the head-dependent loss we consider the cross-entropy between the probability distribution p_φ(w_j | w_i) and the corresponding distribution p_T(w_j | w_i) formed by using an isometric embedding φ_T in Eqns. (9)–(13). Note that for an isometric embedding φ_T we have d_{φ_T}(v_i, v_j) = d_T(w_i, w_j), and this can be used to simplify the resulting expression. Specifically, we find p_T(w_j | w_i) depends only on the true tree distance d_T and not on the details of φ_T. The cross-entropy loss is then

L_CE(s) = −(1/n_s) Σ_{i=1}^{n_s} Σ_{j≠i} p_T(w_j | w_i) log p_φ(w_j | w_i). (16)

In summary, we investigate the choice between several different head-dependent losses (i.e., Eqn.'s (14), (15), or (16)), and optionally combine each of these with the explicit global MAE loss (7).
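The three losses can be sketched per dependent token as below. These follow our reading of the definitions (softmax over negated costs for MLE, a soft triplet margin, and cross-entropy against tree-distance-derived targets) and are illustrative rather than the paper's exact implementation:

```python
import math

# Hedged sketches of the three head-dependent losses for one dependent,
# given its cost row (one cost per candidate head) and the gold head
# index. Exact forms in the paper may differ.
def softmax_neg(costs, tau=1.0):
    exps = [math.exp(-c / tau) for c in costs]
    z = sum(exps)
    return [e / z for e in exps]

def loss_mle(costs, gold, tau=1.0):
    # negative log-likelihood of the gold head
    return -math.log(softmax_neg(costs, tau)[gold])

def loss_margin(costs, gold, alpha=3.0):
    # Soft triplet (Sohn, 2016): push every wrong head's cost above the
    # gold head's cost by at least the margin alpha.
    wrong = [c for k, c in enumerate(costs) if k != gold]
    return math.log(1.0 + sum(math.exp(costs[gold] + alpha - c) for c in wrong))

def loss_ce(costs, target_probs, tau=0.2):
    # cross-entropy against a target distribution over candidate heads
    p = softmax_neg(costs, tau)
    return -sum(t * math.log(q) for t, q in zip(target_probs, p) if t > 0)

costs = [2.0, 0.0, 2.0]  # gold head (index 1) has cost 0
assert loss_mle(costs, 1) < loss_mle([2.0, 1.0, 2.0], 1)   # lower gold cost, lower loss
assert loss_margin([2.0, 0.0, 4.0], 1) < loss_margin(costs, 1)  # larger margin, lower loss
assert loss_ce(costs, [0.0, 1.0, 0.0]) < 0.01              # near-perfect match
```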

Model
We use a simplified version of the Biaffine dependency parser of Qi et al. (2018), using the codebase provided at https://github.com/stanfordnlp/stanfordnlp. First, we give an overview of the Biaffine parser, and then describe our modifications. (Note, regarding the margin loss of § 3: an isometric embedding φ provides one solution, for which φ_ij = 0 for all head-dependent pairs and φ_ij ≥ 2 otherwise.) Biaffine is composed of a highway-BiLSTM encoder (Srivastava et al., 2015) that takes as input a sequence of n_s + 1 embeddings x_0, . . . , x_{n_s}, where each x_i is a concatenation of word-level, character-level, part-of-speech and morphological feature embeddings. We use pre-trained word2vec embeddings (Mikolov et al., 2013) when available, and fastText embeddings (Bojanowski et al., 2017) otherwise. We train the rest of the embeddings from scratch.
Given such an input sequence, the Biaffine parser predicts the most likely head for each word (referred to as unlabelled attachment prediction), along with the grammatical relation between each pair of head and dependent words (labelled attachment prediction). Biaffine first calculates contextual embeddings h_i (through the encoder), and then projects these into separate head and dependent representations for each word (through two separate MLP networks):

h_i^head = MLP^head(h_i),  h_i^dep = MLP^dep(h_i), (17)

where h_i^head and h_i^dep are the head and dependent representations. Next, for each pair of words w_i and w_j, a head-dependent score s_ij and a corresponding probability p(w_j | w_i) are calculated with a learnable biaffine weight U:

s_ij = (h_i^dep)^T U h_j^head,  p(w_j | w_i) = softmax_j(s_ij). (18)

Our geometric tree embedding φ computes a single representation for each node; as such, we replace the separate head and dependent MLP networks with a single MLP network:

v_i = MLP(h_i). (19)

Given v_0, . . . , v_n, head-dependent scores s_ij are defined as

s_ij = ψ(v_i, v_j), (20)

where ψ is calculated as in Eqn. (12). We obtain asymmetry in our score function ψ(v_i, v_j) from the depth difference term |Δ_φ,ij − 1| in Eqn. (11).
During inference, we use the Chu-Liu-Edmonds algorithm (Chi, 1999; Edmonds, 1967) to find the highest-scoring dependency tree. While our main focus is embedding unlabeled trees for dependency relation prediction, for completeness we also report results on labeled dependency tree prediction. We use the same classifier as Biaffine, with the same setting as described in Qi et al. (2018), to estimate the probabilities of dependency labels l_ij for a given head-dependent pair w_i and w_j:

p(l_ij | w_i, w_j) = softmax(s_rel_ij), (21)

where s_rel_ij ∈ R^K is a K-dimensional vector containing the dependency relation scores for each of the K dependency labels.

Performance Measures. We report the overall accuracy of head and relation predictions for all tokens in the test portion of the data sets. Given our model prediction and a reference parse for a given input, accuracy is calculated using two standard measures: Unlabelled Attachment Score (UAS), the percentage of tokens that are assigned the correct head; and Labelled Attachment Score (LAS), the percentage of tokens that are assigned both the correct head and the correct grammatical relation. We use UAS for model selection.
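UAS and LAS reduce to simple counting over tokens; a minimal sketch (the per-token tuple format is hypothetical):

```python
# Minimal UAS/LAS computation: UAS counts tokens with the correct head;
# LAS additionally requires the correct dependency label.
def attachment_scores(gold, pred):
    """gold/pred: lists of (head_index, label) per token, ROOT excluded."""
    total = correct_head = correct_both = 0
    for (gh, gl), (ph, pl) in zip(gold, pred):
        total += 1
        if gh == ph:
            correct_head += 1
            if gl == pl:
                correct_both += 1
    return 100.0 * correct_head / total, 100.0 * correct_both / total

gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "det")]
pred = [(2, "nsubj"), (0, "root"), (2, "iobj"), (2, "det")]
uas, las = attachment_scores(gold, pred)
assert (uas, las) == (75.0, 50.0)  # 3/4 heads correct, 2/4 head+label correct
```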

Experimental
To assess how well the learned tree embeddings preserve distances, we follow Hewitt and Manning (2019) and Hall Maudslay et al. (2020), and measure the correlation between the learned and ground-truth tree distances. Specifically, for all words in all sentences, we compute Spearman's ρ between predicted and true distances. We first average the correlation coefficients for sentences of the same length, and then report the macro-average over these averages for sentences of length 5–50, referred to as DSpr.
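The core of DSpr is a tie-aware Spearman correlation between predicted and true word-pair distances (tree distances contain many ties, so average ranks are needed); the per-length macro-averaging is elided here for brevity. A self-contained sketch:

```python
# Tie-aware Spearman's rho between two distance lists, as used per
# sentence by the DSpr metric (the macro-average over sentence lengths
# 5-50 is omitted in this sketch).
def _avg_ranks(xs):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1                     # extend the run of tied values
        r = (i + j) / 2.0 + 1.0        # average rank for the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = r
        i = j + 1
    return ranks

def spearman(xs, ys):
    rx, ry = _avg_ranks(xs), _avg_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

true_d = [1, 2, 1, 3, 2, 1]                 # ground-truth tree distances
pred_d = [0.9, 2.2, 1.1, 2.8, 1.9, 1.2]     # predicted embedded distances
assert abs(spearman(true_d, true_d) - 1.0) < 1e-9
assert spearman(true_d, pred_d) > 0.9
```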
Hyperparameters. We adopt the same hyperparameter configuration as in the original Biaffine model up to the BiLSTM layer for the head-dependent classifier, and the same configuration for the entire dependency label classifier. We perform grid search on the remaining hyperparameters and select the best hyperparameter configurations based on UAS on the development portion of the English-EWT data. Based on the results, we set the margin α = 3 for L_α, and the temperature τ = 1 for L_MLE and τ = 0.2 for L_CE. In our evaluation, we run experiments that involve the combination of any of L_α, L_CE, and L_MLE with L_MAE as an auxiliary loss, with a coefficient λ_1. We find the best values for λ_1 to be 0.2 for L_α, and 0.1 for both L_CE and L_MLE. We refer the reader to Appendix A for the full list of hyperparameters and training details.

Metric Spaces and Geometric Losses
We first verify whether the choice of metric space impacts performance. Table 1 reports UAS (and DSpr) on the development portion of the English-EWT corpus, for tree embeddings in both the (R^m, d_1) and (R^m, d_2^2) metric spaces (referred to as d_1 and d_2^2 for brevity). The first row shows results when training with L_MAE only. It learns an embedding that provides a good approximation of global tree distances using d_2^2 (DSpr: 0.89), similar to the findings reported by Hewitt and Manning (2019), but is suboptimal in terms of parsing (UAS: 90.28). On the other hand, all head-dependent losses achieve stronger parsing performance in both spaces, with the proper metric d_1 leading to higher UAS and DSpr scores across the board (including for L_MAE) when compared to d_2^2. In the rest of this section we therefore report results only with the d_1 metric, unless otherwise stated. Table 1 also shows a comparison between the different head-dependent losses, with and without the auxiliary loss L_MAE that further constrains φ to be globally isometric. In isolation, L_CE yields the best UAS while L_α yields the best DSpr; interpolating the losses with L_MAE improves results for L_MLE and L_CE, especially in terms of DSpr. This is in line with our expectation that the auxiliary loss encourages the parser to learn an embedding that more faithfully preserves global tree distances.
L_MLE seeks to correctly identify all head-dependent relations by maximizing the probability p_φ(w_j | w_i) of true head-dependent pairs, with no explicit constraints on global isometry; see Eqn. (14). We thus hypothesize that L_MLE can learn an embedding that produces good UAS, but is far from being isometric to the ground-truth tree. We verify this empirically: Figure 2(a) shows that the addition of L_MAE greatly regulates the embedding distances of the trees produced by L_MLE, hence improving the DSpr score for this loss. (We observe that the optimized L_MLE, without MAE, is lower than that produced by an isometric embedding on the development portion of English-EWT, indicating that the observed distortion of tree distances is an overfitting issue.) L_CE on the other hand seeks to match p_φ(w_j | w_i) with p_T(w_j | w_i), which encourages all embedding distances to be correlated with tree distances; see Eqn. (16). However, the p_T(w_j | w_i) term in this loss gives higher weights to word pairs that are closer in terms of ground-truth tree distance, and therefore the model is trained to focus more on preserving short distances. (The local emphasis is stronger with lower temperature. We chose temperature τ = 0.2 based on UAS on the development portion of English-EWT. However, we find a temperature of 1 to provide a better approximation of global distances, especially for tree distances less than five; see Appendix E and Figure 6.) Figure 2(b) confirms that the embedding obtained using L_CE underestimates the ground-truth tree distances, and adding the auxiliary MAE loss helps regulate this distortion of tree distances.
§ 2 describes the exact geometry of the embeddings that our model learns using the d_2^2 semi-metric in the special case that L_MAE is reduced to zero. In Figure 2(b), we observe small losses up to tree distance five when d_2^2 is used with L_CE and L_MAE. Therefore we expect the local geometry of d_2^2 trees to approximately follow Eqn. (3).
Overall, for both losses we observe that, on average, d_2^2 embeddings lead to predicted tree distances that have large variances, as well as medians further away from the ground truth, whereas the d_1 embeddings are more stable and accurate. This agrees with the observations in Table 1 that embeddings in d_1 have better DSpr than embeddings in d_2^2 for all losses we considered. The same comparison between d_φ and d_T for L_α is provided in Appendix E.

Comparison with Qi et al. (2018)
We compare the parsing results for the six geometric loss combinations against the biaffine parser of Qi et al. (2018) for all treebanks in Table 2. For a fair comparison, we re-run all experiments and report our results for the biaffine parser. We report UAS for models trained without the dependency label prediction loss, in order to focus on unlabeled tree structure, and LAS for completeness. In general, the parsing performance is stable across different languages for all geometric losses. Overall, L_CE+MAE is our best performing model, and its average UAS/LAS across languages are on par with the Biaffine parser in spite of only having a single representation for each token. Moreover, it achieves top performance on Czech, Hindi and Turkish. All other geometric losses also achieve competitive results that are within 1% of the Biaffine parser for both UAS and LAS. We also report the mean DSpr along with standard deviation across the 16 treebanks in Table 3. Unlike for UAS and LAS, we find a substantial difference in the DSpr score for the loss L_MLE. When combined with the auxiliary loss L_MAE we find a pronounced increase in DSpr; this agrees with the findings in § 5.1.

Parsing Errors w.r.t. Tree Relationships
Inspired by the geometric structure of tree embeddings, we investigate the sources of errors in terms of ground-truth dependency trees. Given a sentence s and a dependency tree T_s, we define the type of relation between a pair of tokens (w_i, w_j) by a 2-tuple consisting of the distance d_{T_s}(w_i, w_j) and the depth difference Δ_s(w_i, w_j) = d_{T_s}(w_0, w_i) − d_{T_s}(w_0, w_j). This definition follows naturally from the geometric interpretation of trees: for a node w_i, (1, −1) defines its children, (2, 0) defines its sisters, and (2, 2) defines its grandparents.
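The (distance, depth-difference) relation type can be computed directly from a parent map; a minimal sketch (node names hypothetical):

```python
# Classify a predicted head by its (tree distance, depth difference)
# relation to the dependent, following the 2-tuple definition above:
# (1, -1) child, (2, 0) sister, (2, 2) grandparent.
def classify(parent, dep, pred_head):
    def depth(v):
        d = 0
        while parent[v] is not None:
            v, d = parent[v], d + 1
        return d
    def ancestors(v):
        out = []
        while v is not None:
            out.append(v)
            v = parent[v]
        return out
    anc_dep, anc_head = ancestors(dep), ancestors(pred_head)
    lca = next(a for a in anc_dep if a in anc_head)  # lowest common ancestor
    dist = (depth(dep) - depth(lca)) + (depth(pred_head) - depth(lca))
    return dist, depth(dep) - depth(pred_head)

# Chain ROOT -> a -> b -> c, plus d, a sister of c (also a child of b).
parent = {"ROOT": None, "a": "ROOT", "b": "a", "c": "b", "d": "b"}
assert classify(parent, "c", "d") == (2, 0)  # sister
assert classify(parent, "c", "a") == (2, 2)  # grandparent
assert classify(parent, "c", "b") == (1, 1)  # true head
```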
To visualize the distribution of errors, for each trained model we plot the percentage of wrong edges for each relation type on the development set of English-EWT. We show an example plot in Figure 3 for our best model, L_CE+MAE, with results for other losses provided in Appendix F. Surprisingly, we do not identify long-distance ambiguities as a major source of errors (i.e., 99.8% of UAS errors have incorrectly assigned a head node that is within a tree distance of 5 from the dependent). Moreover, we find that sisters and grandparents account for 36.2% and 34.8% of all the UAS errors, respectively. To put these results into context, we construct a synthetic random error distribution: for each tree in the English-EWT development set we generate all trees with a single attachment error. We observe 81.6% of these errors to be local (up to distance 5), with sisters accounting for 22.5% and grandparents for only 5.4% of the errors. We further compare with the biaffine parser and find 99.5% of its UAS errors are local, with 40.2% sister and 32.7% grandparent errors. Therefore parsing errors for a trained parser are dominated by errors that are more local in terms of tree distance than expected from a uniform error distribution. One immediate question that arises is how we can reduce these specific high-frequency errors. One intuitive extension of the current work is to modify the formulation of the edge scores in Eqn. (12) to push the decision boundary away from sisters or grandparents during inference. By training to explicitly model the geometry of the tree, our approach is one step closer towards addressing specific high-frequency errors.

Conclusions
In this work, we propose to use the geometry of (dependency) tree structures to construct a neural dependency parser that improves the interpretability of the learned representations without compromising parsing performance. We propose several geometric loss functions, and show that for a majority of them, our simple network learns distance-preserving embeddings through end-to-end training. In doing so, we also compare the squared Euclidean distance (d_2^2) with the distance obtained from the ℓ1-norm (d_1) as the (semi-)metric in the embedding space R^m, and provide empirical evidence for using the proper d_1 metric. We compare our results with a competitive and widely-used graph parser (Qi et al., 2018) on 16 languages from different families, and show overall parser performance that is on par with it. Our experiments also suggest a new way of looking at the sources of parsing errors in terms of tree distances, and show that the majority of errors are local (e.g., sisters or grandparents).
For future work, we suggest looking at potential ways to correct such high-frequency head prediction errors, defined by their relationship within a tree. Another interesting direction worth exploring is to use the continuous tree distances predicted by our methods as features for downstream tasks, instead of the discrete tree structures produced by conventional parsers. As recent work has been exploring, this differentiable representation of tree structure is potentially useful within the iterative-refinement framework (Mohammadshahi and Henderson, 2020), or as additional tree-specific positional features in a transformer (Omote et al., 2019).

A Hyperparameters
We use 2 layers of MLP with leaky-ReLU activations to map the biLSTM outputs into an 800-dimensional embedding space. The layer weights are initialized with values from the uniform distribution U(−0.05, 0.05), and biases are initialized to zero. We train the models with Adam (Kingma and Ba, 2015) with an initial learning rate of 0.001, β_1 = 0.9, β_2 = 0.95, and ε = 1e−8, for up to 50,000 iterations, where each iteration is a batch of up to 5,000 tokens or the maximum number of tokens we can fit in the GPU memory. We evaluate the models every 100 steps and save them only if we see improvement in UAS on development data. We switch to AMSGrad (Reddi et al., 2018) after 3,000 iterations with no observed improvement on development set UAS, at which point we terminate training when another 3,000 iterations pass without improving development set UAS.

B Embedding Construction Example

Let us take as a working example the binary tree in Figure 4. The embedding of node v is the embedding of the root r plus the sum of all direction vectors on the path ρ(r, v), from r to v.
We can take f(r) to be 0 or any random vector; then the embeddings of all nodes in the tree are, by definition:

f(e) = f(r) + u_2 + u_5,
f(h) = f(r) + u_2 + u_6.

We took an arbitrary root embedding f(r) and n − 1 orthogonal unit-length vectors u_i. If the u_i's are the standard unit basis vectors, for example, then the tree is embedded on the edges of the unit cube and it is an isometric embedding for both d_2^2 and d_1. (Recall the 800-dimensional embedding space of Appendix A: the maximum length of any sentence in the dataset is smaller than 800, and thus isometric tree embeddings are feasible using both d_2^2 and d_1.)
C Proof of Thm. 2.1, Sufficiency

Proof (Eqn. (3) is sufficient). We first show that an embedding of the form given in Eqn. (3) necessarily satisfies (2). This is Thm. 1 in Reif et al. (2019), and here we find it useful to expand on their proof to introduce notation and assist the reader.
Let T = (V, E, r), Z and b(r, u) be as described in § 2. Then Eqn. (3) is simply

f(v) = f(r) + Σ_{i ∈ ρ(r,v)} z_i,

where the notation {i ∈ ρ(r, v)} is short for {i | e_i ∈ ρ(r, v)}, which is the set of i's for which b_i(r, v) = 1. Consider any two vertices v and w. Notice the two paths ρ(r, v) and ρ(r, w) must share a common prefix, namely ρ(r, a), the sub-path from r to the lowest common ancestor a of v and w, with the remaining paths ρ(a, v) and ρ(a, w) being edge disjoint. Therefore

d_2^2(f(v), f(w)) = || f(v) − f(w) ||_2^2 (27)
= || Σ_{i ∈ ρ(a,v)} z_i − Σ_{i ∈ ρ(a,w)} z_i ||_2^2 (28)
= |ρ(a, v)| + |ρ(a, w)| (29)
= |ρ(v, w)| ≡ d_T(v, w). (30)
Here we canceled the common term f(r) and the common prefix edges to derive Eqns. (27) and (28). Eqn. (29) follows from the fact that the paths ρ(a, v) and ρ(a, w) are edge disjoint, together with the orthonormality of the z_i's. Here |ρ(a, v)| denotes the number of edges on the a-to-v path. Finally, (30) follows since a is the lowest common ancestor of v and w.

D Proof of Thm. 2.1, Necessity
We provide a proof that any isometric d_2^2 embedding must have the form defined in (3); the proof relies only on linear algebra and may provide the reader with additional intuition about the construction of d_2^2 embeddings.
Proof. First, for the case |V| ≤ m + 1, we use induction to show that any isometric d_2^2 embedding f must have the form described in Eqn. (3). Then we show that for |V| > m + 1, such an f cannot exist.
We use induction to prove that if f : V → R^m is an isometric embedding of a tree with k ≤ m + 1 vertices, then (3) must hold and B must be full rank. Note this statement is trivially true for k = 1 and 2.
Let k ≥ 2 and k ≤ m, and let T = (V, E, r) be a tree of size |V| = k. Consider the induction hypothesis that any isometric embedding (using d_2^2) of a tree with at most k vertices must have the form described in (3). For the remaining case, consider a tree formed by adding a child c to some node p of a tree with m + 1 vertices; the embedding of the tree before adding this child (so |V| = m + 1) must have the form described in (3) with a full rank m × m path matrix B and orthonormal m × m matrix Z. The same line of reasoning then shows that z := f(c) − f(p) must satisfy (35). But here Z is full rank and so z = 0, which contradicts the constraint that the (c, p) edge in the embedding must have length 1.