4 and 7-bit Labeling for Projective and Non-Projective Dependency Trees



Introduction
Approaches that cast parsing as sequence labeling have gathered interest as they are simple, fast (Anderson and Gómez-Rodríguez, 2021), highly parallelizable (Amini and Cotterell, 2022) and produce outputs that are easy to feed to other tasks (Wang et al., 2019). Their main ingredient is the encoding that maps trees into sequences of one discrete label per word. Thus, various such encodings have been proposed both for constituency (Gómez-Rodríguez and Vilares, 2018; Amini and Cotterell, 2022) and dependency parsing (Strzyz et al., 2019; Lacroix, 2019; Gómez-Rodríguez et al., 2020).
Most such encodings have an unbounded label set, whose cardinality grows with sentence length. An exception for constituent parsing is tetratagging (Kitaev and Klein, 2020). For dependency parsing, to our knowledge, no bounded encodings were known. Simultaneously to this work, Amini et al. (2023) have just proposed one: hexatagging, where projective dependency trees are represented by tagging each word with one of a set of 8 tags. (The "hexa" in the name comes from a set of six atoms used to define the labels; however, the label for each word is composed of two such atoms, one from a set of two and the other from a set of four, so there are eight possible labels per word.)

Contribution. We present a bounded sequence-labeling encoding that represents any projective dependency tree with 4 bits (i.e., 16 distinct labels) per word. While this requires one more bit than hexatagging, it is arguably more straightforward, as the bits directly reflect properties of each node in the dependency tree without an intermediate constituent structure, as hexatagging requires. Also, it has a clear relation to existing bracketing encodings, and it has a straightforward non-projective extension using 7 bits with almost full non-projective coverage. Empirical results show that our encoding provides more accurate parsers than the existing unbounded bracketing encodings, which had the best previous results among sequence-labeling encodings, although it underperforms hexatagging.

Projective Encoding
Let T_n be a set of unlabeled dependency trees for sentences of length n. A sequence-labeling encoding defines a function Φ_n : T_n → L^n, for a label set L. Thus, each tree for a sentence w_1 ... w_n is encoded as a sequence of labels, l_1 ... l_n, that assigns a label l_i ∈ L to each word w_i. We define the 4-bit projective encoding as an encoding where T_n is the set of projective dependency trees, and we assign to each word w_i a label l_i = b0 b1 b2 b3, where each bit bj is a boolean defined as follows:

• b0 is true if w_i is a right dependent, and false if it is a left dependent. Root nodes are considered right dependents for this purpose (i.e., we assume that they are linked as dependents of a dummy root node w_0 located to the left).
• b1 is true iff w_i is the outermost right (or left) dependent of its parent node.

• b2 (respectively, b3) is true iff w_i has one or more left (right) dependents.

All combinations of the four bits are possible, so we have 16 possible labels.
For easier visualization and comparison to existing bracketing encodings, we will represent the values of b0 as > (right dependent) or < (left dependent), b1 as * (true) or blank (false), and b2 and b3 respectively as \ and / (true) or blank (false). We will use these representations with set notation to make claims about a label's bits, e.g., >* ∈ l means that label l has b0 = 1 and b1 = 1. Figure 1 shows a sample tree encoded with this method.
We will now show how to encode and decode trees, and prove that the encoding is a total, injective map from projective trees to label sequences.
Encoding and Totality. Encoding a tree is trivial: one just needs to traverse each word and apply the definition of each bit to obtain the label. This also means that our encoding from trees to labels is a total function, as the labels are well defined for any dependency tree (and thus, for any projective tree).
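As an illustration, this per-word bit extraction can be sketched in Python (a minimal, unoptimized sketch of the definition above; the head-array input format and the function name are our own choices, not taken from the paper's implementation):

```python
def encode_4bit(head):
    """Sketch of the 4-bit encoding for a projective tree.

    head[i-1] is the parent of word i (1-based words; parent 0 is the
    dummy root, treated as sitting to the left of the sentence).
    Returns one (b0, b1, b2, b3) tuple per word.
    """
    n = len(head)
    labels = []
    for i in range(1, n + 1):
        h = head[i - 1]
        # all dependents of word i's parent, then those on i's side of it
        deps_of_h = [j for j in range(1, n + 1) if head[j - 1] == h]
        same_side = [j for j in deps_of_h if (j > h) == (i > h)]
        b0 = h < i                                     # right dependent?
        b1 = i == (max(same_side) if i > h else min(same_side))  # outermost?
        b2 = any(head[j - 1] == i for j in range(1, i))          # left deps?
        b3 = any(head[j - 1] == i for j in range(i + 1, n + 1))  # right deps?
        labels.append((b0, b1, b2, b3))
    return labels
```

For instance, in the chain where each word depends on the previous one (word 1 on the dummy root), every word is the outermost right dependent of its head, and all but the last have a right dependent of their own.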
Decoding and Injectivity. Assuming a well-formed sequence of labels, we can decode it to a tree. We can partition the arcs of any tree t ∈ T_n into a subset of left arcs, t_l, and a subset of right arcs, t_r. We will decode these subsets separately. Algorithm 1 shows how to obtain the arcs of t_r.
The idea of the algorithm is as follows: we read labels from left to right. When we find a label containing /, we know that the corresponding node will be the source of one or more right arcs, so we push it onto the stack. When we find a label with >, we know that its node is the target of a right arc, so we link it to the / on top of the stack. Additionally, if the label contains *, the node is a rightmost sibling, so we pop the stack because no more arcs will be created from the same head. Otherwise, we do not pop, as we expect more arcs from the same origin. Intuitively, this lets us generate all the possible non-crossing combinations of right arcs: the stack enforces projectivity (to cover a / label with a dependency we need to remove it from the stack, so crossing arcs from inside the covering dependency to its right are not allowed), and the distinction between > with and without * allows us to link a new node to any of the previous, non-covered nodes.
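The right-arc pass can be sketched as follows (our reading of Algorithm 1; labels are given as (b0, b1, b2, b3) tuples as in the definitions above, and the stack starts with the dummy root 0):

```python
def decode_right_arcs(labels):
    """Sketch of the right-arc decoding pass (Algorithm 1).

    labels: one (b0, b1, b2, b3) tuple per word (1-based positions).
    Returns the right arcs as (head, dependent) pairs; head 0 is the
    dummy root, whose outgoing arc marks the syntactic root.
    """
    arcs = []
    stack = [0]                      # initialized with the dummy root
    for i, (b0, b1, b2, b3) in enumerate(labels, start=1):
        if b0:                       # '>' : word i is a right dependent
            arcs.append((stack[-1], i))
            if b1:                   # '*' : outermost sibling, head is done
                stack.pop()
        if b3:                       # '/' : word i expects right dependents
            stack.append(i)
    return arcs
```

Note that a word pushes itself only after taking its own head from the stack, since its right dependents can only appear later in the sentence.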
To decode left arcs, we use a symmetric algorithm DecodeLeftArcs (not shown, as it is analogous), which traverses the labels from right to left, operating on the elements \ and < rather than / and >, with the difference that the stack is not initialized with the dummy root node (as the arc originating in it is a right arc). By the same reasoning as above, this algorithm can obtain all the possible non-crossing configurations of left arcs, and hence the mapping is injective. The decoding is trivially linear-time with respect to sequence length.
A sketch of an injectivity proof can be based on showing that the set of right arcs generated by Algorithm 1 (and the analogous set of left arcs) is the only possible one that meets the conditions of the labels and does not have crossing arcs (hence, we cannot have two projective trees with the same encoding). To prove this, we can show that at each iteration, the arc added by line 7 of Algorithm 1 is the only possible alternative that can lead to a legal projective tree (i.e., that s.peek() is the only possible parent of node i). This is true because (1) if we choose a parent to the left of s.peek(), then we cover s.peek() with a dependency while it has not yet found all of its right dependents (as otherwise it would have been popped from the stack), so a crossing arc will be generated later; (2) if we choose a parent to the right of s.peek() and to the left of i, its label must contain / (otherwise, by definition, it could not have right dependents) and not be on the stack (as the stack is always ordered from left to right), so it must have been removed from the stack due to finding all its right dependents, and adding one more would violate the conditions of the encoding; and finally (3) a parent to the right of i cannot be chosen, as the algorithm is only considering right arcs. Together with the analogous proof for the symmetric algorithm, this shows injectivity.
Coverage. While we have defined and proved this encoding for projective trees, its coverage is actually larger: it can encode any dependency forest (i.e., it does not require connectedness) such that arcs in the same direction do not cross (i.e., it can handle some non-projective structures where arcs only cross in opposite directions, as the process of encoding and decoding left and right arcs is independent). This is just like in the unbounded bracketing encodings of Strzyz et al. (2019), but this extra coverage is not very large in practice, and we will define a better non-projective extension later.
Non-surjectivity. Just like other sequence-labeling encodings (Strzyz et al., 2019; Lacroix, 2019; Strzyz et al., 2020, inter alia), ours is not surjective: not every label sequence corresponds to a valid tree, so heuristics are needed to fix cases where the sequence labeling component generates an invalid sequence. This can happen regardless of whether we only consider a tree to be valid if it is projective, or we accept the extra coverage mentioned above. For example, a sequence where the last word is marked as a left child (<) is invalid in either case. Trying to decode invalid label sequences will result in trying to pop an empty stack or leaving material in the stack after finishing Algorithm 1 or its symmetric counterpart. In practice, we can skip dependency creation when the stack is empty, ignore material left in the stack after decoding, break cycles and (if we require connectedness) attach any unconnected nodes to a neighbor.

(Note on the dummy root: if we did not consider a dummy root, we would be able to cover planar trees, rather than just projective trees (as in the bracketing of Strzyz et al., 2019), but this would require an extra label for the sentence's syntactic root. Instead, we use a dummy root on the left and explicitly encode the arc from it to the syntactic root, which is thus labeled as a right child instead of using an extra label. This simplifies the encoding, and the practical difference between the coverage of projectivity and planarity is small (Gómez-Rodríguez and Nivre, 2013).)
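A guarded variant of the right-arc pass illustrates these repair heuristics (a sketch of one possible implementation; the exact repair strategy used in the experiments may differ):

```python
def decode_right_arcs_robust(labels):
    """Right-arc pass with two of the heuristics mentioned above:
    skip arc creation when the stack is empty, and silently discard
    any unmatched material left on the stack at the end.

    labels: one (b0, b1, b2, b3) tuple per word (1-based positions).
    """
    arcs = []
    stack = [0]                      # dummy root
    for i, (b0, b1, b2, b3) in enumerate(labels, start=1):
        if b0:
            if stack:                # heuristic: no head available, skip
                arcs.append((stack[-1], i))
                if b1:
                    stack.pop()
        if b3:
            stack.append(i)
    # heuristic: leftover stack entries are simply ignored
    return arcs
```

For example, a sequence in which two consecutive words are both marked >* with no open / elements yields only the first arc; the second word would be left for a later attachment heuristic.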

Non-Projective Encoding
For a wider coverage of non-projective dependency trees (including the overwhelming majority of trees found in treebanks), we use the same technique as defined for unbounded brackets by Strzyz et al. (2020): we partition dependency trees into two subsets (planes) of arcs (details in Appendix D), and this lets us define a 7-bit non-projective encoding by assigning each word w_i a label l_i = b0 ... b6, where:

• b0 b1 can take the values <0 (w_i is a left dependent in the first plane), >0 (right dependent in the first plane), <1 or >1 (the same for the second plane).
• b2 is true iff w_i is the outermost right (or left) dependent of its parent (regardless of plane). We represent it as * if true, or blank if false.
• b3 (respectively, b4) is true iff w_i has one or more left (right) dependents in the first plane. We denote it as \0 (/0) if true, blank if false.
• b5 and b6 are analogous to b3 and b4, but in the second plane, represented as \1 or /1.
Every 7-bit combination is possible, leading to 128 distinct labels. Figure 2 shows an example of a non-projective tree represented with this encoding.
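The label definition can be sketched as follows, assuming the plane of each word's incoming arc is already given (see Appendix D for plane assignment). Rendering b0 b1 as two booleans (direction, second plane) and the helper names are our own choices:

```python
def encode_7bit(head, plane):
    """Sketch of the 7-bit encoding.

    head[i-1]: parent of word i (0 = dummy root, to the left).
    plane[i-1]: plane (0 or 1) of the arc entering word i.
    Returns one 7-tuple of booleans per word:
    (right_dep, second_plane, outermost, left0, right0, left1, right1).
    """
    n = len(head)
    labels = []
    for i in range(1, n + 1):
        h, p = head[i - 1], plane[i - 1]
        deps_of_h = [j for j in range(1, n + 1) if head[j - 1] == h]
        same_side = [j for j in deps_of_h if (j > h) == (i > h)]
        b0 = h < i                          # direction of incoming arc
        b1 = p == 1                         # plane of incoming arc
        b2 = i == (max(same_side) if i > h else min(same_side))

        def has(side, q):
            # does word i have a dependent on this side, in plane q?
            rng = range(1, i) if side == 'l' else range(i + 1, n + 1)
            return any(head[j - 1] == i and plane[j - 1] == q for j in rng)

        labels.append((b0, b1, b2,
                       has('l', 0), has('r', 0),   # b3, b4 (first plane)
                       has('l', 1), has('r', 1)))  # b5, b6 (second plane)
    return labels
```

When every arc lies in the first plane, b1, b5 and b6 are all false and the remaining bits coincide with the 4-bit encoding.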
The encoding is able to cover every possible dependency tree whose arc set can be partitioned into two subsets (planes), such that arcs with the same direction and plane do not cross.
This immediately follows from defining the decoding with a set of four algorithms: two for decoding left and right arcs on the first plane (defined as Algorithm 1 and its symmetric counterpart, but considering only the symbols making reference to arcs in the first plane), and two identical decoding passes for the second plane. With this, injectivity is shown in the same way as for the 4-bit encoding. Decoding is still linear-time.
Note that the set of trees covered by the encoding, described above, is a variant of the set of 2-Planar trees (Yli-Jyrä, 2003; Gómez-Rodríguez and Nivre, 2010), which are trees that can be split into two planes such that arcs within the same plane do not cross, regardless of direction. Compared to 2-Planar trees, and just like the encodings of Strzyz et al. (2020), our set is extended, as it allows arcs with opposite directions to cross within the same plane. However, it also loses some trees, because the dummy root arc is also counted when restricting crossings, whereas in 2-Planar trees it is ignored.

Experiments
We compare our 4-bit and 7-bit encodings to their unbounded analogs, the bracketing (Strzyz et al., 2019) and 2-planar bracketing (Strzyz et al., 2020) encodings, which overall are the best performing in previous work (Muñoz-Ortiz et al., 2021). We use MaChAmp (van der Goot et al., 2021) as a sequence labeling library, with default hyperparameters (Appendix B). We use XLM-RoBERTa (Conneau et al., 2020) followed by two separate one-layered feed-forward networks, one for syntactic labels and another for dependency types. We evaluate on the Penn Treebank, with the Stanford Dependencies 3.3.0 conversion, and on UD 2.9: a set of 9 linguistically-diverse treebanks taken from Anderson and Gómez-Rodríguez (2020), and a low-resource set of 7 (Anderson et al., 2021). We consider multiple subsets of treebanks, as a single subset could be fragile (Alonso-Alonso et al., 2022).
Table 1 compares the compactness of the encodings by showing the number of unique syntactic labels needed to encode the (unlabeled) trees in the training set (i.e., the label set of the first task). The new encodings yield clearly smaller label set sizes, as predicted in theory. In particular, the 4-bit encoding always uses its 16 distinct labels. The 7-bit encoding only needs its theoretical maximum of 128 labels for the Ancient Greek treebank (the most non-projective one). On average, it uses around a third as many labels as the 2-planar bracketing encoding, and half as many as the basic bracketing. Regarding coverage, the 7-bit encoding covers over 99.9% of arcs, like the 2-planar bracketing.
The 4-bit encoding has lower coverage than basic brackets: both cover all projective trees, but they differ in their coverage of non-projectivity (see Appendix C for an explanation of the reasons). More detailed data (e.g., coverage and label set size for low-resource treebanks) is in Appendix A.
Table 2 shows the models' performance in terms of LAS. The 4-bit encoding has mixed performance, excelling in highly projective treebanks like the PTB or Hebrew-HTB, but falling behind in non-projective ones like Ancient Greek, which is consistent with its lower non-projective coverage. The 7-bit encoding, however, does not exhibit this problem (given the almost total arc coverage mentioned above), and it outperforms both baselines for every treebank: the basic bracketing by 1.16 and the 2-planar one by 1.96 LAS points on average. If we focus on low-resource corpora (Table 3), label set sparsity is especially relevant, so compactness further boosts accuracy. The new encodings obtain large improvements, the 7-bit one surpassing the best baseline by over 3 average LAS points.

Additional results: splitting bits and external parsers
We perform additional experiments to test implementation variants of our encodings, as well as to put our results into context with respect to non-sequence-labeling parsers and simultaneous work.
In the previous tables, both for the 4-bit and 7-bit experiments, all bits were predicted as a single, atomic task. We contrast this with a multi-task version where we split certain groups of bits to be predicted separately. We only explore a preliminary division of bits. For the 4-bit encoding, instead of predicting a label of the form b0 b1 b2 b3, the model predicts two labels of the form b0 b1 and b2 b3, respectively. We call this method 4-bit-s. For the 7-bit encoding, we predict the bits corresponding to each plane as a separate task, i.e., b0 b2 b3 b4 and b1 b5 b6. We call this method 7-bit-s. We acknowledge that other divisions could be better; however, exploring them falls outside the scope of this paper.
We additionally compare our results with other relevant models. As mentioned earlier, alongside this work, Amini et al. (2023) introduced a parsing-as-tagging method called hexatagging. In what follows, we abbreviate this method as 6tg. We implement 6tg under the same framework as our encodings for a homogeneous comparison, and we predict these hexatags through two separate linear layers, one to predict the arc representation and another for the dependency type. We also consider a split version, 6tg-s, where the two components of the arc representation are predicted separately. For a better understanding of their method, we refer the reader to Amini et al. (2023) and Appendix E. Finally, we include a comparison against the biaffine graph-based parser by Dozat et al. (2017). For this, we trained the implementation in SuPar using xlm-roberta-large as the encoder, which is often taken as a strong upper-bound baseline.
Table 4 compares the performance of external parsers with our bit encodings. First, the results show that the choice of whether to split labels into components has a considerable influence, both for 6tg (where splitting is harmful across the board) and for our encodings (where it is mostly beneficial, perhaps because the structure of the encoding in bits with independent meanings naturally lends itself to multi-task learning). Second, on average, the best (multi-task) version of our 7-bit encoding is about 1.7 LAS points behind 6tg and 1.2 behind the biaffine state-of-the-art parser. However, the difference between versions with and without multi-task learning suggests that there might be room for improvement by investigating different splitting techniques. Additionally, in Appendix F, Table 14 compares the processing speeds of these parsers (on a single CPU). In Appendix G, Tables 15 and 16 show how often heuristics are applied in decoding. Finally, Table 5 shows the external comparison on the low-resource treebanks, where our encodings lag further behind biaffine and especially 6tg, which surpasses 7-bit-s by over 5 points.

Conclusion
We have presented two new bracketing encodings for dependency parsing as sequence labeling, which use a bounded number of labels.The 4-bit encoding, designed for projective trees, excels in projective treebanks and low-resource setups.The 7-bit encoding, designed to accommodate non-projectivity, clearly outperforms the best prior sequence-labeling encodings across a diverse set of treebanks.The source code is available at https://github.com/Polifack/CoDeLin/releases/tag/1.25.

Limitations
In our experiments, we do not perform any hyperparameter optimization or other task-specific tweaks to try to bring the raw accuracy figures as close as possible to state of the art.This is for several reasons: (1) limited resources, (2) the paper having a mainly theoretical focus, with the experiments serving to demonstrate that our encodings are useful when compared to alternatives (the baselines) rather than chasing state-of-the-art accuracy, and (3) because we believe that one of the primary advantages of parsing as sequence labeling is its ease of use for practitioners, as one can perform parsing with any off-the-shelf sequence labeling library, and our results directly reflect this kind of usage.We note that, even under such a setup, raw accuracies are remarkably good.

Ethics Statement
This is a primarily theoretical paper that presents new encodings for the well-known task of dependency parsing.We conduct experiments with the sole purpose of evaluating the new encodings, and we use publicly-available standard datasets that have long been in wide use among the NLP community.Hence, we do not think this paper raises any ethical concern.

A Further Data
Tables 6 and 7 show treebank statistics for the general and low-resource sets of treebanks, respectively. Table 8 shows the number of labels and the arc coverage of each considered encoding for the low-resource treebank set of Anderson et al. (2021). Tables 9 and 10 show the coverage of the encodings in terms of full trees, rather than arcs (i.e., what percentage of the dependency trees in each treebank can be fully encoded and decoded back by each of the encodings). Tables 11 and 12 show the total number of labels needed to encode the training set for each encoding and treebank, when considering full labels (i.e., the number of combinations of syntactic labels and dependency type labels). This can be relevant for implementations that generate such combinations as atomic labels (in our implementation, label components are generated separately instead).

B Hyperparameters
We did not perform hyperparameter search, but just used MaChAmp's defaults, which can be seen in Table 13.

C Coverage Differences
It is worth noting that, while the 7-bit encoding has exactly the same coverage as the 2-planar bracketing encoding (see Tables 1, 8, 9 and 10), the 4-bit encoding has lower coverage than the basic bracketing. As mentioned in the main text, both have full coverage of projective trees, but there are subtle differences in how they behave when they are applied to non-projective trees. We did not enumerate these differences in the main text for space reasons.
In particular, they are the following:

• Contrary to the basic bracketing, the 4-bit encoding needs to encode the arc originating from the dummy root explicitly. This means that it cannot encode non-projective but planar trees where the dummy root arc crosses a right arc (or equivalently, where the syntactic root is covered by a right arc).
• In the basic bracketing, a dependency involving words w_i and w_j (i < j) is not encoded in the labels of w_i and w_j, but in the labels of w_{i+1} and w_j (see Strzyz et al., 2019), as a technique to alleviate sparsity (in the particular case of that encoding, it guarantees that the worst-case number of labels is linear, rather than quadratic, with respect to sentence length). In the 2-planar, 4-bit and 7-bit encodings, this is unneeded, so dependencies are encoded directly in the labels of the intervening words.
• Contrary to the basic bracketing, in the 4-bit encoding a single / or \ element is shared by several arcs. Thus, if an arc cannot be successfully encoded due to unsupported non-projectivity, the problem can propagate to sibling dependencies. In other words, due to being more compact, the 4-bit encoding has less redundancy than the basic bracketing.

D Plane Assignment
The 2-planar and 7-bit encodings need a strategy to partition trees into two planes. We used the second-plane-averse strategy based on restriction propagation on the crossings graph (Strzyz et al., 2020). It can be summarized as follows:

1. The crossings graph is defined as an undirected graph where each node corresponds to an arc in the dependency tree, and there is an edge between nodes a and b if arc a crosses arc b in the dependency tree.
2. Initially, both planes are marked as allowed for every arc in the dependency tree.
3. The arcs are visited in the order of their right endpoint, moving from left to right. Priority is given to shorter arcs if they have a common right endpoint. Once sorted, we iterate through the arcs.
4. Whenever we assign an arc a to a given plane p, we immediately propagate restrictions in the following way: we forbid plane p for the arcs that cross a (its neighbors in the crossings graph), we forbid the other plane (p ′ ) for the neighbors of its neighbors, plane p for the neighbors of those, and so on.
5. Plane assignment is made by traversing the arcs in this order. For each new arc a, we look at the restrictions and assign it to the first plane if allowed, otherwise to the second plane if allowed, and finally to no plane if neither is allowed (for non-2-planar structures).
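The steps above can be sketched as follows (our reading of the strategy; in particular, arcs reachable at several propagation depths are handled with a simple visited set, which is a simplification):

```python
def crosses(a, b):
    """Two arcs cross iff exactly one endpoint of b lies strictly inside a."""
    l1, r1 = min(a), max(a)
    l2, r2 = min(b), max(b)
    return l1 < l2 < r1 < r2 or l2 < l1 < r2 < r1


def assign_planes(arcs):
    """Second-plane-averse plane assignment (sketch).

    arcs: list of (head, dep) pairs. Returns a dict arc -> 0, 1 or None
    (None marks leftover arcs of non-2-planar structures).
    """
    # step 1: crossings graph, one node per arc
    neigh = {a: [b for b in arcs if b != a and crosses(a, b)] for a in arcs}
    # step 2: both planes initially allowed for every arc
    allowed = {a: {0, 1} for a in arcs}
    # step 3: sort by right endpoint, shorter arcs first on ties
    order = sorted(arcs, key=lambda a: (max(a), max(a) - min(a)))
    plane = {}
    for a in order:
        # step 5: first allowed plane, else second, else none
        p = 0 if 0 in allowed[a] else (1 if 1 in allowed[a] else None)
        plane[a] = p
        if p is None:
            continue
        # step 4: propagate alternating restrictions outward from a
        frontier, forbidden, seen = [a], p, {a}
        while frontier:
            nxt = []
            for x in frontier:
                for y in neigh[x]:
                    if y not in seen:
                        seen.add(y)
                        allowed[y].discard(forbidden)
                        nxt.append(y)
            frontier, forbidden = nxt, 1 - forbidden
    return plane
```

For example, two crossing arcs such as (1, 3) and (2, 4) end up in different planes, while non-crossing arcs all stay in the first plane.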

E Hexatagging
Amini et al. (2023) use an intermediate representation, called binary head trees, that acts as a proxy between dependency trees and hexatags. These trees have a structure akin to binary constituent trees, in order to apply the tetra-tagging encoding (Kitaev and Klein, 2020). In addition, non-terminal intermediate nodes are labeled with 'L' or 'R' based on whether the head of the constituent is in its left or right subtree. We direct the reader to the paper for specifics; however, a mapping between projective dependency trees and this structure can be achieved by starting at the sentence's root and conducting a depth-first traversal of the tree. The arc representation components for each hexatag encode: (i) the original label corresponding to the tetra-tag, and (ii) the value of the non-terminal symbol in the binary head tree.

F Speed comparison
Table 14 compares the speed of the models over an execution on a single CPU. It is important to note that while SuPar is an optimized parser, in this context we used MaChAmp as a general sequence labeling framework without specific optimization for speed. With a more optimized model, practical processing speeds in the range of 100 sentences per second on CPU, or 1000 on a consumer-grade GPU, should be achievable (cf. the figures for sequence-labeling parsing implementations in Anderson and Gómez-Rodríguez, 2021).

G Non-Surjectivity in Decoding
As mentioned in the main text, all encodings explored in this paper are non-surjective, meaning that there are label sequences that do not correspond to a valid tree. In these cases, the label sequences are repaired with the decoding heuristics described in the main text; Tables 15 and 16 report how often these heuristics are applied.
Figure 2: A non-projective tree and its 7-bit encoding.

Table 1 :
Number of labels (L) and coverage (C) for each treebank and encoding.B and B-2P are the baselines.

Table 2 :
LAS for the linguistically-diverse test sets

Table 3 :
LAS for the low-resource test sets

Table 4 :
LAS comparison against related parsers, for the linguistically-diverse test sets.

Table 5 :
LAS comparison against related parsers, for the low-resource test sets.

Table 6 :
Statistics for the linguistically-diverse set of treebanks: percentage of projective trees, 1-planar trees, percentage of rightward arcs (r arcs), and average dependency distance (avg d).

Table 7 :
Statistics for the low-resource set of treebanks: percentage of projective trees, 1-planar trees, percentage of rightward arcs (r arcs), and average dependency distance (avg d).
Table 8 uses the same notation as Table 1. As can be seen in the table, the trends are analogous to those for the other treebanks (Table 1 in the main text).

Table 9 :
Full tree coverage for each encoding on the linguistically-diverse set of treebanks.

Table 10 :
Full tree coverage for each encoding on the low-resource set of treebanks.

Table 12 :
Unique labels generated when encoding the training sets of the low-resource set of treebanks, including dependency types as a component of the labels.