A Unified Encoding of Structures in Transition Systems

Transition systems usually contain various dynamic structures (e.g., stacks, buffers). An ideal transition-based model should encode these structures completely and efficiently. Previous works relying on templates or specialized neural network structures either encode only partial structure information or suffer from poor computational efficiency. In this paper, we propose a novel attention-based encoder that unifies the representation of all structures in a transition system. Specifically, we separate two views of the items on structures, namely the structure-invariant view and the structure-dependent view. With the help of parallel-friendly attention networks, we are able to encode transition states with O(1) additional complexity (with respect to basic feature extractors). Experiments on the PTB and UD show that our proposed method significantly improves the test speed, achieves the best transition-based results, and is comparable to state-of-the-art methods.


Introduction
Transition systems have been successfully applied in many fields of NLP, especially parsing (dependency parsing (Nivre, 2008), constituent parsing (Watanabe and Sumita, 2015), and semantic parsing (Yin and Neubig, 2018)). Basically, a transition system takes a series of actions which attach or detach items (e.g., sentence words, intermediate outputs) to or from structures (e.g., stacks, buffers, partial trees). Given a set of action sequences, a classifier is trained to predict the next action given the current configuration of structures in the transition system. The performance of the final system strongly depends on how well the classifier encodes those configurations.
Ideally, a good configuration encoder should encode transition system structures completely and efficiently. However, challenges appear when we try to have our cake and eat it too. For example, traditional template-based methods (Chen and Manning, 2014) are fast, but only encode partial information of the structures (e.g., a few top items on the stacks and buffers). Structure-based networks (e.g., Stack-RNN (Dyer et al., 2015)) rely on carefully designed network architectures to get a full encoding of the structures (actually, they still miss some off-structure information; see our discussion in Section 4.2), but they are usually slow (e.g., not easy to batch). Furthermore, different structures are updated in different ways (stacks are last-in-first-out, buffers are first-in-first-out), so it also takes effort to design different encoders and ways of fusing them.

* This work was conducted when Tao Ji was interning at Alibaba DAMO Academy.
1 https://github.com/AntNLP/trans-dep-parser
In this work, we aim to provide a unified encoder for different transition system structures. Instead of inspecting different structures individually, we unify the encoding by inspecting the items in each structure, which are the ultimate targets of any structure encoder. One key observation is that every item has two views: a structure-invariant view, which is unchanged when the item is placed on different structures, and a structure-dependent view, which reflects which part of which structure the item currently occupies. For example, when a word w (item) is on the buffer (structure), its structure-invariant view could contain its lexical form and part-of-speech tag, while its structure-dependent view indicates that w is now sitting on the buffer at distance p from the buffer head. When w is detached from the buffer and attached to the stack, its structure-dependent view will switch to "sitting on the stack", while its structure-invariant view stays unchanged. A unified structure encoder thus suffices to uniformly encode both views.
For the structure-invariant view, we share it among different structures, so it is automatically unified. For the structure-dependent view, we propose a simple yet powerful encoder. It assigns each structure a set of indicating vectors (structure indicators), where each indicator specifies a certain part of that structure. For example, we use indicators (vectors) to express "on top of the stack", "the second position of the buffer", and "index of head words in partial trees". To encode an item, we only need to concatenate its structure-invariant encoding with the indicators corresponding to its position in that structure.
Regarding completeness and efficiency, we find that with structure indicators it is relatively easy to encode a structure completely: one only needs to decompose the structure into identifiable subparts. In fact, we can use them to track parts of structures that are not revealed in previous work (e.g., words that have been popped off the stack). The encoder runs in the same manner as template-based models, so decoding efficiency is guaranteed. We also note that structure indicators differ from existing ways of including structure information in neural network models (Shaw et al., 2018; Shiv and Quirk, 2019): they encode dynamic structures (which change as the transition system runs) rather than static structures (e.g., fixed parse trees).
We can easily implement the unified structure encoding with existing multi-head attention networks (MHA, (Vaswani et al., 2017)). It is also easy to fuse encodings of different structures with multi-layer MHA. We conduct experiments on the English Penn Treebank 3.0 and Universal Dependencies v2.2, showing that the unified structure encoder helps us achieve a state-of-the-art transition-based parser (even competitive with the best graph-based parser), while retaining fast training and testing speed.

Transition Systems
We briefly review transition-based dependency parsing. Given a sentence x = root_0, w_1, ..., w_n (root_0 is a synthetic word) and a relation set R, we denote a dependency tree for x as {(i, j, r)}, where (i, j, r) represents a dependency relation r ∈ R between w_i (head) and w_j (dependent).
Figure 2: A running example of arc-hybrid transition-based parsing. The gold tree is constructed after performing 8 correct actions. We use the grey row as an example for the structural indicator.

A transition system is a sound quadruple S = (C, A, c_x, C_t), where C is a set of configurations, A is the set of actions, c_x is an initialization function mapping x to a unique initial configuration, and C_t ⊆ C is a set of terminal configurations. Given a configuration c ∈ C, a transition-based parser aims to predict a correct action a ∈ A and move to a new configuration. We specifically describe the arc-hybrid system (Kuhlmann et al., 2011). In this system, each configuration c = (σ|i, j|β, T) aggregates information from three structures, namely a stack (σ, where σ|i denotes that the stack top is w_i), a buffer (β, where j|β denotes that the buffer front is w_j), and a partial tree (T). The arc-hybrid system has three actions (a running example is shown in Figure 2): sh moves the front item of the buffer (w_j) to the top of the stack; la_r removes the top item of the stack (w_i), attaches it as a dependent to w_j with label r, and adds a left-arc (j, i, r) to the partial tree; ra_r removes the top of the stack (w_j), attaches it as a dependent to w_i (the item below it on the stack) with label r, and adds a right-arc (i, j, r) to the partial tree. We note that all actions actually attach or detach an item to or from a structure, where an item can be a word (in the stack and buffer) or an edge (in the partial tree). Note that besides the structures in the configurations, we can also incorporate other structures to help learn the action predictor. For example, we can consider the history action list, which contains all previous actions in sequential order. In this case, an item in this action list is an action label.
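To make these operations concrete, below is a minimal Python sketch of an arc-hybrid configuration and its three actions; the class and attribute names are illustrative rather than taken from the released code.

```python
class ArcHybridConfig:
    """Minimal sketch of an arc-hybrid configuration (illustrative, not the paper's code)."""

    def __init__(self, n_words):
        self.stack = []                      # sigma: word indices, top is the last element
        self.buffer = list(range(n_words))   # beta: index 0 is root_0, front is the first element
        self.arcs = []                       # partial tree T: list of (head, dependent, relation)
        self.history = []                    # action list alpha

    def sh(self):
        # Move the buffer front to the top of the stack.
        self.stack.append(self.buffer.pop(0))
        self.history.append("sh")

    def la(self, rel):
        # Attach the stack top as a dependent of the buffer front: add left-arc (j, i, rel).
        i = self.stack.pop()
        j = self.buffer[0]
        self.arcs.append((j, i, rel))
        self.history.append(("la", rel))

    def ra(self, rel):
        # Attach the stack top as a dependent of the item below it: add right-arc (i, j, rel).
        j = self.stack.pop()
        i = self.stack[-1]
        self.arcs.append((i, j, rel))
        self.history.append(("ra", rel))
```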

Two Views of an Item
We can see that, while its "content" remains the same, an item may appear in different structures of a transition system's configurations. To uniformly encode an item (and thus the structures containing it), we can decouple the encoding of content and structure, then combine them in a unified way. This simple idea also lets us design a unified structure encoder that makes the whole transition system model concise and efficient.
The structure-invariant view typically captures the lexical (shallow) form of an item. For example, in the arc-hybrid system, items in the stack and buffer have words as their structure-invariant view, and items in the action list have actions as their structure-invariant view. This view is shared when the item moves from one structure to another, and we only need to encode it once (e.g., no matter whether a word appears in the stack or the buffer, its structure-invariant representation is identical). We describe how to encode this view in Section 4.3.
The more interesting problem is how to characterize the structure-dependent view. We would like a unified strategy to represent these structures. Our major tool is structure indicators, which are basically a set of vectors bound to each structure. Take the stack for example. We use vectors to indicate "the top of the stack" (we name this vector "1_σ") and "the second item from the stack top" (named "2_σ"). Vector "0_σ" indicates items that have not yet been on the stack. Different from previous work, we can also represent "the previous stack top, which has been popped" (a vector named "−1_σ") and "the stack top before that" (named "−2_σ"). That is, for different parts of the structure, we employ vectors to indicate them.
Similarly, for the buffer we have another set of structure indicators {1_β, −1_β, 2_β, −2_β, ...}, where a positive number indicates the position in the buffer and a negative number indicates the number of time steps since the item was removed from the buffer. For the partial tree with dependency relations, we decompose it into two structures, the tree arcs (T_arc) and the dependency relations (T_rel). A set of T_arc indicators {0_Tarc, 1_Tarc, −1_Tarc, ...} indicates the position from an item to its head word, and a set of T_rel indicators indicates the IDs of dependency relations. Vectors "0_Tarc" and "0_Trel" indicate that the corresponding dependency edge is not yet in the partial tree. For the action list we have a set of structure indicators {1_a, 2_a, ...}, where each vector indicates the position in the list.

Figure 3: An instance of structure indicators after the 4th step in Figure 2. Grey rows indicate structure-invariant parts (σ, β, T_arc and T_rel are shared), and other rows indicate structure-dependent parts. To simplify, we express the relation nsubj by vector 1_Trel.
In Figure 3, we show the two views of an instance at the 4th step in Figure 2. We can observe that the five different structures mentioned above now have a unified form.
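As a concrete illustration of how such indicator indices could be derived for the stack, the hypothetical helper below assigns positive indices to items counted from the stack top, negative indices to popped items (most recent first), and 0 to items that have never entered the stack.

```python
def stack_indicators(stack, popped, n_words):
    """Hypothetical sketch: map every word to its stack indicator index.

    stack  -- word indices currently on the stack, top is the last element
    popped -- word indices already popped from the stack, most recent last
    Returns a list where entry w is: k (w is the k-th item from the stack top),
    -k (w was the k-th most recently popped item), or 0 (w has not been on the stack).
    """
    ind = [0] * n_words
    for depth, w in enumerate(reversed(stack), start=1):
        ind[w] = depth           # 1_sigma, 2_sigma, ...
    for age, w in enumerate(reversed(popped), start=1):
        ind[w] = -age            # -1_sigma, -2_sigma, ...
    return ind
```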

Encoding with USE
When a transition system is running at moment t, the parser needs to capture as much information as possible about the current configuration to determine the correct action. The key is to encode the configuration, which contains many structures, concisely and efficiently. We propose a unified structure encoder (USE) based on multi-head self-attention networks (Vaswani et al., 2017). Each head extracts a feature vector of one structure (e.g., o_σ for the stack).
A common USE function maps a query and a set of key-value pairs to an output. The query vector q represents the current time step and data structure.
The key-value pairs both represent the two views of the structure. The output vector o is calculated as a weighted sum of the values, where the weight assigned to each value is computed by a scaled (1/√d_k) dot-product of the query with the corresponding key. In practice, we pack the keys and values into matrices K and V, then compute the output as

o = softmax(qK^T / √d_k) V.    (1)

This function is universal across structures. Take the stack σ for example: we calculate the feature vector o_σ,t by assigning q_σ,t, K_σ,t, and V_σ,t as

q_σ,t = (m_t + m_σ) W^Q_σ,   K_σ,t = (X + S^K_σ,t) W^K_σ,   V_σ,t = (X + S^V_σ,t) W^V_σ,    (2)

where m_t and m_σ are the marker embeddings of time step t and data structure σ; W^Q_σ, W^K_σ, and W^V_σ are parameter matrices for linear transformations; and X is the word embedding matrix². We describe X and A in detail later. S^K_σ,t and S^V_σ,t are the embedding matrices of the structural indicators (see σ in Figure 3 for an example). Following Shaw et al. (2018), we use these two sets of structural embeddings for the key-value pairs and add them to X to combine the information.
When the system comes to the next moment t+1, we use m_{t+1}, S^K_σ,t+1, and S^V_σ,t+1 for the updated configuration. For the other four structures, we calculate their feature vectors o_β,t, o_α,t, o_Tarc,t, and o_Trel,t by assigning the corresponding q, K, and V, respectively³.
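The computation of one USE head (the stack head at step t) can be sketched as follows, assuming the additive combination of the marker embeddings and of X with the indicator embeddings described above; the function and argument names are illustrative, not the released implementation.

```python
import torch
import torch.nn.functional as F

def use_head(m_t, m_sigma, X, S_K, S_V, W_Q, W_K, W_V):
    """Minimal sketch of one USE head (e.g., the stack head at step t).

    m_t, m_sigma  -- marker embeddings for the time step and the structure, shape (d,)
    X             -- structure-invariant item encodings, shape (n, d)
    S_K, S_V      -- structure-indicator embeddings for keys/values, shape (n, d)
    W_Q, W_K, W_V -- projection matrices, shape (d, d_k)
    Returns the feature vector o_{sigma,t}, shape (d_k,).
    """
    q = (m_t + m_sigma) @ W_Q                     # query encodes "which step, which structure"
    K = (X + S_K) @ W_K                           # keys: invariant view + structure-dependent view
    V = (X + S_V) @ W_V                           # values: the same two views
    d_k = q.shape[-1]
    att = F.softmax(q @ K.T / d_k ** 0.5, dim=-1) # scaled dot-product weights
    return att @ V                                # weighted sum of values
```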

Fusion of Structure Encodings
After obtaining the feature vector of each structure, the encoder incorporates all of them into the configuration representation c_t. Here, we simply use a multi-layer perceptron (MLP):

c_t = MLP([o_σ,t ; o_β,t ; o_α,t ; o_Tarc,t ; o_Trel,t]).

Besides that, to enhance the interaction among structures, we stack L USE layers and add the previous layer's configuration vector c^(l−1)_t (1 < l ≤ L) when computing the query vector q^(l)_{*,t} (* stands for any structure). Since c^(l−1)_t contains the complete structural information, the l-th layer's USE module can interact with other structures and output a more informative representation o^(l)_{*,t}. Then, we obtain the higher-layer configuration representation by combining these output vectors:

c^(l)_t = MLP([o^(l)_σ,t ; o^(l)_β,t ; o^(l)_α,t ; o^(l)_Tarc,t ; o^(l)_Trel,t]).
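A sketch of this fusion step under the assumptions above (concatenation of the five head outputs followed by an MLP with layer normalization); the module name and layer sizes are hypothetical.

```python
import torch
import torch.nn as nn

class USEFusion(nn.Module):
    """Sketch of the fusion step: concatenate per-structure head outputs and apply an MLP.
    In the stacked model, the resulting c^(l) would be added into the next layer's queries."""

    def __init__(self, d_model, n_structures=5):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(n_structures * d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, d_model),
        )
        self.norm = nn.LayerNorm(d_model)

    def forward(self, head_outputs):
        # head_outputs: [o_sigma, o_beta, o_alpha, o_Tarc, o_Trel], each of shape (d_model,)
        c = self.mlp(torch.cat(head_outputs, dim=-1))
        return self.norm(c)
```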
2 Note that we use action embedding matrix A instead of X when encoding action list α.
3 Similar to Equation 2, we give the formulation for the other data structures in Appendix B.
Figure 4: Structural information coverage and GPU-friendliness of different feature extractors (complete vs. partial structure extraction; GPU-parallel friendly vs. unfriendly).
We set different layers with different parameters (preliminary experiments suggest that sharing parameters performs worse). To support deeper networks, residual connections and layer normalization (Ba et al., 2016) are employed on the MLP and USE modules. Finally, we use c^(L)_t of the last layer to classify actions.
Basically, we need at least 5 attention heads to extract full structures (each head corresponds to one structure). Vaswani et al. (2017) noted that a multi-head attention layer has a constant number (O(1)) of sequentially executed operations, which means that efficient GPU-based computing is possible. In training, the USE calculations at different moments are independent of each other, so we can pack them into the batch dimension to obtain an O(1) training complexity. Hence, USE can uniformly extract full structure features efficiently.
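During training, the per-step USE computations can indeed be packed along the batch dimension; the following sketch batches the queries of all T decoding steps in one matrix operation (tensor shapes and the einsum-based batching are our illustration, not the released code).

```python
import torch
import torch.nn.functional as F

def use_head_batched(Q, X, S_K, S_V, W_K, W_V):
    """Sketch: pack the queries of all T decoding steps into one batch dimension.

    Q        -- already-projected queries for every step, shape (T, d_k)
    X        -- shared structure-invariant encodings, shape (n, d)
    S_K, S_V -- per-step indicator embeddings, shape (T, n, d)
    W_K, W_V -- projection matrices, shape (d, d_k)
    """
    K = (X.unsqueeze(0) + S_K) @ W_K              # (T, n, d_k)
    V = (X.unsqueeze(0) + S_V) @ W_V              # (T, n, d_k)
    att = F.softmax(torch.einsum("td,tnd->tn", Q, K) / Q.shape[-1] ** 0.5, dim=-1)
    return torch.einsum("tn,tnd->td", att, V)     # (T, d_k): one o_{sigma,t} per step
```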

Comparing to Previous Encoders
We divide previous work into three encoding methods: top-k, stack-LSTM, and binary vector. Top-k methods (Chen and Manning, 2014; Weiss et al., 2015) capture the conjunction of only a few (1–3) in-structure items, so they extract only partial structural information. Since the feature template is fixed, they are easy to batch. Stack-LSTM methods (Dyer et al., 2015; Ballesteros et al., 2016) can efficiently represent all in-structure items via PUSH(·) and POP(·) functions, but they lose the information of outside parts and of subtrees, which cannot be treated as a stack. Besides, Che et al. (2019) point out that their batch computation is very inefficient. Binary vector methods (Zhang et al., 2017) use two binary vectors to model whether each element is in σ or β. They can efficiently encode some outside parts of the stack and buffer, but lose the information of inside positions.
We compare existing work with our USE encoder in terms of the coverage of structure features and GPU-computation friendliness (Figure 4). Overall, USE does not lose any structural information and is more efficient than previous feature extraction schemes.

Encoding the Structure-invariant View
Given a sentence s = ω_0, ..., ω_n, we learn a lexical vector x_i for each word ω_i and pack them into the matrix X for Equation 1. The vector x_i is composed of three parts: the word embedding e(ω_i), the part-of-speech (POS) tag embedding e(g_i), and the character-level representation vector CharCNN(ω_i).
We simply initialize all embedding matrices randomly. The CharCNN(ω_i) vector is obtained by feeding ω_i into a character convolutional neural network (Zhang et al., 2015). To encode more sentence context, e(ω_i) can be obtained from a trainable bidirectional long short-term memory network, a Transformer encoder, or a pre-trained network like Bert.
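A minimal sketch of the structure-invariant lexical vector, assuming simple concatenation of the three parts and a single convolution with max-pooling as the CharCNN (the module and dimension choices are illustrative):

```python
import torch
import torch.nn as nn

class LexicalEncoder(nn.Module):
    """Sketch: x_i = [e(w_i); e(g_i); CharCNN(w_i)] for each word, packed into matrix X."""

    def __init__(self, n_words, n_tags, n_chars, d_word=100, d_tag=32, d_char=32):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, d_word)
        self.tag_emb = nn.Embedding(n_tags, d_tag)
        self.char_emb = nn.Embedding(n_chars, d_char)
        self.char_cnn = nn.Conv1d(d_char, d_char, kernel_size=3, padding=1)

    def forward(self, word_ids, tag_ids, char_ids):
        # char_ids: (n, max_chars); pool the character convolution over each word
        chars = self.char_emb(char_ids).transpose(1, 2)        # (n, d_char, max_chars)
        char_vec = self.char_cnn(chars).max(dim=-1).values     # (n, d_char)
        return torch.cat(
            [self.word_emb(word_ids), self.tag_emb(tag_ids), char_vec], dim=-1
        )  # X, shape (n, d_word + d_tag + d_char)
```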
Given an action list α = a_0, ..., a_m, we learn a structure-invariant vector for each action a_i. Because the action space is only 2|R| + 1 (< 10²), we directly obtain it as the action embedding e(a_i).
Since all decoding steps share the same structure-invariant representations, they are computed only once. In the experiments, we discuss all of the mentioned encoding options.

The Action Classifier
The action set A is first divided into three main types, sh, la, and ra, and then into |R| dependency labels (only for la and ra actions). Thus we perform a two-stage process with a 3-class and an |R|-class classification. This effectively reduces the classification space compared with a one-stage process. For example, a sh action lives in a 3-class space in the two-stage process but in a (2|R|+1)-class space in the one-stage process.
For the action type classification, we score the three types based on c^(L)_t. For the dependency label classification, we follow Dozat and Manning (2017), which uses a biaffine score function to predict the label's probability, where W_1 is a 3-dimensional parameter tensor, W_2 is a parameter matrix, and b is a parameter vector⁴. A slight difference is that we introduce c^(L)_t to model the prior probability of each label under the current configuration.
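The two-stage classification could be sketched as follows; the label scorer here is a plain linear layer rather than the biaffine function used in the paper, purely to keep the example short, and the module name is hypothetical.

```python
import torch.nn as nn

class TwoStageClassifier(nn.Module):
    """Sketch of two-stage prediction: stage 1 picks among {sh, la, ra}, stage 2 picks a
    dependency label for la/ra actions. The paper scores labels with a biaffine function;
    a linear scorer is used here only as an illustrative stand-in."""

    def __init__(self, d_model, n_labels):
        super().__init__()
        self.type_scorer = nn.Linear(d_model, 3)        # stage 1: 3-class
        self.label_scorer = nn.Linear(d_model, n_labels) # stage 2: |R|-class

    def forward(self, c_t):
        return self.type_scorer(c_t), self.label_scorer(c_t)
```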

Training Details
We have two training objectives: one scores the correct action higher than incorrect actions, and the other maximizes the probability of the correct dependency label. For the correct action α*, we use a hinge loss L_α that maximizes the margin between its score and the score of the highest-scoring incorrect action α. For the correct dependency label r*, we use a cross-entropy loss L_r that maximizes its probability. The final objective is to minimize a weighted combination of the two: L = λ_1 L_α + λ_2 L_r. We follow Kiperwasser and Goldberg (2016), who use error exploration training with a dynamic oracle and an aggressive strategy. A parser that always takes the correct action during training will suffer from error propagation during testing. To allow training on wrong actions, the dynamic oracle defines the "correct" actions even when the current configuration cannot lead to the gold tree. The aggressive exploration forces taking a wrong action with probability p_agg = 0.1.
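Under the description above, the two losses and their weighted combination might look like the sketch below; the margin of 1 in the hinge loss is an assumption, since its exact value is not restated here, and the function names are illustrative.

```python
import torch
import torch.nn.functional as F

def action_hinge_loss(action_scores, gold_action, margin=1.0):
    """Sketch: margin between the correct action's score and the best incorrect score."""
    wrong = action_scores.clone()
    wrong[gold_action] = float("-inf")   # mask out the correct action
    return torch.clamp(margin - action_scores[gold_action] + wrong.max(), min=0.0)

def label_ce_loss(label_scores, gold_label):
    """Sketch: cross-entropy on the correct dependency label."""
    return F.cross_entropy(label_scores.unsqueeze(0), torch.tensor([gold_label]))

def total_loss(action_scores, gold_action, label_scores, gold_label, lam1=1.0, lam2=1.0):
    # L = lambda_1 * L_alpha + lambda_2 * L_r (the weights are hyper-parameters)
    return lam1 * action_hinge_loss(action_scores, gold_action) + \
           lam2 * label_ce_loss(label_scores, gold_label)
```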
Which structural indicators are important for the parser? To show the importance of a variable, a standard approach is to use its partial derivative multiplied by the variable's value (Denil et al., 2015). Hence, the importance score I of a structural indicator is the dot product between the objective function's gradient and the indicator embedding. Concretely, I(σ, i) shows the relevance between stack indicator i and the decision of our parser. The importance of indicator i of any other structure can be derived similarly. We can further accumulate the relevance of multiple items, for example the stack inside relevance I_in(σ) = Σ_{i>0} I(σ, i) and the stack outside relevance I_out(σ) = Σ_{i≤0} I(σ, i). In the experiments, we use this score to explain the importance of each structure part.
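A small sketch of how such an importance score could be computed for a single indicator embedding (the function name is illustrative; the embedding must be part of the computation graph with gradients enabled):

```python
import torch

def indicator_importance(loss, indicator_embedding):
    """Sketch: importance = gradient of the objective w.r.t. the indicator embedding,
    dotted with the embedding itself (Denil et al., 2015 style relevance)."""
    grad, = torch.autograd.grad(loss, indicator_embedding, retain_graph=True)
    return torch.dot(grad.flatten(), indicator_embedding.flatten())
```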

Experiments
Data We conduct experiments and analysis on two main datasets including 12 languages: the English Penn Treebank (PTB 3.0) with Stanford dependencies, and the Universal Dependencies (UD 2.2)  treebanks used in CoNLL 2018 shared task (Zeman et al., 2018). The statistics of datasets are in Appendix C. For PTB, we use the standard train/dev/test splits and the external POS tags obtained by the Stanford tagger (accuracy ≈ 97.3%). Following Ji et al. (2019), we select 12 languages from UD, and use CoNLL shared task's official train/dev/test splits, where the POS tags were assigned by the UDPipe (Straka et al., 2016).
Evaluation We mainly report unlabeled (UAS) and labeled attachment scores (LAS). For evaluations on PTB, five punctuation symbols (`` '' : , .) are excluded, while on UD we use the official evaluation script.

Table 1: Results on the English PTB dataset. "T" represents transition-based parsers, and "G" represents graph-based parsers. We report the average over 5 runs.

Hyper-parameters
Models are trained for a fixed number of iterations, stopping early if peak performance on the dev set does not increase for 100 epochs. The details of the chosen hyper-parameters in the default settings are summarized in Appendix D.

Main Results
Firstly, we compare our method with previous work (Table 1). The first part contains transition-based models. We particularly compare with the two strong baselines in the blue cell, where Ma et al. (2018) decode parse trees in a depth-first manner with a stack-pointer network, and Yuan et al. (2019) decode transition sequences in both the forward and backward directions via multi-task learning. In a fair comparison, our three unified structure encoding (USE) parsers all achieve significant improvements on PTB. This demonstrates the benefit of the complete structural information provided by our unified encoding. Secondly, we compare with strong graph-based parsers. The second part of Table 1 contains two first-order parsers and two high-order parsers (in the red cell). Our USE parsers beat the first-order methods, but underperform the high-order methods, which capture high-order features with graph neural networks and TreeCRF. However, the speed experiments show that USE is about 2 times faster than them. It is our future work to bridge the performance gap by using the bidirectional transition system (Yuan et al., 2019) and stronger decoding methods (Andor et al., 2016).
Thirdly, we compare the results of three USE parsers with different transition systems (third part of Table 1). We can see that the arc-hybrid system is more expressive than the arc-eager and arc-standard systems. Shi et al. (2017) demonstrate that arc-eager is more expressive with a minimal feature set, but our results do not support this with a full feature set. The reason may be that, when the feature set is full, the arc-eager system has one more action (REDUCE) than arc-standard in the first stage of classification.
Head Allocation Here we discuss the allocation of structural heads. Our basic idea is to assign one head to one structure, which means five heads in total. We perform two sets of ablation experiments based on this basic setup: decreasing or increasing one head for each structure, respectively (Figure 5). Decreasing one head means that the corresponding structure is not visible to the parser. Losing the information of the stack or buffer severely hurts performance, while losing the information of the action list or subtree hurts it only slightly. This suggests that the stack and buffer are more important in the arc-hybrid transition system, and we should pay more attention to them. Increasing one head shows the improvement from giving the corresponding structure double attention. We observe obvious performance gains for the stack, buffer, and action list, which means that augmenting their information is helpful. Considering the performance gain and the computational cost of adding heads, we finally use a total of 8 structural heads: the parser attends twice to the stack, buffer, and action list.

Lexical Representation
We analyze the different lexical word representations from Section 4.3 (Table 2). The first part reports the use of context-independent GloVe embeddings (Pennington et al., 2014) in the arc-hybrid system, where sentence context is learned via a BiLSTM or a Transformer encoder. The results show that encoding context further improves performance, and the Transformer encoder is better than the BiLSTM. The second part reports the use of contextual Bert networks (Devlin et al., 2019). Introducing Bert, and in particular fine-tuning it, significantly increases performance. Compared with Mohammadshahi and Henderson (2020), our parser performs better because it encodes the full structure rather than only the top-k in-structure items.

Speed Table 3 compares the parsing speed of different parsers on the PTB test set. For a fair comparison, we run all parsers with python implementations on the same machine with an Intel Xeon E5-2650v4 CPU and a GeForce GTX1080Ti GPU. The USE parser parses about 918 sentences per second, over 5 times faster than the strongest transition-based parser (Ma et al., 2018). This result shows the efficiency of the attention mechanism. Compared to three graph-based parsers, our parser is nearly 2 times faster, because the transition-based parser decodes linearly and does not require complex decoding algorithms such as minimum spanning tree or TreeCRF. Considering parsing performance and speed together, our proposed parser is able to meet the requirements of a real-time system.

Interpretability Figure 6 visualizes the importance score of each structure part in the arc-hybrid transition system. Consistent with the findings in Figure 5, the stack and buffer achieve higher importance scores; however, the interpretation method reaches this conclusion without retraining the parser. Furthermore, we observe that the outside information of the stack and buffer is more important than the subtree structure, which suggests that a transition parser should encode it.

Table 4 compares our USE parser with two baselines on the UD datasets (Zhang20 denotes Zhang et al. (2020); results are averaged over 3 runs). We adopt the non-projective arc-hybrid system for handling the ubiquitous non-projective trees (de Lhoneux et al., 2017). As the transition-based baseline, we re-ran the parser of Ma et al. (2018) under the same hyper-parameter settings as ours. Our USE parser outperforms this baseline on all 12 languages, with an average improvement of 0.37 LAS, again showing the power of the complete transition system encoder. Compared with the strongest graph-based baseline (Zhang20), our parser performs better on 4 treebanks: bg, de, nl, and ro. These four treebanks are relatively smaller than the others, probably indicating that our parser is more suitable for low-resource languages. Overall, there is still a 0.15 averaged LAS gap with the graph-based baseline, and it is our future work to further improve the USE transition-based parser.

Related Work
We have already surveyed related transition system encoders in Section 4.2. Here we present several powerful transition-based parsers. Ma et al. (2018) decode a parse tree step by step in a depth-first traversal order; since a stack is usually used to maintain the depth-first search, they use a stack-pointer network for decoding. Note that their work is not based on any transition system. Yuan et al. (2019) propose a bidirectional decoding method for a stack-LSTM transition-based parser, performing joint decoding with a left-to-right parser and a right-to-left parser. Mohammadshahi and Henderson (2020) propose a Graph2Graph framework that enhances expressiveness by treating multiple structures as multiple sentences and using a Transformer encoder (Bert) to encode the top-k words. These works focus on improving the decoding approach or the representation learning of structure-invariant parts, but still follow traditional encoders. Our work focuses on a new encoder with both information completeness and computational efficiency.
There have been several attempts to combine attention networks with structures. To better represent sequential structure, Shaw et al. (2018) introduce relative positions between words in attention networks instead of concatenating absolute positions to the input. A follow-up work defines relative positions on parse trees to encode each word pair's tree distance, feeding these positional embeddings to the attention networks as well. These two works encode static structures, while we encode a dynamically changing transition system. Shiv and Quirk (2019) extend the Transformer's sinusoidal position function to tree structures. Similar to us, their decoder dynamically computes new position encodings when generating a tree structure, but their structural embeddings are computed by a fixed sinusoidal function, while ours are learnable. These works encode only one structure, while we encode multiple structures from a transition system.

Conclusion
We presented a comprehensive and efficient encoder for transition systems. We separate each structure into a structure-invariant part and a structure-dependent part, which allows us to dynamically encode the complete structure while retaining training and testing efficiency. Experiments show that the proposed parser achieves new state-of-the-art transition-based results.