Enhanced Universal Dependency Parsing with Automated Concatenation of Embeddings

This paper describes the system we submitted to the IWPT 2021 Shared Task. Our system is a graph-based parser equipped with Automated Concatenation of Embeddings (ACE). Since recent work found that better word representations can be obtained by concatenating different types of embeddings, we use ACE to automatically find a better concatenation of embeddings for the task of enhanced universal dependencies. According to the official results averaged over 17 languages, our system ranks 2nd among 9 teams.


Introduction
Compared to Universal Dependencies (UD) (Nivre et al., 2016), Enhanced Universal Dependencies (EUD) (Bouma et al., 2020, 2021) makes some of the implicit relations between words explicit and augments some of the dependency labels to facilitate the disambiguation of types of arguments and modifiers. The EUD representation is an enhanced graph with reentrancies, cycles, and empty nodes. Such a representation can express richer grammatical relations than rooted trees, but it is harder to learn. To make learning easier, we transform the enhanced graph into a bi-lexical structure, like the annotations of semantic dependency parsing (SDP) (Oepen et al., 2015), by reducing reentrancies and empty nodes into new labels. Many approaches for SDP can therefore be adopted for EUD. Instead of the second-order parser used in previous work (Wang et al., 2019, 2020b; Wang and Tu, 2020), we apply the biaffine parser (Dozat and Manning, 2018), one of the state-of-the-art approaches to SDP, for simplicity.
Recent developments in pre-trained contextualized embeddings have significantly improved the performance of structured prediction tasks in natural language processing. A lot of work has also shown that word representations based on the concatenation of multiple pre-trained contextualized embeddings and traditional non-contextualized embeddings (such as word2vec (Mikolov et al., 2013) and character embeddings (Santos and Zadrozny, 2014)) can further improve performance (Peters et al., 2018; Akbik et al., 2018; Straková et al., 2019; Wang et al., 2020a). Wang et al. (2021) proposed Automated Concatenation of Embeddings (ACE) to automate the process of finding better concatenations of embeddings, which further improved performance on many tasks. We utilize their method to find concatenations of pre-trained embeddings to serve as the input of the biaffine parser for EUD. There are many candidate embeddings: contextualized embeddings such as XLM-R (Conneau et al., 2020a), BERT (Devlin et al., 2018), and Flair (Akbik et al., 2018); non-contextualized embeddings such as word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), and fastText (Bojanowski et al., 2017); and character embeddings (Santos and Zadrozny, 2014). The search space of embedding concatenations is therefore large, and we additionally need to train models for 17 languages respectively. Following Wang et al. (2021), we use reinforcement learning to efficiently find a better embedding concatenation for each language. Experimental results averaged over 17 languages show the effectiveness of our approach. Our system ranked 2nd among 9 teams in the official evaluation.

Data Pre-processing
We adopt the same data pre-processing method as Wang et al. (2020b), which transforms EUD graphs into SDP graphs. For multiple arcs between the same head and dependent with different labels in the EUD graph, we combine these arcs into one and concatenate their labels with a special symbol '+' representing the combination of the arcs. For the empty nodes in the EUD graph, an official script reduces such empty nodes into non-empty nodes with new dependency labels.
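The arc-combination step can be sketched as follows. This is a minimal illustration, not the official pre-processing script; the `(head, dependent, label)` triple format is a hypothetical simplification of the CoNLL-U enhanced-dependencies column.

```python
def merge_multi_edges(edges):
    """Merge EUD arcs sharing the same (head, dependent) pair into a single
    arc whose label joins the original labels with '+'.
    edges: list of (head, dependent, label) triples (simplified format)."""
    merged = {}
    for head, dep, label in edges:
        key = (head, dep)
        merged[key] = merged[key] + "+" + label if key in merged else label
    return [(h, d, lab) for (h, d), lab in sorted(merged.items())]
```

After this step every (head, dependent) pair carries at most one arc, so the graph fits the bi-lexical SDP format.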

Approach
We follow the approach of Wang et al. (2021) to build our system. Our system contains two parts: an ACE module that determines the embedding concatenation used as input, and a biaffine parser that predicts the existence and labels of edges between each word pair. We introduce the two parts respectively.

ACE  Given a sentence with n words w = [w_1, w_2, ..., w_n], we first get the input representation V = [v_1, ..., v_n], where v_i is the word representation of the i-th word, a concatenation of L types of word embeddings:

    v_i = [embed_1(w)_i; embed_2(w)_i; ...; embed_L(w)_i]

where embed_l is the model of the l-th embedding type. ACE applies a mask a = [a_1, a_2, ..., a_L] to choose a subset of the L embedding types and mask out the rest, so the representation becomes:

    v_i^a = [a_1 * embed_1(w)_i; a_2 * embed_2(w)_i; ...; a_L * embed_L(w)_i]

where each a_l is a binary variable. To learn this mask (i.e., the embedding concatenation), we use a controller that interacts with our EUD parser to iteratively generate embedding masks from the search space. The probability of selecting a concatenation a is defined as P^ctrl(a; θ) = Π_{l=1}^{L} P_l^ctrl(a_l; θ_l), where each element a_l of a is sampled independently from a Bernoulli distribution:

    P_l^ctrl(a_l = 1; θ_l) = σ(θ_l)

where σ is the sigmoid function.
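The controller's sampling step can be sketched as below. This is a simplified stand-in for the ACE controller, not its actual implementation; the fallback that forces at least one selected embedding is an assumption to keep the input non-empty.

```python
import math
import random

def sample_mask(theta, rng=None):
    """Sample a binary mask a with a_l ~ Bernoulli(sigmoid(theta_l)),
    one independent decision per candidate embedding type."""
    rng = rng or random.Random(0)
    probs = [1.0 / (1.0 + math.exp(-t)) for t in theta]
    a = [1 if rng.random() < p else 0 for p in probs]
    if not any(a):  # assumption: always keep at least the likeliest embedding
        a[max(range(len(probs)), key=probs.__getitem__)] = 1
    return a

def concat_selected(embeds, a):
    """Concatenate only the embedding vectors selected by the mask
    (the masked-out types contribute nothing to v_i)."""
    return [x for vec, keep in zip(embeds, a) if keep for x in vec]
```

Each reinforcement-learning step samples one mask, trains and evaluates the parser with that concatenation, and feeds the resulting accuracy back to the controller.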
We use reinforcement learning, taking the accuracy of our EUD parser on the development set as the reward signal R. The controller's objective is to maximize the expected reward J(θ) = E_{P^ctrl(a;θ)}[R] through the policy gradient method (Williams, 1992). We define the reward for the concatenation a_t sampled at time step t as:

    r_t = Σ_{i=1}^{t-1} (R_t − R_i) γ^{Hamm(a_t, a_i) − 1} |a_t − a_i|

where γ ∈ (0, 1) is a discount factor, |a_t − a_i| is a binary vector representing the change between the current embedding concatenation a_t at time step t and a_i at a previous time step i, R_t and R_i are the rewards at time steps t and i, and Hamm(a_t, a_i) is the Hamming distance between the two concatenations.
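The reward computation above can be sketched as follows, assuming the discounted-difference form with exponent Hamm(a_t, a_i) − 1 as in Wang et al. (2021); the `history` list of (mask, dev accuracy) pairs is a hypothetical bookkeeping structure.

```python
def controller_reward(history, gamma=0.5):
    """Per-element reward r_t for the latest sampled concatenation.
    history: list of (mask, dev_accuracy) pairs; the last entry is step t."""
    a_t, R_t = history[-1]
    r = [0.0] * len(a_t)
    for a_i, R_i in history[:-1]:
        diff = [abs(x - y) for x, y in zip(a_t, a_i)]  # the vector |a_t - a_i|
        hamm = sum(diff)                               # Hamming distance
        if hamm == 0:
            continue
        scale = (R_t - R_i) * gamma ** (hamm - 1)
        r = [ri + scale * d for ri, d in zip(r, diff)]
    return r
```

Only the mask elements that actually changed between the two steps receive credit, and comparisons to distant concatenations are discounted by γ.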
Since calculating the exact expectation is intractable in our approach, the gradient of J(θ) is approximated by sampling only one concatenation from the distribution P^ctrl(a; θ) at each step for training efficiency. With the reward function, the final gradient estimate is:

    ∇_θ J(θ) ≈ Σ_{l=1}^{L} r_{t,l} ∇_{θ_l} log P_l^ctrl(a_{t,l}; θ_l)

EUD Parser  After getting the representation V of the sentence w, we feed it into a three-layer BiLSTM:

    R = [r_1, ..., r_n] = BiLSTM(V)

where R represents the output of the BiLSTM. For arc prediction and label prediction, we use two different feed-forward networks (FFN) followed by biaffine functions:

    h_i^(arc-dep) = FFN^(arc-dep)(r_i),  h_j^(arc-head) = FFN^(arc-head)(r_j)
    s_ij^(arc) = Biaffine^(arc)(h_i^(arc-dep), h_j^(arc-head))

and analogously for the label scores s_ij^(label). The arc probability distribution and the label probability distribution for each potential arc are:

    P(y_ij^(arc)) = softmax([s_ij^(arc); 0])
    P(y_ij^(label)) = softmax(s_ij^(label))

For decoding, we first use the MST algorithm (McDonald et al., 2005) to obtain a tree structure, and then additionally add arcs at the positions where s_ij^(arc) > 0. This method yields an EUD graph and ensures the connectivity of the graph. Wang et al. (2020b) show that the non-projective tree algorithm (MST) is better than the projective tree algorithm (Eisner's) for the EUD task. We select the label with the highest score for each predicted arc.
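The decoding step can be sketched as below. This is a minimal illustration under the assumption that an MST decoder has already produced the backbone tree (`tree_heads`); it only shows how extra positive-scoring arcs are merged in while the tree guarantees connectivity.

```python
def decode_graph(tree_heads, scores):
    """Combine a guaranteed-connected tree with extra positive-scoring arcs.
    tree_heads[d]: head of dependent d from an MST decoder (None for the root);
    scores[h][d]: biaffine arc score s_hd (illustrative dense matrix)."""
    n = len(scores)
    arcs = {(h, d) for d, h in enumerate(tree_heads) if h is not None}
    for h in range(n):
        for d in range(n):
            if h != d and scores[h][d] > 0:  # add every arc with s_hd > 0
                arcs.add((h, d))
    return sorted(arcs)
```

Tree arcs are kept even when their scores are not positive, which is what prevents the output graph from falling apart into disconnected pieces.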
Given a labeled sentence (w, Y), where Y stands for the gold parse graph, we train the system following the approach of Wang et al. (2019) with cross-entropy losses:

    L^(arc)(Λ) = − Σ_{i,j} [ 1(y_ij^(arc)) log P(y_ij^(arc)) + (1 − 1(y_ij^(arc))) log(1 − P(y_ij^(arc))) ]
    L^(label)(Λ) = − Σ_{i,j} 1(y_ij^(arc)) log P(y_ij^(label))

where Λ denotes the parameters of our system, 1(y_ij^(arc)) is the indicator function that equals 1 when edge (i, j) exists in the gold parse and 0 otherwise, and i, j range over all the tokens w in the sentence. The two losses are combined by a weighted average:

    L(Λ) = λ L^(arc)(Λ) + (1 − λ) L^(label)(Λ)

where λ is a hyper-parameter.
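The loss combination can be sketched numerically as follows. This is an illustrative scalar version, not the batched PyTorch implementation; the probability-matrix format is an assumption.

```python
import math

def arc_loss(P, gold):
    """Cross-entropy over all word pairs (i, j).
    P[i][j]: model probability that arc (i, j) exists;
    gold[i][j]: the indicator 1(y_ij), i.e. 1 if the gold arc exists."""
    return -sum(
        g * math.log(p) + (1 - g) * math.log(1 - p)
        for Pi, gi in zip(P, gold)
        for p, g in zip(Pi, gi)
    )

def combined_loss(l_arc, l_label, lam):
    """Weighted average of the arc and label losses (lambda is a
    hyper-parameter)."""
    return lam * l_arc + (1 - lam) * l_label
```

Note that the label loss is only accumulated over gold arcs, since label prediction is undefined where no arc exists.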

Experimental Settings
We use the official development set as our development set, tuning the hyper-parameters on it and determining their values according to the labeled F1 score (LF1), the evaluation metric used in SDP, which measures the correctness of each arc-label pair. We use a batch size of 2,000 tokens with the Adam optimizer (Kingma and Ba, 2015). We run 30 steps of reinforcement learning; the time of each step depends on the size of the dataset. The hyper-parameters of our biaffine parser are shown in Table 5; they are mostly adopted from previous work on dependency parsing. For the hyper-parameters of our ACE module, we follow the settings of Wang et al. (2021). We only use the tokenized words as model input. For sentence and word segmentation, we used the pretrained large model of trankit (Nguyen et al., 2021). The embeddings used in the ACE module for each language are shown in Table 1. For transformer-style embeddings, we only take the hidden states of the topmost layer, and we represent a word split into multiple subword pieces by the representation of its first piece. We built our code on PyTorch (Paszke et al., 2019) and trained the model for each language on a single Tesla V100 GPU.
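The first-piece subword pooling can be sketched as below; the function name and input format are illustrative, not the actual implementation.

```python
def first_piece_pool(piece_vecs, word_to_pieces):
    """Represent each word by the hidden state of its first subword piece.
    piece_vecs: one vector per subword piece (topmost transformer layer);
    word_to_pieces: for each word, the indices of its subword pieces."""
    return [piece_vecs[pieces[0]] for pieces in word_to_pieces]
```

This keeps the word-level sequence length fixed regardless of how aggressively the tokenizer splits words into pieces.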

Main Results
Table 2 shows the ELAS scores (defined as the F1 score over the set of enhanced dependencies in the system output and the gold standard) on the development set for the biaffine parser with a fine-tuned single XLM-R embedding and with our ACE module. With ACE, the performance of the models for most languages improves substantially. Table 3 shows the official evaluation results of all teams; we only show ELAS. Our model ranks 1st on Arabic and 2nd on ELAS averaged over 17 languages.

Tokenization Performances of Different Toolkits
In our experiments, we tried two different tokenization toolkits. One is stanza (Qi et al., 2020), from the Stanford NLP Group; the other is trankit (Nguyen et al., 2021), a lightweight Transformer-based Python toolkit for multilingual NLP. We use the pretrained models of the two toolkits respectively. Furthermore, we train a stanza tokenization model for each language.
Both settings of stanza are worse than trankit on sentence segmentation score. Table 4 shows the sentence and word segmentation scores of stanza trained on each language and of pretrained trankit. Although stanza is better than trankit on token segmentation score, there is a large gap in sentence segmentation score between trankit and stanza. Therefore, the final ELAS on the test set tokenized by trankit is better than with stanza.

Conclusion
Our system combines automated concatenation of embeddings with a biaffine parser. Empirical results show the effectiveness of ACE for enhanced universal dependency parsing. Our system ranks 2nd among 9 teams according to the official ELAS.

Table 1 :
The embeddings used in our system. The URL indicates where we downloaded each embedding. 'all' means the model is used for all languages; 'other' means the RoBERTa model is used for all languages except uk, ru, and nl.

Table 2 :
Comparison of ELAS scores on the development set between fine-tuning a single XLM-R embedding and using ACE.

Table 3 :
Official results of all systems.

Table 4 :
Comparison of different tokenization toolkits.

Table 5 :
Hyper-parameters for our system.