The DCU-EPFL Enhanced Dependency Parser at the IWPT 2021 Shared Task

We describe the DCU-EPFL submission to the IWPT 2021 Parsing Shared Task: From Raw Text to Enhanced Universal Dependencies. The task involves parsing Enhanced UD graphs, an extension of the basic dependency trees designed to be better suited to representing semantic structure. Evaluation is carried out on 29 treebanks in 17 languages, and participants are required to parse the data from each language starting from raw strings. Our approach uses the Stanza pipeline to preprocess the text files, XLM-RoBERTa to obtain contextualized token representations, and an edge-scoring and labeling model to predict the enhanced graph. Finally, we run a postprocessing script to ensure all of our outputs are valid Enhanced UD graphs. Our system places 6th out of 9 participants with a coarse Enhanced Labeled Attachment Score (ELAS) of 83.57. We carry out additional post-deadline experiments which include using Trankit for pre-processing, XLM-RoBERTa Large, treebank concatenation, and multitask learning between a basic and an enhanced dependency parser. All of these modifications improve our initial score, and our final system achieves a coarse ELAS of 88.04.


Introduction
The IWPT 2021 Parsing Shared Task: From Raw Text to Enhanced Universal Dependencies (Bouma et al., 2021) is the second task involving the prediction of Enhanced Universal Dependencies (EUD) graphs,[1] following the 2020 task (Bouma et al., 2020). EUD graphs are an extension of basic UD trees, designed to be more useful in shallow natural language understanding tasks (Schuster and Manning, 2016) and to lend themselves more easily to the representation of semantic structure than strict surface-structure dependency trees. In the shared task, the enhanced graphs must be predicted from raw text, i.e. participants must segment the input into sentences and tokens. Participants are encouraged to predict lemmas, Part-of-Speech (POS) tags, morphological features and basic dependency trees as well.

[1] https://universaldependencies.org/u/overview/enhanced-syntax.html
Our system, DCU-EPFL, uses a single multilingual Transformer (Vaswani et al., 2017) encoder, namely XLM-RoBERTa (XLM-R) (Conneau et al., 2020), a multilingual RoBERTa model (Liu et al., 2019), to obtain contextualized token encodings. These are then passed to the enhanced dependency parsing model. The system is straightforward to apply to new languages with enhanced UD annotations. In the official submission, we use the same hyper-parameters for all languages. Our parsing component can produce arbitrary graphs, including graph structures where words may have multiple heads and cyclic graphs. Our system uses the following three components:
1. Stanza (Qi et al., 2020) for sentence segmentation, tokenization and the prediction of all UD features apart from the enhanced graph.
2. A Transformer-based dependency parsing model to predict Enhanced UD graphs.
3. A post-processor ensuring that every graph is a rooted graph where all nodes are reachable from the notional root token.
Our official system placed 6th out of 9 teams with a coarse Enhanced Labeled Attachment Score (ELAS) of 83.57. In a number of unofficial post-evaluation experiments, we make four incremental changes to our pipeline approach:
1. We replace the Stanza pre-processing pipeline with Trankit.
2. We use XLM-R Large instead of XLM-R Base.
3. We concatenate treebanks for languages which have more than one training treebank, and we concatenate English treebanks to the Tamil training data.
4. We introduce a novel multitask model which parses the basic UD tree and enhanced graph in tandem.
All of these additional steps improve our evaluation scores, and our final system, which incorporates all of the modifications, reaches an ELAS of 88.04, up from 83.57. Our code is publicly available.[2]

Related Work
In this section, we discuss the relevant literature related to Enhanced Universal Dependencies.

Enhanced Universal Dependencies
Despite the recent wave of Deep Learning models and accompanying analyses showing that such models learn information about syntax, there is still interest and merit in utilizing hierarchically structured representations such as trees and semantic representations to provide greater supervision about what is taking place in a sentence (Oepen et al., 2019). While dependency trees are often used in downstream applications, their structural restrictions may hinder the representation of content words (Schuster and Manning, 2016). The Enhanced UD representation tries to fill this gap by enabling more expressive graphs in the UD format, which capture phenomena such as added subject relations in control and raising, shared heads and dependents in coordination, the insertion of null nodes for elided predicates, co-reference in relative clause constructions, and the augmentation of modifier relations with prepositional or case-marking information.

Schuster and Manning (2016) build on the Stanford Dependencies (SD) initiative (de Marneffe et al., 2006) and extend certain flavors of the SD dependency graph representations to UD in the form of enhanced UD relations for English. They use a rule-based system that converts basic UD trees to enhanced UD graphs based on dependency structures identified as requiring enhancement. Nivre et al. (2018) use rule-based and data-driven approaches in a cross-lingual setting for bootstrapping enhanced UD representations in Swedish and Italian and show that both techniques are capable of annotating enhanced dependencies in different languages.

The IWPT 2020 Shared Task on Parsing Enhanced Universal Dependencies
The first shared task on parsing Enhanced Universal Dependencies (Bouma et al., 2020) brought renewed attention to the problem of predicting enhanced UD graphs. Ten teams submitted to the task.
The winning system (Kanerva et al., 2020) utilized the UDify model (Kondratyuk and Straka, 2019), which uses a BERT model (Devlin et al., 2019) as the encoder with multitask classifiers for POS tagging, morphological prediction and dependency parsing built on top. They developed a system for encoding the enhanced representation into the basic dependencies so it can be predicted in the same way as a basic dependency tree, but with enriched dependency types that can then be converted into the enhanced structure. In an unofficial submission shortly after the task deadline, Wang et al. (2020) outperform the winning system using second-order inference methods with Mean-Field Variational Inference.
Most systems used pretrained Transformers to obtain token representations, either by using the Transformer directly (Kanerva et al., 2020; Grünewald and Friedrich, 2020; He and Choi, 2020) or by passing the encoded representations to BiLSTM layers, where they are combined with other features such as context-free FastText word embeddings (Wang et al., 2020), character features and features obtained from predicted POS tags, morphological features and basic UD trees (Barry et al., 2020), or are used as frozen embeddings (Hershcovich et al., 2020). The only transition-based system among the participating teams (Hershcovich et al., 2020) used a stack-LSTM architecture (Dyer et al., 2015). Ek and Bernardy (2020) and Dehouck et al. (2020) combine basic dependency parsers and a rule-based system to generate EUD graphs from the predicted trees.

Official System Overview
This section describes our official system, which is the system we submitted prior to the competition deadline. The architecture of our system is shown in Figure 1.[3] The raw text test files for each language contain a mixture of test data covering multiple treebanks, so participants do not know their exact domain. For our official system, we choose the model trained on the treebank with the largest amount of training data (in terms of sentences) for each language to process the test files. This heuristic corresponds to using Czech-PDT for Czech, Dutch-Alpino for Dutch, English-EWT for English, Estonian-EDT for Estonian and Polish-PDB for Polish.

Pre-processing
For sentence segmentation, tokenization and the prediction of the base UD features (all UD features apart from the enhanced dependency graphs and miscellaneous items in CoNLL-U files), we use the Stanza library (Qi et al., 2020), trained for each treebank on version 2.7 of the UD treebanks released as part of the training data for the shared task.[4] Note that our parser does not presuppose any input features other than the input text, but we predict the base features using our pre-processing pipeline for completeness and to enable possible additional post-processing, such as altering enhanced dependency labels with lemma information.

Enhanced UD Parsing
For the enhanced UD parser, we use a Transformer encoder in the form of XLM-R (Conneau et al., 2020) with a first-order arc-factored model which utilizes the edge and label scoring method of Kiperwasser and Goldberg (2016). In initial experiments, we found this model to perform better than biaffine attention (Dozat and Manning, 2016) for the task of EUD parsing. This finding was also made by Lindemann et al. (2019) and Straka and Straková (2019) for the task of semantic parsing across numerous Graphbanks (Oepen et al., 2019). Straka and Straková (2019) suggest that biaffine attention may be less suitable for predicting whether an edge exists between any pair of nodes using a predefined threshold, and is perhaps more suited to dependency parsing, where words compete with one another to be classified as the head in a softmax layer. The consistency of these findings across EUD and semantic parsing Graphbanks may provide evidence that enhanced UD parsing is closer to semantic dependency parsing than to basic UD parsing.
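The contrast drawn above can be made concrete with a toy example (our own illustration, not code from any of the cited systems): a softmax over candidate heads forces exactly one winner, whereas independent sigmoid decisions per candidate edge allow a word to receive several heads, or none.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# Toy scores of three candidate heads for a single word.
head_scores = [2.0, 1.8, -3.0]

# Softmax (basic dependency parsing): heads compete, only the argmax wins.
softmax_head = max(range(3), key=lambda h: softmax(head_scores)[h])

# Sigmoid + 0.5 threshold (graph parsing): each candidate head is an
# independent yes/no decision, so several heads can be selected at once.
sigmoid_heads = [h for h, s in enumerate(head_scores)
                 if 1.0 / (1.0 + math.exp(-s)) > 0.5]
```

Here the softmax picks only head 0, while the sigmoid threshold admits both head 0 and head 1, which is exactly the behaviour a graph-structured representation needs.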
Parser Implementation Given a sentence x of length n, our model computes vector representations R = (r_1, r_2, ..., r_n) for the predicted tokens (x_1, x_2, ..., x_n). Since the WordPiece tokenization (Wu et al., 2016) of XLM-R differs from the tokenization used in UD, we track the mapping I from XLM-R's k-th sub-word unit of the j-th input token produced by Stanza to the sub-word unit's position I_{j,k} in the context of the sentence, and we take the output vector e_{I_{j,1}} of the first sub-word unit of each word x_j as its vector representation:

r_j = Filter(E, I)_j = e_{I_{j,1}}

where E = (e_1, ..., e_N) are the output vectors of all sub-word units, N is the total number of sub-word units in the sentence, and Filter() chooses the first embedding for each token. We add a dummy representation of the same dimensionality for the ROOT token to the sequence of vectors R but mask out predictions from this token.

Following Kiperwasser and Goldberg (2016), these representations R = (r_1, r_2, ..., r_n) are then passed to the dependency parsing component, where the feature function φ is the concatenation of the representations of a potential head word x_h and dependent word x_d:

φ(h, d) = r_h • r_d

where • denotes concatenation.

Edge Prediction We compute scores for all n(n − 1) potential edges (h, d), h ≠ d, with an MLP:

score_edge(h, d) = MLP_edge(φ(h, d))

The edge classifier computes scores for all possible head-dependent pairs, and we apply a sigmoid to the resulting matrix of scores to obtain probabilities. We use an edge prediction threshold of 0.5, i.e. we include all edges with a probability above 0.5 in the preliminary EUD graph. This enables words to have multiple heads, but it can also lead to words receiving no head, in which case we manually select the edge with the highest probability, and to fragmented graphs (see post-processing in Section 3.3).
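The edge-selection rule just described can be sketched in pure Python as follows (our own illustration; function names are ours, and the real model scores edges with batched tensor operations): edges with probability above 0.5 are kept, and a dependent left without any head falls back to its highest-scoring candidate.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def select_edges(edge_scores, threshold=0.5):
    """edge_scores[h][d]: raw MLP score for the edge h -> d; index 0 is the
    notional ROOT, which can head other words but never takes a head itself.
    Returns a set of (head, dependent) pairs."""
    n = len(edge_scores)
    edges = set()
    # Keep every edge whose sigmoid probability exceeds the threshold;
    # a word may therefore end up with several heads.
    for h in range(n):
        for d in range(1, n):
            if h != d and sigmoid(edge_scores[h][d]) > threshold:
                edges.add((h, d))
    # Fallback: a word left without any head is attached to its single
    # most probable candidate head instead.
    for d in range(1, n):
        if not any((h, d) in edges for h in range(n)):
            best = max((h for h in range(n) if h != d),
                       key=lambda h: edge_scores[h][d])
            edges.add((best, d))
    return edges
```

For example, with scores that give word 1 two confident heads and word 2 none, word 1 keeps both heads and word 2 is rescued by the fallback rule.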

Label Prediction
To label the graph, we then choose a label for each edge using a separate classifier:

score_label(h, d) = MLP_label(φ(h, d))

The scores for all possible labels are passed to a softmax layer, which outputs the probability of each label for the edge (h, d), and we select the label with the highest probability for each edge.
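As an illustration of this selection step (our own sketch, not the authors' code), the softmax and argmax can be written out explicitly:

```python
import math

def choose_label(label_scores, labels):
    """Softmax over the label scores for one edge, then pick the label
    with the highest probability."""
    m = max(label_scores)
    exps = [math.exp(s - m) for s in label_scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    return labels[max(range(len(labels)), key=probs.__getitem__)]
```

Note that the argmax of the softmax probabilities equals the argmax of the raw scores, so the softmax matters for the training loss rather than for decoding.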
Loss Function For edge prediction, sigmoid cross-entropy loss is used, and for label prediction, as we want to select one label for each chosen edge, softmax cross-entropy loss is used (Dozat and Manning, 2018). We interpolate between the loss given by the edge classifier and the loss given by the label classifier (Dozat and Manning, 2018; Wang et al., 2020) with a constant λ:

L = λ L_edge + (1 − λ) L_label

Training Details Enhanced UD graphs frequently contain empty (elided) nodes; during training, we add these to the graph and offset the head indices to account for the added token(s). At test time, we do not predict whether an elided token should be added to the graph. Due to time constraints, we trained using the full lexicalized enhanced dependency labels, but we intend to devise a delexicalization and relexicalization procedure in future work.
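The interpolated loss can be sketched as follows (a minimal stdlib illustration, assuming the ordering L = λ·L_edge + (1 − λ)·L_label implied by the sentence above; the exact form in released code may differ, and function names are ours):

```python
import math

def binary_cross_entropy(p, y):
    # p: predicted edge probability (after the sigmoid); y: 1 if the gold
    # edge exists, else 0. eps guards against log(0).
    eps = 1e-12
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

def interpolated_loss(edge_probs, gold_edges, gold_label_probs, lam=0.1):
    """edge_probs: dict (h, d) -> sigmoid probability for each candidate edge;
    gold_edges: set of gold (h, d) pairs;
    gold_label_probs: softmax probability assigned to the gold label of each
    gold edge. lam = 0.1 as used in our experiments."""
    l_edge = sum(binary_cross_entropy(p, 1 if e in gold_edges else 0)
                 for e, p in edge_probs.items()) / len(edge_probs)
    l_label = sum(-math.log(p + 1e-12)
                  for p in gold_label_probs) / len(gold_label_probs)
    return lam * l_edge + (1 - lam) * l_label
```

With λ = 0.1, most of the weight sits on the label loss; a perfect prediction drives the combined loss to (numerically) zero.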

Post-processing
In the Enhanced UD guidelines, the predicted structure must be a connected graph in which all nodes are reachable from the notional root.[5] After predicting the test files, we use the graph connection tool of Barry et al. (2020) to make sure that each sentence is a connected graph. Specifically, we repeatedly check for unreachable nodes and, for each of them, the number of unreachable nodes that can be reached from it. We choose the candidate which maximises this number (in the case of ties, we choose the first node in surface order) and make it a child of the notional ROOT, i.e. this node becomes an additional root node. System outputs are then validated at level 2 by the UD validator[6] to catch bugs prior to submission.
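The repair loop can be sketched in pure Python as follows (our own re-implementation of the idea for illustration; the actual system uses the tool from Barry et al. (2020)):

```python
from collections import deque

def reachable(heads_of, start, n):
    """heads_of: dict dep -> set of heads; nodes are 1..n, 0 is the notional
    ROOT. Returns the set of nodes reachable from `start` via head->dep arcs."""
    children = {h: set() for h in range(n + 1)}
    for d, hs in heads_of.items():
        for h in hs:
            children.setdefault(h, set()).add(d)
    seen, queue = {start}, deque([start])
    while queue:
        u = queue.popleft()
        for v in children.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def connect_graph(heads_of, n):
    """Repeatedly attach to ROOT the unreachable node from which the most
    unreachable nodes can be reached (ties: lowest surface position)."""
    while True:
        ok = reachable(heads_of, 0, n)
        missing = [v for v in range(1, n + 1) if v not in ok]
        if not missing:
            return heads_of
        coverage = lambda v: len(reachable(heads_of, v, n) - ok)
        best = max(missing, key=lambda v: (coverage(v), -v))
        heads_of.setdefault(best, set()).add(0)  # becomes an extra root
```

For a fragment where node 2 heads nodes 1 and 3 but itself has no head, the loop attaches node 2 to ROOT, after which the whole graph is reachable.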

Experiments
In this section, we discuss our official results and then describe post-deadline experiments that improved our submission's score. Model hyperparameters are listed in Table 1. The choice of XLM-R encoder (Base or Large) determines the hyperparameters of the encoder part of our model. In our official submission, we use XLM-R Base. A dropout value of 0.35 is used for the input embeddings as well as for the encoder and MLP networks. A loss interpolation constant λ of 0.1 is used, as in Wang et al. (2020).

Official Submission
For the official submission, we use the Stanza pre-processing pipeline and our dependency parsing model with XLM-R Base. The results are listed in column [1] of Table 2.

[5] In UD, the notional ROOT is the token with ID 0, whereas a root node is any node that has 0 as its head.
[6] https://github.com/UniversalDependencies/tools/blob/master/validate.py

Trankit Pre-processing
In a post-deadline experiment, we replace the Stanza pre-processing pipeline (which uses Word2Vec and FastText embeddings as external input features and a BiLSTM encoder) with Trankit (Nguyen et al., 2021), which uses the Transformer XLM-R as the encoder. The results from adopting Trankit for sentence segmentation and tokenization are listed in column [2] of Table 2. We notice slight improvements for most languages, with the notable exceptions of Arabic, Dutch and Slovak, where the better pre-processing accounts for a 24.2%, 25.5% and 11.7% relative error reduction, respectively.

XLM-R Large
Our next modification is to leverage the XLM-R Large model. This model has roughly twice as many parameters as the XLM-R Base model used in our official submission. The results for combining Trankit pre-processing and using XLM-R Large are listed in column [3] of Table 2. The larger capacity of the model translates to large relative error reductions, particularly for Finnish, French, Latvian, Lithuanian, Swedish, Tamil and Ukrainian. Given the improvements seen by adopting both Trankit for pre-processing and the larger XLM-R Large model, we incorporate these modifications into all further experiments.

Treebank Concatenation
In our official system, we used just one treebank per language. Our next experiment investigates the effect of concatenating all treebanks with enhanced UD annotations for a language. We hypothesize that there could be a positive transfer from learning similar (within-language) treebanks and that it would make our parser more robust to the multiple domains in the test data. This means that for Czech we concatenate the PDT, CAC and FicTree treebanks; for Dutch, Alpino and LassySmall; for English, EWT and GUM; and for Estonian, EDT and EWT. For Tamil, we concatenate the English EWT and GUM training data to Tamil to address the very poor evaluation score of our official submission, taking inspiration from Wang et al. (2020), who observe substantial positive effects when they add Czech and English data to the Tamil treebank.[7] The results are listed in column [4] of Table 2. Treebank concatenation helps for all languages, but most notable is the improvement of over 12 points ELAS, or a relative error reduction of 24%, for Tamil, the language with the least amount of training data in the task.

Table 3: Evaluation scores on the official test data on the language-specific test files submitted by each team. We also include the official reference system (off. reference), which copies the gold tree to the enhanced graph, as well as our best post-deadline run (our best run), which corresponds to the +Concat+MTL run in Table 2. The first and second top-scoring models in each language are marked in black and blue, respectively.
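Because CoNLL-U files are plain text with blank-line-separated sentences, the concatenation step itself is simple; a minimal sketch (file paths and the helper name are illustrative, not from our released code):

```python
from pathlib import Path

def concat_conllu(paths, out_path):
    """Join several CoNLL-U training files into one, keeping exactly one
    blank line between sentences and a trailing blank line at the end."""
    chunks = []
    for p in paths:
        text = Path(p).read_text(encoding="utf-8").strip()
        if text:
            chunks.append(text)
    Path(out_path).write_text("\n\n".join(chunks) + "\n\n", encoding="utf-8")
```

For Czech, for instance, the call would combine the PDT, CAC and FicTree training files into a single training set.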

Joint Learning of Basic and Enhanced Dependency Parsing
The official reference system submitted by the shared task organizers, which copies the gold trees to the enhanced representation, performs very well, with 79.87 ELAS (see Table 3). Thus, there is evidence that the basic tree and the enhanced graph share a great deal of information. Previous methods which have leveraged the basic representation for producing EUD graphs (see Sec. 2) have focused on using heuristic rules to convert the basic tree to EUD (Schuster and Manning, 2016; Ek and Bernardy, 2020; Dehouck et al., 2020), using the basic tree as input features to the enhanced parsing model (Barry et al., 2020), or converting the enhanced graph to a richer basic representation (Kanerva et al., 2020).

[7] We did not include Czech to reduce training time.
In our final experiment, we try to leverage the information from the basic tree by jointly learning to predict the enhanced graph and the basic tree, testing whether performing basic dependency parsing and EUD parsing in a multitask setup is beneficial for EUD parsing. Given the positive effects seen through concatenation, for those languages where we performed concatenation, we also train multitask models on the concatenated versions of the treebanks. We use our EUD parsing model as in Section 3 and integrate an additional basic dependency parsing component (shown in the right part of Figure 1), namely the biaffine parsing model of Dozat and Manning (2016), and train both parsers jointly. The losses of the two components are combined with equal weight. The results are listed in column [5] of Table 2.
Single Treebanks First, we compare the multitask model to the XLM-R Large run for languages where we did not perform concatenation. Predicting the basic tree and the enhanced graph in a multitask setting yields improvements for all languages, particularly for Arabic, French and Ukrainian.

Multitask Model and Treebank Concatenation
When used alongside treebank concatenation, multitask learning helps for Dutch, Estonian and Tamil, where it provides additional performance gains. It is interesting to note that concatenation alone is more helpful for Czech and English, where multitask learning causes slight performance drops, and that multitask learning is also not helpful when training on the concatenated Polish treebanks.
The positive contribution of multitask learning for all languages when not performing treebank concatenation could mean that it would be useful in settings where only one treebank with the enhanced representation is available for a language, with the basic tree used as auxiliary information to predict the enhanced representation.
Comparison to Official Systems Our best unofficial run, +Concat+MTL, is added to Table 3. Compared to the other official runs, the ELAS score of this run ranks second for 13 of the 17 languages and first for Italian and Russian.

Conclusion
We have described the DCU-EPFL submission to the IWPT 2021 Shared Task on Parsing into Enhanced Universal Dependencies. Our approach uses a single multilingual Transformer encoder together with an enhanced dependency parsing component. Our official system placed 6th out of 9 teams. In post-deadline experiments, we show how our submission can be improved by leveraging better upstream pre-processing, a larger encoder, and treebank concatenation, as well as by introducing a multitask parser that parses basic trees and enhanced graphs jointly.