End-to-end mBERT-based Seq2seq Enhanced Dependency Parser with Linguistic Typology Knowledge

We describe the NUIG solution for the IWPT 2021 Shared Task on Enhanced Dependency (ED) parsing in multiple languages. For this shared task, we propose and evaluate an end-to-end Seq2seq mBERT-based ED parser which predicts the ED parse-tree of a given input sentence as a relative head-position tag-sequence. Our proposed model is a multitasking neural network which performs five key tasks simultaneously, namely UPOS tagging, UFeat tagging, lemmatization, dependency parsing and ED parsing. Furthermore, we utilise the linguistic typology knowledge available in the WALS database to improve the ability of our proposed end-to-end parser to transfer across languages. Results show that our proposed Seq2seq ED-parser performs on par with state-of-the-art ED-parsers despite having a much simpler design.


Introduction
The Enhanced Universal Dependency (EUD) parsing framework (Schuster and Manning, 2016; Nivre et al., 2020) is an interesting extension of the standard dependency parsing framework: it provides significant additional syntactic and semantic knowledge that is missing from a standard dependency parse-tree. Such additional knowledge can be crucial for numerous downstream NLP tasks.
The IWPT 2021 Shared Task (Bouma et al., 2021) requires the participants to perform enhanced dependency parsing of the given test sentences, in addition to predicting the sentence boundaries, token boundaries, lemmatization, POS tags, morphological features and the basic dependency relations. The participants are provided with blind test-corpora in 17 languages, and are expected to perform enhanced dependency parsing on each sentence within these test corpora and submit the results (in the CoNLL-U format).
For this IWPT 2021 Shared Task (Bouma et al., 2021) we propose and evaluate the performance of an end-to-end mBERT-based Seq2seq ED-parser which performs five key tasks, namely UPOS tagging, UFeats prediction, lemmatization, dependency parsing and enhanced dependency parsing, in a multitasking setting.
Our proposed model is an extension of the popular UDify model (Kondratyuk and Straka, 2019), the state-of-the-art mBERT-based multilingual dependency parser, and is inspired by the end-to-end Seq2seq dependency parser of Li et al. (2018). We describe the UDify model in Section 2.
We trained our proposed ED-parser on a large joint polyglot corpus created by concatenating all the treebanks in the training dataset provided for the IWPT 2021 Shared Task, and evaluated it on eight of the 17 provided blind test-corpora.
Furthermore, similar to previous approaches (Ammar et al., 2016), we utilized the linguistic typology knowledge available in the World Atlas of Language Structures (WALS) database (Haspelmath, 2009) to improve the cross-lingual transfer ability of our proposed ED-parser. We fed these typology features together with token-ids into the proposed ED-parser. We describe the architecture of our end-to-end mBERT-based Seq2seq ED-parser in detail in Section 3.
2 Background and Related Work

2.1 Seq2seq Dependency Parser

Li et al. (2018) proposed a Seq2seq architecture to perform end-to-end dependency parsing. The approach represents the entire dependency parse-tree of a given input sentence as a relative head-position tag-sequence (of the same length as the input sentence). Figure 1 depicts a labelled and an unlabelled parse-tree represented by their respective relative head-position tag-sequences. Subsequently, the approach trains a standard LSTM-based model to predict the relative head-position tag for each token within an input sentence. Results outlined in the paper show that this end-to-end parser performs as well as the state-of-the-art deep biaffine network (Dozat and Manning, 2016) while being much simpler in design.

Figure 1: Examples of dependency parse-trees represented as relative head-position tag-sequences by Li et al. (2018)

Figure 2: Example enhanced dependency parse-trees represented as relative head-position tag-sequences
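To make the tag-sequence representation concrete, here is a minimal Python sketch of the encoding for an unlabelled tree; the tag names (ROOT, L/R offsets) are illustrative and not necessarily the exact tag set used by Li et al. (2018).

```python
def encode_relative_heads(heads):
    """Encode a dependency tree as a relative head-position tag sequence.

    `heads[i]` is the 1-indexed head of token i+1, with 0 meaning the
    (virtual) root. Tag names are illustrative.
    """
    tags = []
    for i, h in enumerate(heads, start=1):
        if h == 0:
            tags.append("ROOT")
        elif h > i:
            tags.append(f"R{h - i}")   # head lies h-i tokens to the right
        else:
            tags.append(f"L{i - h}")   # head lies i-h tokens to the left
    return tags

# "the dog barks": the -> dog, dog -> barks, barks -> root
print(encode_relative_heads([2, 3, 0]))  # ['R1', 'R1', 'ROOT']
```

Note that the output sequence has exactly one tag per input token, which is what lets a standard sequence-tagging model predict the whole tree.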

2.2 UDify
UDify is an mBERT-based multilingual model which simultaneously performs four key language-processing tasks, namely UPOS tagging, UFeat tagging, lemmatization and dependency parsing, in a multitasking framework. The model utilizes a single shared mBERT-based encoder and four individual task-specific decoders, one for each task.
The mBERT encoder takes in the entire sentence as input, tokenizes it using the pre-trained WordPiece tokenizer (Wu et al., 2016) and subsequently outputs mBERT-based (Wu and Dredze, 2019) contextualized embeddings for each word within the input sentence. We refer to the original UDify (Kondratyuk and Straka, 2019) paper for a detailed description of the mechanism of computing/fine-tuning such contextualized embeddings.
The decoders for both the UPOS-tagging and UFeat-tagging tasks adopt a standard sequence-tagging architecture with a softmax layer on top. These decoders accept the contextual embeddings generated by the mBERT encoder for each word in the input sentence, and predict its UPOS/UFeats tag.
For the lemmatization task as well, the model uses a standard sequence-tagger which predicts, for each word, a class-tag representing a unique edit script. An edit script is simply the sequence of character operations that transforms a word form into its lemma form.
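As an illustration, the sketch below derives a crude edit script (keep the longest common prefix, then cut trailing characters and append a suffix) and applies it. This is a hedged simplification: the real edit-script classes used by UDify are more elaborate.

```python
def edit_script(form, lemma):
    """Derive a simple edit script mapping a word form to its lemma.

    Returns (cut, add): delete `cut` trailing characters of the form,
    then append the string `add`. Simplified for illustration.
    """
    p = 0  # length of the longest common prefix
    while p < min(len(form), len(lemma)) and form[p] == lemma[p]:
        p += 1
    return (len(form) - p, lemma[p:])

def apply_script(form, script):
    """Apply an edit script produced by edit_script()."""
    cut, add = script
    return form[:len(form) - cut] + add

# edit_script("running", "run") -> (4, ''); edit_script("mice", "mouse") -> (3, 'ouse')
```

Because many word/lemma pairs share the same script (e.g. all regular "-ing" forms), each script can serve as a class label for a sequence tagger.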
For dependency parsing, the model adopts the popular deep biaffine architecture (Dozat and Manning, 2016) for graph-based parsing, with the LSTM encoder replaced by the shared mBERT encoder.

3 mBERT based Seq2seq ED Parser

Figure 2b depicts the architecture of the proposed ED parser. Our proposed end-to-end ED parser is an extension of the UDify (Kondratyuk and Straka, 2019) model described in Section 2.2, with one additional component, namely the Relative Head Sequence predictor, which predicts the relative head-position tag-sequence representing the unlabelled enhanced-dependency parse-tree of the input sentence (as the fifth auxiliary task in the multitasking UDify model).

Table 1: Hyper-parameters used in the experiments.
3.1 ED parse-tree as relative head-position tag sequence

Given a sentence of length T, its unlabelled ED parse-tree can be represented by a relative-head tag-sequence of length T′ such that T′ ≥ 2T + 1. Figure 2 depicts the representations of sample unlabelled enhanced-dependency parse-trees as their respective relative head-position tag-sequences. Here, the tag <b> marks the start of the tag group for the next token, whose heads are pointed to by the subsequently predicted relative head-position tags (until the next <b> tag is predicted).
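A minimal sketch of how an enhanced graph, in which a token may have several heads, could be linearized with <b> markers; the tag names are illustrative, not the paper's exact inventory.

```python
def encode_ed_graph(heads):
    """Encode an enhanced-dependency graph as a relative head-position
    tag sequence.

    `heads[i]` is the list of 1-indexed heads of token i+1 (0 = root);
    enhanced graphs allow several heads per token. Tag names are
    illustrative.
    """
    seq = []
    for i, hs in enumerate(heads, start=1):
        seq.append("<b>")              # start of token i's head tags
        for h in hs:
            if h == 0:
                seq.append("ROOT")
            elif h > i:
                seq.append(f"R{h - i}")
            else:
                seq.append(f"L{i - h}")
    seq.append("<end>")
    return seq

# Three tokens, each with one head: T <b> tags + T head tags + <end>
# gives length 2T + 1 = 7, matching the lower bound in the text.
print(encode_ed_graph([[2], [0], [2]]))
```

Each extra head of a token adds one more tag after its <b> marker, which is why the sequence length is at least 2T + 1 rather than exactly T.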

Relative Head Sequence predictor
As evident in Figure 2b, our Relative Head Sequence predictor is a standard LSTM-based Seq2seq neural network (Sutskever et al., 2014) which takes the encoding vector of the entire input sentence as input, and sequentially predicts the relative head-position tag-sequence, one tag at a time.

Input sentence-encoding
The sentence-encoding e_X ∈ R^d of any input sentence X is computed as

e_X = FFN([BERT(X) ; TY_l])    (1)

Here BERT(X) is the output embedding-vector from UDify's shared mBERT encoder for the end-of-sentence token </s> of the input sentence, and TY_l is the linguistic-typology vector of the language l being parsed. Each value within TY_l represents a single typology feature from the WALS (Haspelmath, 2009) database, taking a specific integer value. Equation 1 involves the concatenation of the BERT output and the typology vector, followed by dimension reduction through a feed-forward network.
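As a hedged sketch of this computation in pure Python, with a single linear layer standing in for the feed-forward network (the weights W and bias b below are placeholders, not trained values):

```python
def sentence_encoding(bert_vec, typ_vec, W, b):
    """Compute e_X = FFN([BERT(X); TY_l]) with a single linear layer.

    bert_vec: mBERT embedding of the </s> token (list of floats)
    typ_vec:  integer typology vector TY_l for the language
    W, b:     placeholder linear-layer weights and bias
    """
    x = bert_vec + list(typ_vec)       # concatenation [BERT(X); TY_l]
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b_j
            for row, b_j in zip(W, b)] # dimension reduction

# Toy example: 2-dim BERT vector, 1 typology feature, reduce 3 -> 2 dims.
print(sentence_encoding([1.0, 2.0], [3], [[1, 0, 0], [0, 1, 1]], [0, 0]))
```

In the actual model the feed-forward network would of course include a non-linearity and learned parameters; this sketch only shows the data flow of Equation 1.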
Feeding typology features together with the input sentence can improve the cross-lingual transfer ability of a multilingual model, as shown by Ammar et al. (2016). For the proposed model, we use all the word-order and constituency features in the WALS (Haspelmath, 2009) database, excluding the trivially redundant features excluded by Takamura et al. (2016).
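For illustration, here is a sketch of how such integer-valued WALS features might be assembled into the typology vector TY_l; the feature names and values below are placeholders, not actual WALS codes.

```python
# Hypothetical integer-coded word-order features per language
# (placeholder values, not real WALS data).
WALS_FEATURES = {
    "en": {"81A": 2, "85A": 2, "87A": 2},
    "ja": {"81A": 1, "85A": 1, "87A": 1},
}

def typology_vector(lang, feature_names):
    """Build the integer typology vector TY_l for language `lang`.

    Features missing for a language default to 0.
    """
    feats = WALS_FEATURES.get(lang, {})
    return [feats.get(name, 0) for name in feature_names]

# Same fixed feature order for every language, so vectors are comparable.
print(typology_vector("ja", ["81A", "85A", "87A"]))
```

Keeping a fixed feature order across languages is what makes the vectors directly concatenable with the mBERT output in Equation 1.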

Training
We trained our mBERT-based Seq2seq ED parser on a single large joint polyglot corpus, created by concatenating all the treebanks available in the training dataset provided for the IWPT 2021 Shared Task. Before each training epoch, we randomly shuffle all sentences in our polyglot training corpus, and subsequently feed mixed batches of sentences from this shuffled corpus into the model being trained, where each batch may contain sentences from any language or treebank (as done by the authors of UDify (Kondratyuk and Straka, 2019)).
We optimized the weights of our multitasking model by minimizing the total loss, computed as the sum of the sparse cross-entropy losses of all five tasks, namely UPOS tagging, UFeat tagging, lemmatization, dependency parsing and relative head-position sequence prediction.

Predicting
The ED parsing of any unknown input sentence X = x_1, x_2, ..., x_T can be performed by extracting the most probable correct relative head-position tag-sequence. A correct relative head-position tag-sequence must satisfy the following constraints.
1. The sequence should start with <b> and end with <end>.
2. For each word x_i ∈ X, the relative head-position tag assigned to it should be within the range of the sentence. For example, within the sentence "the house in front of the hill", the word 'the' cannot have the left tags L2, L3, L4, L5, L6, and the word 'hill' cannot have any right tags, as these would point outside the range of the sentence.
3. The label sequence should not generate any cycles within the dependency tree.
4. One of the words should have its head at the <root> token.

We used dynamic programming with beam search to efficiently extract, out of all possible sequences, the most probable relative head-position tag-sequence which satisfies the above-listed constraints.
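The constraints above can be checked mechanically. Below is a hedged Python sketch of such a validity check for a complete candidate tag sequence; the tag names are illustrative, and the actual decoder folds these checks into the beam search rather than validating finished sequences.

```python
def valid_tag_sequence(tags, T):
    """Check the four decoding constraints on a candidate relative
    head-position tag sequence for a sentence of length T."""
    # Constraint 1: starts with <b>, ends with <end>.
    if not tags or tags[0] != "<b>" or tags[-1] != "<end>":
        return False
    heads, token = [], 0               # decode per-token head lists
    for tag in tags[:-1]:
        if tag == "<b>":
            token += 1
            heads.append([])
        elif tag == "ROOT":
            heads[-1].append(0)
        else:
            off = int(tag[1:]) * (1 if tag[0] == "R" else -1)
            h = token + off
            if not 1 <= h <= T:        # constraint 2: head in range
                return False
            heads[-1].append(h)
    if token != T:
        return False
    if not any(0 in hs for hs in heads):
        return False                   # constraint 4: some word at <root>
    # Constraint 3: no cycles among the head edges (DFS colouring).
    state = [0] * (T + 1)              # 0=unseen, 1=on stack, 2=done
    def dfs(v):
        state[v] = 1
        for h in heads[v - 1]:
            if h and (state[h] == 1 or (state[h] == 0 and dfs(h))):
                return True
        state[v] = 2
        return False
    return not any(state[v] == 0 and dfs(v) for v in range(1, T + 1))
```

In the real decoder, partial hypotheses violating a constraint can be pruned immediately, which is what makes the dynamic-programming beam search efficient.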

Label Predictor
Figure 2c depicts the architecture of our Label predictor model. It is an mBERT-based multi-class classifier with a softmax layer on top. The model takes as input the token-sequence segment of the input sentence ranging from head to tail, as well as its corresponding predicted POS-tag sequence. The model outputs the probabilities of all possible ED dependency labels that could be assigned to the given relation.
The Label predictor is trained on all ED relationships available in the training dataset for the IWPT 2021 Shared Task. The parameters of the mBERT encoder of our Label predictor are initialized with the parameters of the fine-tuned mBERT encoder of our Relative Head Sequence predictor.

Experiments
As already explained, our proposed end-to-end Seq2seq ED-parser is trained on a large joint polyglot corpus created by concatenating all the treebanks in the training dataset provided for the IWPT 2021 Shared Task. We evaluated our parser on the test corpora provided for the IWPT 2021 Shared Task in eight distinct languages, namely Bulgarian, Estonian, English, Latvian, Lithuanian, Russian, Slovak and Swedish. We outline the results achieved by our proposed model in detail in Section 5. Table 1 outlines the hyper-parameters used in the experiments. These values were obtained by minimizing the training loss on the English-EWT corpus in the dev dataset provided for the IWPT 2021 Shared Task. Table 2 outlines the results achieved by our proposed end-to-end mBERT-based Seq2seq ED-parser on all eight blind test-corpora on which the model was evaluated, as calculated by the evaluation script for the shared task.

Results and Conclusion
Appendix A compares the results achieved by our ED-parser with the results achieved by the other participants of the IWPT 2021 Shared Task. Table 3 outlines the average results achieved by all the models proposed in the IWPT 2021 Shared Task for all eight test languages. It is evident that our model performs on par with other state-of-the-art ED-parsers despite being much simpler in design: as an end-to-end model, it is much easier to train and implement.

A Results
This section compares the results achieved by our ED-parser with the results achieved by the other participants of the IWPT 2021 Shared Task.