Biaffine Dependency and Semantic Graph Parsing for Enhanced Universal Dependencies

This paper presents the system used in our submission to the IWPT 2021 Shared Task. This year the official evaluation metric was ELAS, so that basic dependency parsing could have been skipped, as well as other pipeline stages like POS tagging and lemmatization. We nevertheless chose to deploy a combination of a dependency parser and a graph parser. The dependency parser is a biaffine parser that uses transformers for representing input sentences, with no other features. The graph parser is a semantic parser that exploits a similar architecture, except for using a sigmoid cross-entropy loss function so that multiple arcs can be predicted for the same dependent. The final output is obtained by merging the outputs of the two parsers. The dependency parser achieves top or close to top LAS performance with respect to other systems that report results on this metric, except on low-resource languages (Tamil, Estonian, Latvian).


System Overview
The IWPT 2021 Shared Task aims specifically at enhanced dependency parsing, starting from raw text, in a multi-language setting covering seventeen languages (Bouma et al., 2021).
We concentrate on the syntactic parsing and enhancement stages, exploiting existing tools for tokenization and sentence splitting.

Syntactic parsing
State-of-the-art dependency parsers often adopt a graph-based model, relying on neural networks for scoring arcs and labels.
In particular, the Bi-LSTM-based deep biaffine neural dependency parser by Dozat and Manning (2017) has been quite popular: it was used in three of the five top submissions to the CoNLL 2018 Shared Task on Multilingual Parsing from Raw Text to Universal Dependencies (Zeman et al., 2018), including the top non-ensemble submission (Kanerva et al., 2018).
We trained our own models for each language on the shared task treebanks using DiaParser, which uses the Stanza tokenizer and multi-word splitter.

DiaParser
DiaParser is a dependency parser derived from Supar, which exploits transformers to obtain contextualized word representations. These representations are obtained by first applying the transformer's own tokenizer, which splits words into wordpieces; the embedding of each word is then computed as the average of its wordpiece embeddings.
The code for the parser is available on GitHub. We exploit the idea of providing hints to the parser, inspired by structural syntax probes (Hewitt and Manning, 2019): a syntax probe could be used to estimate the most likely edges of the parse tree. Eventually a quite simple solution proved effective: extracting the values of one of the attention layers of the transformer (typically layer 6) and adding them to the scores of the biaffine layer, weighted by a trainable parameter α.
One may consider a transformer as computing three functions: the outputs $T_o : \mathbb{R}^{n\times d} \to \mathbb{R}^{n\times d}$, the hidden states $T_h : \mathbb{R}^{n\times d} \to \mathbb{R}^{L\times n\times d}$, and the attention weights $T_a : \mathbb{R}^{n\times d} \to \mathbb{R}^{H\times L\times n\times n}$, for $H$ heads and $L$ layers.
Given a sentence with $n$ words $w = [w_1, w_2, \ldots, w_n]$, we feed the parser with $E = [e_1, \ldots, e_n]$, where $e_i = \mathrm{mix}_l(T_h(w))_i$ is the scalar mix of the top $l$ hidden layers of the transformer $T$ applied to $w$ (Liu et al., 2019a).
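The following is a minimal sketch of how such word representations can be built: a softmax-weighted scalar mix of the top hidden layers, followed by averaging the wordpiece embeddings of each word. Tensor shapes and the helper data `wordpiece_spans` are hypothetical, not the parser's actual internals.

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Softmax-weighted average of the top-l hidden layers (Liu et al., 2019a)."""
    def __init__(self, n_layers: int):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(n_layers))  # one weight per layer
        self.gamma = nn.Parameter(torch.ones(1))             # global scaling factor

    def forward(self, layers):
        # layers: list of l tensors of shape [n_wordpieces, d]
        w = torch.softmax(self.weights, dim=0)
        return self.gamma * sum(wi * hi for wi, hi in zip(w, layers))

def word_embeddings(hidden_states, wordpiece_spans, mix):
    """hidden_states: tuple of [n_wordpieces, d] tensors, one per transformer layer.
    wordpiece_spans[i]: list of wordpiece indices belonging to word i."""
    top_l = hidden_states[-len(mix.weights):]     # keep only the top-l layers
    mixed = mix(top_l)                            # [n_wordpieces, d]
    # e_i = average of the wordpiece embeddings of word i
    return torch.stack([mixed[span].mean(dim=0) for span in wordpiece_spans])
```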
The attentive parser scores each possible arc $(i, j)$ of sentence $w$ as:

$$s_{ij} = \mathrm{Biaff}(e_i, e_j) + \alpha\, A^{(l,h)}_{ij} \qquad (1)$$

where $\alpha$ is a learned weight and $A^{(l,h)} = T_a(w)_{h,l}$ are the attention weights of the transformer $T$ for a given layer $l$ and head $h$; arc probabilities are then obtained by normalizing these scores as in the standard biaffine parser.
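A minimal sketch of Equation (1), assuming precomputed biaffine scores and transformer attention weights (the module name, the default head index and the omission of the batch dimension are illustrative assumptions):

```python
import torch
import torch.nn as nn

class AttentiveArcScorer(nn.Module):
    """Adds transformer attention weights to biaffine arc scores with a learned weight alpha."""
    def __init__(self, layer: int = 6, head: int = 0):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(1))   # trainable weight for the attention hint
        self.layer, self.head = layer, head

    def forward(self, biaffine_scores, attentions):
        # biaffine_scores: [n, n] scores Biaff(e_i, e_j) from the biaffine layer
        # attentions: tuple of [H, n, n] attention matrices, one per layer
        A = attentions[self.layer][self.head]       # [n, n] attention of the chosen head
        return biaffine_scores + self.alpha * A     # Equation (1), before normalization
```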
During prediction the syntactic parser applies the Chu-Liu/Edmonds algorithm (Chu, 1965; Edmonds, 1967) to ensure the well-formedness of the parse tree, but only after a quick check shows that the predicted arcs contain a cycle.
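The preliminary check can be sketched, under the assumption that each token has a single greedily-predicted head, by following head pointers and looking for a repeated node; the more costly maximum spanning tree algorithm is invoked only when a cycle is found (function names are hypothetical):

```python
def has_cycle(heads):
    """heads[i] is the predicted head of token i; index 0 denotes the root."""
    for start in range(1, len(heads)):
        seen = set()
        node = start
        while node != 0:            # walk up the head chain until the root
            if node in seen:
                return True         # revisited a node: the arcs contain a cycle
            seen.add(node)
            node = heads[node]
    return False

# heads = arc_scores.argmax(...) gives the greedy tree; run Chu-Liu/Edmonds only if needed:
# if has_cycle(heads): heads = chu_liu_edmonds(arc_scores)
```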
The results we obtained with this extension on the English development corpus were 92.21 UAS and 90.31 LAS, using Electra (Clark et al., 2020) as the transformer for both the word representations and the attention hints, a small improvement over the 91.32 UAS and 89.33 LAS obtained without these features.

Semantic Graph Parser
The graph parser uses the approach of Dozat and Manning (2018).
The graph parser shares the same architecture as the biaffine dependency parser, except for using a sigmoid cross-entropy loss function instead of a softmax, in order to allow predicting multiple arcs for the same dependent. Arcs with a logit value greater than zero are retained.
The scores of each pair of words in $w$ are decoded into a graph by keeping only the edges that received a positive score. Labels are assigned to each predicted edge by choosing the highest-scoring label for that edge.
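This decoding step can be sketched as follows, assuming edge logits of shape [n, n] and label scores of shape [n, n, n_labels] (the function and tensor names are illustrative):

```python
import torch

def decode_graph(edge_logits, label_scores):
    """Keep edges with positive logit and label each with its best-scoring label."""
    keep = edge_logits > 0                      # sigmoid(x) > 0.5  <=>  x > 0
    best_labels = label_scores.argmax(dim=-1)   # highest-scoring label per (head, dep)
    edges = []
    for head, dep in keep.nonzero().tolist():
        edges.append((head, dep, best_labels[head, dep].item()))
    return edges
```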
The losses of the edge and label predictors are combined through a hyper-parameter $\lambda \in (0, 1)$: $\mathcal{L} = \lambda\, \mathcal{L}^{(label)} + (1 - \lambda)\, \mathcal{L}^{(edge)}$. The method does not ensure a connected graph, hence we merge its output with the tree produced by the syntactic parser.
The final enhanced dependency arcs are obtained as the union of the arcs predicted by the syntactic and semantic parsers, with a check that no extra arcs to the root are introduced.
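A simplified illustration of this merging step is sketched below as a union of arc sets that skips graph arcs which would introduce an extra arc out of the root; the actual bookkeeping in our system may differ in details.

```python
def merge(tree_arcs, graph_arcs, root=0):
    """Union of tree and graph arcs, keeping a single arc out of the root."""
    merged = set(tree_arcs)                  # (head, dependent, label) triples
    for head, dep, label in graph_arcs:
        if head == root and any(h == root for h, _, _ in merged):
            continue                         # skip extra root arcs from the graph parser
        merged.add((head, dep, label))
    return merged
```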

Tokenization
DiaParser exploits the Stanza tokenizer and multi-word token (MWT) splitter to perform sentence splitting, tokenization and multi-word splitting, automatically downloading the tokenizer models for each language from the Stanza repository. For Italian we trained a specific MWT model on the Italian UD treebank Italian ISST, augmented with a special list of sentences representative of 75 categories of verb conjugations and of articulated prepositions, which we contributed back to the official Stanza distribution.
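For instance, the preprocessing pipeline can be instantiated with Stanza roughly as follows; this is only illustrative, and model versions and options may differ from our setup.

```python
import stanza

# Download the Italian models (including the MWT splitter) from the Stanza repository
stanza.download('it')

# Sentence splitting, tokenization and multi-word token expansion only
nlp = stanza.Pipeline('it', processors='tokenize,mwt')

doc = nlp("Andiamo alla stazione.")   # "alla" is expanded into "a" + "la"
for sentence in doc.sentences:
    print([word.text for word in sentence.words])
```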

Experiments
The syntactic and semantic parsers were trained separately on each language corpus, using language-specific transformer models where available. For languages with more than one corpus, the corpora were simply concatenated into a single training corpus.

Experimental Settings
In training, we used the official train and gold development sets. We used the development set to select the model hyper-parameters based on LAS for the dependency parser and labeled F1 on enhanced dependencies for the semantic graph parser.
We use a batch size of 2000 tokens with the AdamW (Loshchilov and Hutter, 2019) optimizer. The hyper-parameters of our system are shown in Table 2; they are mostly adopted from previous work on dependency parsing.
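A batch size expressed in tokens can be implemented by greedily grouping sentences until a token budget is reached; the sketch below illustrates the idea and is not necessarily the exact bucketing strategy used by the parser.

```python
def token_batches(sentences, max_tokens=2000):
    """Greedily group sentences into batches of at most `max_tokens` tokens."""
    batch, n_tokens = [], 0
    for sent in sentences:
        if batch and n_tokens + len(sent) > max_tokens:
            yield batch
            batch, n_tokens = [], 0
        batch.append(sent)
        n_tokens += len(sent)
    if batch:
        yield batch
```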

Results
The official results of our submission are those labeled unipi-smax, obtained by merging the outputs of the dependency and semantic graph parsers. Table 3 shows our team's official results in tokenization, tagging, parsing and enhancement on the test sets.

Pretrained Multilingual Model
After the submission deadline, we experimented with building a single model on the concatenation of the training corpora of all languages. The corpora were preprocessed to eliminate empty nodes, which represent implicit tokens and are denoted with IDs such as 2.1 in the CoNLL-U file format. We used the official script enhanced_collapse_empty_nodes.pl, which collapses the graphs by reducing such empty nodes into non-empty nodes and introducing new dependency labels. In post-processing, we add empty nodes back according to these dependency labels. Since the official evaluation only scores the collapsed graphs, this process does not affect the system performance.
Then the enhanced dependency labels in the training corpus were de-lexicalized, stripping lexical information from the labels as in Grünewald and Friedrich (2020) and replacing it with placeholders (e.g. obl:[case]) indicating where in the dependency graph the lexical information is expected to be found. This process reduced the total number of enhanced dependency labels from 6125 to 1282.
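A simplified sketch of this de-lexicalization is given below: if the lexical suffix of an enhanced label matches the lemma of a function-word child of the dependent, the suffix is replaced by a placeholder naming that relation. The function name, field layout and the set of function relations are assumptions for illustration, not the exact procedure of Grünewald and Friedrich (2020).

```python
FUNCTION_RELS = ('case', 'mark', 'cc')   # relations whose lemma may appear in labels

def delexicalize(label, dep_children):
    """Replace the lexical part of an enhanced label with a placeholder.

    label: an enhanced dependency label, e.g. "obl:in"
    dep_children: list of (relation, lemma) pairs of the dependent's children
    """
    if ':' not in label:
        return label
    base, lex = label.split(':', 1)
    for rel, lemma in dep_children:
        if rel in FUNCTION_RELS and lemma == lex:
            return f'{base}:[{rel}]'       # e.g. "obl:in" -> "obl:[case]"
    return label
```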
This also made it possible to fit the model into the 32 GB of memory of a V100 GPU. We ran the training in parallel on 4 such GPUs: each epoch took about 45 minutes, and training ran for 29 epochs.
The model was trained using contextualized word embeddings from RoBERTa (Liu et al., 2019b), more precisely xlm-roberta-large from HuggingFace, using a scalar mixture of the top 4 hidden layers (Liu et al., 2019a).
The model was then fine-tuned on each language with its specific language corpus. The enhanced dependency labels in the output of the parser are converted back to their lexicalized form using a heuristic similar to the one outlined in Grünewald and Friedrich (2020): the placeholder in a label is replaced by the lemma of the corresponding function word found in the predicted dependency graph. Furthermore, for languages that have case morphology, like Czech, the case is added to the label.
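The inverse heuristic can be sketched as looking up, in the predicted graph, the function-word child indicated by the placeholder and substituting its lemma back, optionally appending case information. This is again a simplification under assumed data structures, not the exact re-lexicalization procedure.

```python
def relexicalize(label, dep_children, case_feature=None):
    """Replace a placeholder such as "[case]" with the lemma of the matching child.

    dep_children: list of (relation, lemma) pairs of the dependent's children
    case_feature: morphological case of the dependent, if the language marks it
    """
    if '[' not in label:
        return label
    base, placeholder = label.split(':', 1)
    rel = placeholder.strip('[]')
    for child_rel, lemma in dep_children:
        if child_rel == rel:
            label = f'{base}:{lemma.lower()}'
            break
    else:
        label = base                          # no matching child: drop the placeholder
    if case_feature:                          # e.g. Czech: append the case to the label
        label = f'{label}:{case_feature.lower()}'
    return label
```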
The multilingual model does provide significant improvements for languages with smaller corpora, in particular Latvian, Lithuanian and Tamil, as shown in Table 4: notably, Lithuanian improves by 6.75 points in EULAS. The ELAS scores do not improve as much, possibly because the re-lexicalization algorithm may need tuning for each language.

Conclusions
We experimented with using two parsers sharing the same architecture to perform syntactic and semantic parsing. We first trained parser models on the specific corpus for each language. The final output is obtained by merging the outputs of the two parsers. This simple approach works reasonably well for languages with large enough corpora.
To address the difficulty of handling low-resource languages, we explored building a single model trained on all corpora and fine-tuning it on each specific corpus. Since enhanced dependency labels contain lexical parts and the number of such labels is quite large, we adopted a preprocessing step to de-lexicalize the labels. The approach gave promising results on some languages, but the back-conversion algorithm that reintroduces the lexical parts into the labels after parsing still needs to be improved.
Given the similarity of the architectures of the syntactic and semantic parsers, the prospect of performing joint training is promising and has been left for future work.