COMBO: State-of-the-Art Morphosyntactic Analysis

We introduce COMBO – a fully neural NLP system for accurate part-of-speech tagging, morphological analysis, lemmatisation, and (enhanced) dependency parsing. It predicts categorical morphosyntactic features while also exposing their vector representations, extracted from hidden layers. COMBO is an easy-to-install Python package with automatically downloadable pre-trained models for over 40 languages. It maintains a balance between efficiency and quality. As it is an end-to-end system whose modules are jointly trained, its training is competitively fast. As its models are optimised for accuracy, they often achieve better prediction quality than SOTA systems. The COMBO library is available at: https://gitlab.clarin-pl.eu/syntactic-tools/combo.


Introduction
Natural language processing (NLP) has long recognised morphosyntactic features as necessary for solving advanced natural language understanding (NLU) tasks. The enormous impact of contextual language models on presumably all NLP tasks has slightly weakened the importance of morphosyntactic analysis. As morphosyntactic features are encoded to some extent in contextual word embeddings (e.g. Tenney et al., 2019; Lin et al., 2019), doubts arise as to whether explicit morphosyntactic knowledge is still needed. For example, Glavaš and Vulić (2021) have recently investigated intermediate fine-tuning of contextual language models on the dependency parsing task and suggested that this step does not significantly contribute to advancing NLU models. Conversely, Warstadt et al. (2019) reveal the powerlessness of contextual language models in encoding linguistic phenomena like negation. This is in line with our intuition about representing negation in Polish sentences (see Figure 1). It does not seem trivial to differentiate between the contradicting meanings of these sentences using contextual language models, as the word context is similar. The morphosyntactic features, e.g. the parts of speech PART vs. INTJ, and the dependency labels advmod:neg vs. discourse:intj, could be beneficial in determining the correct reading.
In order to verify the influence of explicit morphosyntactic knowledge on NLU tasks, it is necessary to design a technique for injecting this knowledge into models or to build morphosyntax-aware representations. The first research direction was initiated by Glavaš and Vulić (2021). Our objective is to provide a tool for predicting high-quality morphosyntactic features and exposing their embeddings. These vectors can be directly combined with contextual word embeddings to build morphosyntactically informed word representations.
The emergence of publicly available NLP datasets, e.g. Universal Dependencies (Zeman et al., 2019), stimulates the development of NLP systems. Some of them are optimised for efficiency, e.g. spaCy (Honnibal et al., 2020), and others for accuracy, e.g. UDPipe (Straka, 2018), the Stanford system (Dozat and Manning, 2018), and Stanza (Qi et al., 2020). In this paper, we introduce COMBO, an open-source, fully neural NLP system optimised for both training efficiency and prediction quality. Due to its end-to-end architecture, which is an innovation among morphosyntactic analysers, COMBO is faster to train than the SOTA pipeline-based systems, e.g. Stanza. As a result of applying modern NLP solutions (e.g. contextualised word embeddings), it qualitatively outperforms other systems.
COMBO analyses tokenised sentences and predicts morphosyntactic features of tokens (i.e. parts of speech, morphological features, and lemmata) and syntactic structures of sentences (i.e. dependency trees and enhanced dependency graphs). At the same time, its module, COMBO-vectoriser, extracts vector representations of the predicted features from the hidden layers of individual predictors. The COMBO user guide is in §4 and a live demo is available at http://combo-demo.nlp.ipipan.waw.pl.
Contributions 1) We implement COMBO (§2), a fully neural NLP system for part-of-speech tagging, morphological analysis, lemmatisation, and (enhanced) dependency parsing, together with COMBO-vectoriser for revealing vector representations of predicted categorical features. COMBO is implemented as a Python package which is easy to install and to integrate into Python code. 2) We pre-train models for over 40 languages that can be automatically downloaded and directly used to process new texts. 3) We evaluate COMBO and compare its performance with two state-of-the-art systems, spaCy and Stanza (§3).

COMBO Architecture
COMBO's architecture (see Figure 2) is based on its forerunner (Rybak and Wróblewska, 2018) implemented in the Keras framework. Apart from a new implementation in the PyTorch library (Paszke et al., 2019), the novelties are the BERT-based encoder, the EUD prediction module, and COMBO-vectoriser, which extracts embeddings of UPOS and DEPREL from the last hidden layers of COMBO's tagging and dependency parsing modules, respectively. This section provides an overview of COMBO's modules. Implementation details are in Appendix A.
Local Feature Extractors Local feature extractors (see Figure 2) encode categorical features (i.e. words, parts of speech, morphological features, lemmata) into vectors. The feature bundle is configurable and limited by the requirements set for COMBO. For instance, if we train only a dependency parser, the following features can be input to COMBO: internal character-based word embeddings (CHAR), pre-trained word embeddings (WORD), and embeddings of lemmata (LEMMA), parts of speech (UPOS) and morphological features (UFEATS). If we train a morphosyntactic analyser (i.e. tagger, lemmatiser and parser), internal word embeddings (CHAR) and pre-trained word embeddings (WORD), if available, are input to COMBO. Words and lemmata are always encoded using character-based word embeddings (CHAR and LEMMA) estimated during system training with a dilated convolutional neural network (CNN) encoder (Yu and Koltun, 2016; Strubell et al., 2017).
Additionally, words can be represented using pre-trained word embeddings (WORD), e.g. fastText (Grave et al., 2018) or BERT (Devlin et al., 2019). The use of pre-trained embeddings is an optional functionality of the system configuration. COMBO freezes pre-trained embeddings (i.e. no fine-tuning) and uses their transformations, i.e. the embeddings are transformed by a single fully connected (FC) layer.
Part-of-speech and morphological embeddings (UPOS and UFEATS) are estimated during system training. Since a word can be attributed with more than one morphological feature, the embeddings of all possible features are estimated and averaged to build the final morphological representation.
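The averaging of morphological feature embeddings can be sketched in plain Python (the feature table and its 4-dimensional vectors are illustrative stand-ins, not COMBO's learned weights):

```python
# A minimal sketch of building a UFEATS representation by averaging
# the embeddings of a word's morphological feature values
# (the table and its dimensions are made up for illustration).
feat_emb = {
    "Number=Plur": [0.5, 0.25, -0.5, 1.0],
    "Case=Nom":    [0.5, -0.25, 0.5, 0.0],
}

def ufeats_vector(features, table):
    """Average the embeddings of all morphological features of a word."""
    vecs = [table[f] for f in features]
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

print(ufeats_vector(["Number=Plur", "Case=Nom"], feat_emb))
# -> [0.5, 0.0, 0.0, 0.5]
```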
Global Feature Encoder The encoder uses concatenations of local feature embeddings. A sequence of these vectors representing all the words in a sentence is processed by a bidirectional LSTM (Hochreiter and Schmidhuber, 1997;Graves and Schmidhuber, 2005). The network learns the context of each word and encodes its global (contextualised) features (see Figure 3). Global feature embeddings are input to the prediction modules.
Figure 3: Estimation of global feature vectors.
Tagging Module The tagger takes global feature vectors as input and predicts a universal part of speech (UPOS), a language-specific tag (XPOS), and morphological features (UFEATS) for each word. The tagger consists of two linear layers followed by a softmax. Morphological features form an unordered set of category-value pairs (e.g. Number=Plur). Morphological feature prediction is thus implemented as several classification problems. The value of each morphological category is predicted with a FC network. Different parts of speech are assigned different sets of morphological categories (e.g. a noun can be attributed with grammatical gender, but not with grammatical tense). The set of possible values is thus extended with the NA (not applicable) symbol, which allows the model to learn that a particular category is not a property of a word.
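A minimal sketch of one such per-category classifier, with NA in the label set (the logits and value inventory are made up for illustration, not model outputs):

```python
import math

# Sketch of per-category UFEATS classification: each morphological
# category is a separate classifier whose label set includes "NA".
def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_category(values, logits):
    """Pick the most probable value of one category (possibly NA)."""
    probs = softmax(logits)
    return values[probs.index(max(probs))]

# A verb carries Tense; for a noun the model can learn to output NA.
tense_values = ["Past", "Pres", "Fut", "NA"]
print(predict_category(tense_values, [0.2, 1.5, -0.3, 0.1]))  # -> Pres
```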
Lemmatisation Module The lemmatiser uses an approach similar to character-based word embedding estimation. A character embedding is concatenated with the global feature vector and transformed by a linear layer. The lemmatiser takes a sequence of such character representations and transforms it using a dilated CNN. The softmax function over the result produces a sequence of probabilities over the character vocabulary to form a lemma.

Parsing Module Two single FC layers transform global feature vectors into head and dependent embeddings (see Figure 4). Based on these representations, a dependency graph is defined as an adjacency matrix with columns and rows corresponding to heads and dependents, respectively. The elements of the adjacency matrix are dot products of all pairs of head and dependent embeddings (the dot product determines the certainty of an edge between two words). The softmax function applied to each row of the matrix predicts the adjacent head-dependent pairs. This approach, however, does not guarantee that the resulting adjacency matrix is a properly built dependency tree. The Chu-Liu-Edmonds algorithm (Chu and Liu, 1965; Edmonds, 1967) is thus applied in the last prediction step. The procedure of predicting words' grammatical functions (aka dependency labels) is shown in Figure 5. A dependent and its head are represented as vectors by two single FC layers. The dependent embedding is concatenated with the weighted average of (hypothetical) head embeddings. The weights are the values from the corresponding row of the adjacency matrix, estimated by the arc prediction module. The concatenated vector representations are then fed to a FC layer with the softmax activation function to predict dependency labels.
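The arc scoring step can be sketched in plain Python (the two-dimensional head and dependent embeddings are made up for illustration; a real run would additionally apply the Chu-Liu-Edmonds correction, which is omitted here):

```python
import math

# Sketch of arc scoring: rows are dependents, columns are candidate
# heads; each cell is a dot product of the two embeddings.
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def arc_probabilities(head_embs, dep_embs):
    """Row-wise softmax over head scores for every dependent."""
    return [softmax([dot(d, h) for h in head_embs]) for d in dep_embs]

heads = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # ROOT, w1, w2
deps  = [[2.0, 0.0], [0.0, 2.0]]               # w1, w2
probs = arc_probabilities(heads, deps)
# Greedy per-row head choice (COMBO runs Chu-Liu-Edmonds on top of
# this matrix to guarantee a well-formed tree).
print([row.index(max(row)) for row in probs])  # -> [0, 1]
```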
EUD Parsing Module Enhanced Universal Dependencies (EUD) are predicted similarly to dependency trees. The EUD parsing module is described in detail in Klimaszewski and Wróblewska (2021).

COMBO Performance
Data COMBO is evaluated on treebanks from the Universal Dependencies repository (Zeman et al., 2019), preserving the original splits into training, validation, and test sets. The treebanks representing distinctive language types are summarised in Table 4 in Appendix B.
By default, pre-trained 300-dimensional fastText embeddings (Grave et al., 2018) are used. We also test encoding data with pre-trained contextual word embeddings (the tested BERT models are listed in Table 5 in Appendix B). The UD datasets provide gold-standard tokenisation. If the BERT intra-tokeniser splits a word into sub-words, the last-layer embeddings are averaged to obtain a single vector representation of this word.

Qualitative Evaluation Table 1 shows the COMBO results of processing the selected UD treebanks.1 COMBO is compared with Stanza (Qi et al., 2020) and spaCy.2 The systems are evaluated with the standard metrics (Zeman et al., 2018): F1, UAS (unlabelled attachment score), LAS (labelled attachment score), MLAS (morphology-aware LAS) and BLEX (bi-lexical dependency score).3 COMBO and Stanza undeniably outrun spaCy models. COMBO using non-contextualised word embeddings is outperformed by Stanza in many language scenarios. However, COMBO supported with BERT-like word embeddings beats all other solutions and is currently the SOTA system for morphosyntactic analysis.

1 Check the prediction quality for other languages at: https://gitlab.clarin-pl.eu/syntactic-tools/combo/-/blob/master/docs/performance.md
2 https://spacy.io. We use the project template https://github.com/explosion/projects/tree/v3/pipelines/tagger_parser_ud. The lemmatiser is implemented as a standalone pipeline component in spaCy v3 and we do not test it.
3 http://universaldependencies.org/conll18/conll18_ud_eval.py (CoNLL 2018 evaluation script)
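The sub-word pooling described above (averaging the last-layer BERT vectors of a split token) can be sketched as follows; the vectors and the word-to-sub-word alignment are made up for illustration:

```python
# Sketch of sub-word pooling: when the BERT tokeniser splits a gold
# token into sub-words, their last-layer vectors are averaged into
# a single word vector.
def pool_subwords(subword_vecs, word_spans):
    """Average the sub-word vectors of each word; spans are (start, end)."""
    pooled = []
    for start, end in word_spans:
        group = subword_vecs[start:end]
        dim = len(group[0])
        pooled.append([sum(v[i] for v in group) / len(group)
                       for i in range(dim)])
    return pooled

# "playing" -> "play" + "##ing": two sub-word vectors become one.
vecs = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
spans = [(0, 2), (2, 3)]
print(pool_subwords(vecs, spans))  # -> [[0.5, 0.5], [2.0, 2.0]]
```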
Regarding lemmatisation, Stanza has an advantage over COMBO in most tested languages. This is probably due to the fact that the Stanza lemmatiser is enhanced with a key-value dictionary, whilst the COMBO lemmatiser is fully neural. It is not surprising that a dictionary helps in the lemmatisation of isolating languages (English). However, the dictionary approach is also helpful for agglutinative languages (Finnish, Korean, Basque) and for Arabic, but not for Polish (a fusional language). Comparing COMBO models estimated with and without BERT embeddings, we note that the BERT boost only slightly increases the quality of lemma prediction in the tested fusional and agglutinative languages.
For a complete insight into the prediction quality, we evaluate individual UPOS and UDEPREL predictions in English (isolating), Korean (agglutinative) and Polish (fusional). Result visualisations are in Appendix C.
COMBO took part in the IWPT 2021 Shared Task on Parsing into Enhanced Universal Dependencies (Bouma et al., 2021), where it ranked 4th. In addition to the ELAS and EULAS metrics, the third evaluation metric was LAS. COMBO ranked 2nd, achieving an average LAS of 87.84%. This score is even higher than the average LAS of 86.64% in Table 1, which corroborates that our evaluation is representative, reliable, and fair.
Downstream Evaluation According to the results in Table 1, COMBO predicts high-quality dependency trees and parts of speech. We therefore conduct a preliminary evaluation of morphosyntactically informed word embeddings in the textual entailment task (aka natural language inference, NLI) in English (Bentivogli et al., 2016) and Polish (Wróblewska and Krasnowska-Kieraś, 2017). We compare the quality of entailment classifiers with two FC layers trained on max/mean-pooled BERT embeddings and on sentence representations estimated by a network with two transformer layers which is given morphosyntactically informed word embeddings (i.e. BERT-based word embeddings concatenated with UPOS embeddings, DEPREL embeddings, and BERT-based embeddings of the head).

Speed Evaluation Regarding processing speed (see Tables 2 and 3), spaCy is the SOTA system, and the other two are not even close to its processing time. Considering COMBO and Stanza, whose prediction quality is significantly better than spaCy's, COMBO is 1.5 times slower (2 times slower with BERT) than Stanza in prediction, but it is definitely faster in training. The reason for the large discrepancies in training times is the different architecture of these two systems. Stanza is a pipeline-based system, i.e. its modules are trained one after the other. COMBO is an end-to-end system, i.e. its modules are jointly trained, and the training process is therefore faster.

To download a model for another language, select its name from the list of pre-trained models. If we only train a dependency parser, the default setup should be changed with the configuration flags: --features with a list of input features and --targets with a list of prediction targets.
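The morphosyntactically informed word representation used in the downstream evaluation is a plain concatenation of component vectors, which can be sketched as follows (the tiny vectors are illustrative stand-ins, not real embeddings):

```python
# Sketch: a morphosyntactically informed word representation as the
# concatenation of a contextual embedding with UPOS, DEPREL, and
# head embeddings.
def informed_embedding(bert_vec, upos_vec, deprel_vec, head_vec):
    """Concatenate all component vectors into one representation."""
    return bert_vec + upos_vec + deprel_vec + head_vec  # list concat

w = informed_embedding([0.1, 0.2], [1.0], [0.5], [0.3, 0.4])
print(len(w))  # -> 6
```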

Conclusion
We have presented COMBO, the SOTA system for morphosyntactic analysis, i.e. part-of-speech tagging, morphological analysis, lemmatisation, and (enhanced) dependency parsing. COMBO is a language-agnostic and format-independent system (i.e. it supports the CoNLL-U and CoNLL-X formats). Its implementation as a Python package allows effortless installation and incorporation into any Python code, or usage in the CLI mode. In the Python mode, COMBO supports automated download of pre-trained models for multiple languages and outputs not only categorical morphosyntactic features, but also their embeddings. In the CLI mode, pre-trained models can be manually downloaded or trained from scratch. The system training is fully configurable with respect to the range of input features and output predictions, and the method of encoding input data. Last but not least, COMBO maintains a balance between efficiency and quality. Admittedly, it is not as fast as spaCy, but it is much more efficient than Stanza in terms of training time. Tested on the selected UD treebanks, COMBO morphosyntactic models enhanced with BERT embeddings outperform spaCy and Stanza models.

Acknowledgments
The authors would like to thank Piotr Rybak for his design and explanations of the architecture of COMBO's forerunner.

A.1 Activation Functions

FC and CNN layers use hyperbolic tangent and rectified linear unit (Nair and Hinton, 2010) activation functions, respectively.

A.2 Regularisation
The dropout technique for variational RNNs (Gal and Ghahramani, 2016) with a rate of 0.33 is applied to the local feature embeddings and on top of the stacked biLSTM estimating global feature embeddings. The same dropout, for output and recurrent values, is used within each biLSTM layer. The FC layers use standard dropout (Srivastava et al., 2014) with a rate of 0.25. Moreover, the biLSTM and convolutional layers use L2 regularisation with a rate of 1×10^−6, and the trainable embeddings use L2 regularisation with a rate of 1×10^−5.

A.3 Training
The cross-entropy loss is used for all parts of the system. The final loss is the weighted sum of the task losses with the following weights:
• 0.05 for predicting UPOS and LEMMA,
• 0.2 for predicting UFEATS and (enh)HEAD,
• 0.8 for predicting (enh)DEPREL.
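The weighted multi-task loss above can be sketched as follows (the per-task loss values passed in are illustrative, not real training values):

```python
# Sketch of the weighted multi-task loss with the weights listed in
# Appendix A.3 (one weight per prediction target).
TASK_WEIGHTS = {
    "UPOS": 0.05, "LEMMA": 0.05,
    "UFEATS": 0.2, "HEAD": 0.2,
    "DEPREL": 0.8,
}

def total_loss(task_losses):
    """Weighted sum of the per-task cross-entropy losses."""
    return sum(TASK_WEIGHTS[task] * loss
               for task, loss in task_losses.items())

# With unit losses the total equals the sum of the weights (1.3).
print(total_loss({task: 1.0 for task in TASK_WEIGHTS}))
```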
The whole system is optimised with ADAM (Kingma and Ba, 2015) with a learning rate of 0.002 and β1 = β2 = 0.9. The model is trained for a maximum of 400 epochs, and the learning rate is reduced twice by a factor of two when the validation score reaches a plateau.

B External Data Summary
Tables 4 and 5 list the UD dependency treebanks and BERT models used in the evaluation experiments presented in Section 3.

C Evaluation of UPOS and UDEPREL
The comparison of the universal parts of speech predicted by the tested systems in English, Korean and Polish data is shown in the charts in Figures 6, 7 and 8, respectively. The comparison of the quality of the predicted universal dependency types in English, Korean and Polish data is presented in Figures 9, 10 and 11, respectively.