A Modest Pareto Optimisation Analysis of Dependency Parsers in 2021

We evaluate three leading dependency parser systems from different paradigms on a small yet diverse subset of languages in terms of their accuracy-efficiency Pareto front. As we are interested in efficiency, we evaluate core parsers without pretrained language models (as these are typically huge networks and would constitute most of the compute time) or other augmentations that can be transversally applied to any of them. Biaffine parsing emerges as a well-balanced default choice, with sequence-labelling parsing being preferable if inference speed (but not training energy cost) is the priority.


Introduction
The inefficiency of modern NLP systems has recently come under scrutiny, especially regarding their large energy consumption (Strubell et al., 2019). This hasn't started a revolution, but there is some NLP work where efficiency is considered. Zhang and Duh (2020) studied different settings for neural machine translation systems, evaluating not only accuracy but also certain costs such as inference time, training time, and model size. Zhou et al. (2021) analysed the fine-tuning and inference time for pretrained LMs, and estimated the cost of pretraining. Jacobsen et al. (2021) presented a Pareto optimisation analysis for POS taggers, considering accuracy and model size.
In parsing in particular, Strzyz et al. (2019) evaluated dependency parsing as sequence labelling specifically to increase inference efficiency and also undertook a Pareto optimisation analysis. Others used model compression via distillation to increase the inference speed of neural parsers, with a mixed bag of results (Dehouck et al., 2020; Anderson and Gómez-Rodríguez, 2020a). Dehouck et al. (2020) also took into consideration the training energy costs of distilling models, which highlighted the high energy cost of this technique.
We present a Pareto optimisation analysis on modern dependency parsing systems. We cover three systems which are broadly representative of current approaches. We analyse their efficiency with respect to inference speed and also their training cost, measured in energy consumption.
Contribution: A simple, modest analysis on the merits of different parser systems that cover three current paradigms. Our goal is not to provide surprising results, but a realistic snapshot of the current state of affairs of a representative sample of modern parsing systems on linguistically diverse data. This analysis runs the systems in a consistent way with respect to software, hardware, and network settings. We also offer a brief overview of self-reported performance on PTB for systems that have a published speed. We add to this measurements for a subset of these systems which we ran locally for a more consistent comparison, i.e. something of a reproducibility effort.
Disclaimer: We make a practical comparison for practitioners, so we focus on publicly available systems on typical hardware that doesn't require a huge budget. We are not making general claims that technique X is always more efficient than technique Y in the abstract, or that this will hold on any hardware. Also, the extent to which an implementation has been engineered will impact performance, so we have referenced the original repositories used. 1

PTB performance
For historical reasons, it is common practice for parsers to report performance results on the English Penn Treebank (PTB) (Marcus and Marcinkiewicz, 1993). While such results at best provide a partial picture on a single language, they are by far the most comprehensive source of results provided in the literature under a consistent context (at least in terms of data and splits, although not hardware), so they are useful to see high-level trends and as a starting point to choose parsers for our experiment.
In Table 1 we report the performance of modern parsing systems for which speeds have been reported. We couldn't find a reported speed for Clark et al. (2018), which currently has the highest reported performance on PTB (UAS 96.61 and LAS 95.02) when not using BERT. However, its main contribution is semi-supervised augmentations that could be utilised by any parsing system, with its core parser being the Biaffine parser. Zhou and Zhao (2019)'s system leverages constituency and dependency parsing; when not using training data with both constituency and dependency annotations (often not available), the system achieves UAS 95.82 and LAS 94.43 (i.e. very similar in LAS to the other top-performing systems). Zhang et al. (2020a) use a Biaffine parser but with a moderate beam search, which is less efficient than the original and results in only a small increase in performance. Ji et al. (2019) use graph neural networks to learn enriched high-order information from partial parses. It again only gains small increases over Biaffine, but is more computationally complex and the code is not available.
We report results for UUParser of Smith et al. (2018) that we ran locally (refreshingly, the original paper didn't use PTB). While the results show a reasonable speed-accuracy trade-off, we opted not to use it for the current analysis as the original code is implemented in DyNet, which doesn't properly support CUDA and is a different framework from that of the other parsers we chose. Based on this, we opted to use the basic Biaffine parser to represent graph-based parsers, the Pointer-LR network as the representative of transition-based algorithms, 2 and the sequence-labelling parser to represent SL systems. They all have the added benefit of working under the same software and having code available.
Note that, as we place emphasis on efficiency, we focus on reasonably bare-bones versions of the parsers. The impact of pretrained language models, or other augmentations that are transversal to the parsing system, is outside the scope of this paper.

Parsers
The parsers use one of three paradigms, broadly speaking: one is a transition-based parser, one is a sequence-labelling parser, and the last is a graph-based parser. For space reasons, we only very briefly outline them here, but give more details in Appendix A.
Left-to-right pointer network (L2R). One of the current top-performing parsers on PTB, it uses a left-to-right transition-based algorithm that builds a number of attachments equal to sentence length using a pointer network (Ma et al., 2018; Fernández-González and Gómez-Rodríguez, 2019). 3
Deep biaffine (BIAFFINE) (Dozat and Manning, 2017) is an edge-factored graph-based parser that produces a matrix of scores giving a probability distribution over arcs, to which the Chu-Liu/Edmonds algorithm (Chu and Liu, 1965; Edmonds, 1967) is then applied to obtain a tree.
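As a concrete illustration of the edge-factored scoring behind BIAFFINE, the sketch below scores every head-dependent pair with a biaffine transformation and normalises each row into a distribution over candidate heads. This is a minimal NumPy sketch with made-up dimensions, not the parser's actual implementation (which adds MLP projections, a label scorer, and the Chu-Liu/Edmonds tree constraint):

```python
import numpy as np

def biaffine_scores(h_head, h_dep, W, b):
    """Toy edge-factored scoring in the spirit of deep biaffine parsing.

    h_head, h_dep: (n, d) representations of tokens as heads/dependents.
    W: (d, d) weight matrix; b: (d,) head-side bias vector.
    Returns an (n, n) matrix where entry [i, j] scores token j
    being the head of token i.
    """
    # Biaffine term plus a linear (head-side) term, broadcast per column.
    return h_dep @ W @ h_head.T + h_head @ b

def head_probabilities(scores):
    """Row-wise softmax: a distribution over candidate heads per token."""
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n, d = 5, 8  # illustrative sentence length and hidden size
H_head, H_dep = rng.normal(size=(n, d)), rng.normal(size=(n, d))
W, b = rng.normal(size=(d, d)), rng.normal(size=d)

P = head_probabilities(biaffine_scores(H_head, H_dep, W, b))
```

Taking the row-wise argmax of `P` does not guarantee a well-formed tree, which is why the maximum-spanning-tree step is applied on top.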
Sequence labelling parser (SEQLAB) encodes trees as a sequence of labels, so that a direct one-to-one prediction can be made for each token in a sentence (Spoustová and Spousta, 2010; Li et al., 2018b; Strzyz et al., 2019). 4 We implement it using the Biaffine system described above (for uniformity), editing it to be a sequence-labelling system.
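To illustrate the tree-as-labels idea, the sketch below encodes a dependency tree as one label per token and decodes it back. For simplicity it uses a relative head-offset encoding rather than the bracketing encoding actually used in this paper (described in Appendix A), so it should be read as a toy illustration of the paradigm only:

```python
def encode_heads(heads):
    """Encode a dependency tree as one label per token, using a simple
    relative head-offset scheme (head position minus token position).
    heads[i] is the 1-based head of token i+1, with 0 for the root.
    Illustration only; the paper uses the bracketing encoding of
    Strzyz et al. (2019)."""
    return [h - (i + 1) for i, h in enumerate(heads)]

def decode_heads(labels):
    """Invert the encoding back to 1-based head indices."""
    return [i + 1 + off for i, off in enumerate(labels)]

# A small round-trip check on a toy four-token tree (token 2 is root).
heads = [2, 0, 2, 3]
labels = encode_heads(heads)  # one discrete label per token
assert decode_heads(labels) == heads
```

Because each token gets exactly one label, parsing reduces to tagging, which is what makes SEQLAB's inference so fast.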

Data
In our choice of treebanks, we balance three factors: the need to use a small number of treebanks (as our detailed Pareto analysis implies training a large number of models per treebank), linguistic diversity, and treebank quality. This leads us to choose 4 high-quality (manually annotated or corrected, and relatively large) treebanks covering 3 different language families and 4 sub-families: UD-Hindi-HDTB, UD-Polish-PDB, UD-Korean-Kaist, and the Chinese Penn Treebank. More details of each treebank, justifying their diversity and adequacy for the analysis, are given in Appendix C.

Methodology
We vary the size of the BiLSTM component of the networks via their number of layers and nodes. Each parser has randomly-initialised character embeddings and pretrained word embeddings as its only inputs. We use pretrained FastText embeddings (Grave et al., 2018), except for Chinese, where the FastText embeddings are in the traditional script, so we use the embeddings from Li et al. (2018a). 5 The embeddings are reduced to 100 dimensions using PCA. The structures of the networks are very similar. The L2R system uses a biaffine transformation to score the transitions at each step, similar to the BIAFFINE parser, and we use the same sizes for the layers. The SEQLAB system is altered from the BIAFFINE implementation and is exactly the same except that the layers needed for the biaffine transformation are replaced by two MLPs which predict the labels for each token. The only major difference between the networks is that L2R uses a CNN to create the character embeddings while the other two use BiLSTMs. We didn't change this in order to avoid modifications to the systems. The network hyperparameters are shown in Table 2 in Appendix B. Models were trained on GPU, but we report the energy used by both the GPU and CPU.
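The dimensionality reduction of the pretrained embeddings can be sketched as below: centre the embedding matrix and project onto its top principal directions via SVD. This is a minimal sketch of the preprocessing step under our own assumptions (e.g. 300-dimensional input vectors, as FastText typically provides); the exact implementation used may differ:

```python
import numpy as np

def pca_reduce(E, k=100):
    """Reduce pretrained embeddings E of shape (n_words, dim) to k
    dimensions with PCA: centre, then project onto the top-k principal
    axes obtained from the SVD of the centred matrix."""
    mu = E.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by variance.
    _, _, Vt = np.linalg.svd(E - mu, full_matrices=False)
    return (E - mu) @ Vt[:k].T

rng = np.random.default_rng(0)
E = rng.normal(size=(500, 300))  # stand-in for 300-dim FastText vectors
E100 = pca_reduce(E, k=100)      # shape (500, 100)
```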
We could have altered other aspects of the networks, but the main computational cost comes from the BiLSTM layers. The other main contender would be the embedding layers. For example, we could have altered the size of the character BiLSTM/CNN, but previous experiments show that it has a limited impact on accuracy (Smith et al., 2018; Anderson and Gómez-Rodríguez, 2020b).
We measured the speed of each system on each treebank by running it 5 times using a single CPU core (both for speeds measured on GPU and on CPU), so that we get a reasonably accurate measure of the speed for each treebank. We then report macro-averaged speeds across treebanks.
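The measurement protocol above can be sketched as follows. The `parse_fn` argument is a hypothetical stand-in for a system's batched parse call, not any parser's real API:

```python
import statistics
import time

def sentences_per_second(parse_fn, sentences, runs=5):
    """Time a parser over several runs and report the mean speed in
    sentences per second, mirroring the 5-run protocol in the text."""
    speeds = []
    for _ in range(runs):
        start = time.perf_counter()
        parse_fn(sentences)
        elapsed = time.perf_counter() - start
        speeds.append(len(sentences) / elapsed)
    return statistics.mean(speeds)

def macro_average(per_treebank_speeds):
    """Macro-average across treebanks: each treebank counts equally,
    regardless of its size."""
    return statistics.mean(per_treebank_speeds)

# Toy demonstration with a trivial "parser" that just sorts tokens.
toy_treebank = [["a", "b", "c"]] * 200
speed = sentences_per_second(lambda batch: [sorted(s) for s in batch],
                             toy_treebank)
```

Macro-averaging is what keeps the large Korean treebank from dominating the reported speeds.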
We use the energyusage package for measuring training energy. 6 It measures the power usage of the GPU and CPU while a process is running (having taken a measure of the background usage). We minimised other use of the system when training these models to obtain accurate measurements, but they aren't overly precise. This isn't a major issue, as the measurements are taken over long periods of time, so unless there were massive fluctuations when training a given model, comparison is fine. We use joules (or kJ and MJ) as they are the SI unit for energy (BIPM, 2019) and, unlike carbon emissions, they are independent of external factors like regional electricity generation grids. Hardware: Intel Core i7-7700 and Nvidia GeForce GTX 1080. Software: Python 3.7.0, PyTorch 1.0.0, and CUDA 8.0.

Pareto fronts: speed
Figure 1 shows LAS versus parsing speed for the development data (we also present the same for the test data in Figure 7 in the Appendix, which echoes the visualisation seen here). The individual Pareto front for each parser is also shown (light grey, dashed). As expected, models with larger networks are more accurate but slower. More interestingly, the overall Pareto front is exclusively constructed of BIAFFINE and SEQLAB systems. While L2R does achieve accuracy similar to BIAFFINE, it is considerably slower. SEQLAB is the fastest option by a clear margin (especially for smaller networks on CPU). So the practical advice to draw from this aspect of the Pareto optimisation would be to use BIAFFINE if accuracy is the main concern, or SEQLAB if inference time is important. Figure 2 shows LAS against the average energy (across treebanks) consumed during training (in training, we always use the GPU). There is no clear link between the energy consumed and the accuracy of a system. However, this visualisation highlights that SEQLAB is nowhere near optimal with respect to training efficiency.
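The energy accounting behind measurements like those above reduces to integrating sampled power draw over the training time, net of the background draw. The sketch below shows that accounting with invented numbers; it is not the energyusage package's actual implementation:

```python
def energy_joules(power_samples_w, interval_s, background_w=0.0):
    """Integrate sampled power draw (watts) over time to get energy in
    joules (1 W = 1 J/s), subtracting a measured background draw.
    Uses the rectangle rule over fixed-interval samples."""
    net = [max(p - background_w, 0.0) for p in power_samples_w]
    return sum(net) * interval_s

# 2 hours of training sampled every second at a steady 150 W combined
# GPU + CPU draw over a 30 W idle background:
# (150 - 30) W * 7200 s = 864 kJ.
samples = [150.0] * 7200
total_j = energy_joules(samples, interval_s=1.0, background_w=30.0)
```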

Pareto fronts: training energy
The amount of energy consumed during training is basically dependent on the time it takes each system to converge, as can be seen in Figure 3. In this figure, we show individual models (i.e. not averaged over treebanks). For BIAFFINE and SEQLAB, the relation between energy and training time is very clearly linear, suggesting that there is nothing intrinsically more energy-consuming about either system beyond convergence time. For L2R, this relation seems to hold broadly, but is less clear. It appears that L2R is more sensitive to the nature of the data, which we expand on in Appendix F.
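A quick way to check such a linear relation is an ordinary least-squares fit of energy against training time. The sketch below uses invented, perfectly linear toy numbers purely to demonstrate the check, not measurements from our experiments:

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit y ≈ a*x + b, returning (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

hours = [1.0, 2.0, 4.0, 8.0]                    # toy convergence times
kilojoules = [510.0, 1020.0, 2040.0, 4080.0]    # toy energy readings
slope, intercept = linear_fit(hours, kilojoules)
# A near-zero intercept and stable slope across models would indicate
# that energy is driven by convergence time alone.
```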

Limitations of analysis
While our analysis is not ground-breaking or particularly expansive in nature, we do think it is useful in practice and acts as a mini-review of the current state of affairs in dependency parsing. However, there are a number of limitations to this study. First, we only vary the parameters associated with the BiLSTMs. We feel this is fairly justified, but it is feasible that varying the other parameters instead could affect each parsing system differently, even if that is fairly unlikely. While we do look at a set of languages with diverse linguistic features, it is still a fairly small sample. We were somewhat limited by having to train many models and felt it would be better to focus on a sample of diverse languages with quality data than on many languages with fewer model settings. Of course, this analysis could be extended to more languages, but we expect this would further corroborate the results presented here. Also, by using a small set of treebanks, we don't cover a wide array of domains (the data is mainly fiction and news).
Another potential limitation is only using one dependency annotation scheme (the scheme used for CTB was a precursor to UD), but in the absence of a theoretical reason that the parsers would behave differently using a different scheme (e.g. surface-syntactic UD (SUD) treebanks containing much more non-projectivity (Gerdes et al., 2018)), this feels like a light limitation.
A slightly more pressing limitation is the absence of a feature analysis, because certain systems could potentially benefit from different features. Work has been presented in this direction and has shown that predicted POS tags aren't wonderfully useful (Smith et al., 2018; Anderson and Gómez-Rodríguez, 2020b; Zhang et al., 2020b). However, these analyses didn't include SEQLAB parsers at all, and the transition-based system used was a lower-performing system, UUParser. So it is feasible that L2R and SEQLAB would benefit from predicted POS tags. That can be left open for the future.
Another limitation is that we only trained one model for each BiLSTM setting. While training a model for each treebank somewhat offsets this, it is still possible that with different initialisations these parsers would behave slightly differently. However, this is unlikely to cause material differences in performance and, as mentioned, it is quite strongly offset by training on varying treebanks.
And finally, we focused on parsers trained on fairly large amounts of annotated data. We leave the analysis of different parsing systems in a low-resource setting for others, but we note that when training on very little data, training costs aren't much of a concern, and for truly low-resource languages the data parsed in production is also going to be scarce, so inference speed won't be the bottleneck.

Conclusion
We have presented a simple Pareto optimisation analysis for a representative sample of modern dependency parsers. We evaluated efficiency in two ways: the trade-off between accuracy and parsing speed, and the trade-off between accuracy and training energy consumption. The BIAFFINE and SEQLAB systems occupied the speed Pareto front, with the former being slower and more accurate and the latter being faster and less accurate. We didn't observe any real trade-off with regard to training energy and performance, but it was clear that SEQLAB is not particularly efficient in this regard. Typically, training energy varied based on how long a model took to converge, with L2R being somewhat sensitive to the different treebanks. Overall, for most scenarios, BIAFFINE emerged as a well-balanced practical solution. For the sake of candour, we offer a brief discussion of the limitations of this analysis above.

Appendix A Parsers
Left-to-right pointer network (L2R) is a parser which uses a left-to-right transition-based algorithm that builds a number of attachments equal to the length of a given sentence using a pointer network. 7 It is one of the current top-performing parsers. We use the implementation as is, except we make moderate alterations to overcome hardcoded filepaths and the like. Otherwise, the only hyperparameters we change are the number of encoder layers and the number of nodes in the encoder and decoder layers.
Sequence labelling parser (SEQLAB) is a parsing system that first encodes trees as a set of labels, so that a direct one-to-one prediction can be made for each token in a sentence (Spoustová and Spousta, 2010; Li et al., 2018b; Strzyz et al., 2019). 8 We use the original bracketing encoding from Strzyz et al. (2019) as it doesn't require UPOS tags to decode (as the other leading encoding does), it performs closely to a more recent bracketing encoding that covers more non-projectivity (Strzyz et al., 2020), and the latter encoding wasn't publicly available when this work commenced. It casts a tree as a series of tags made up of left and right brackets and forward and backward slashes, which encode the incoming and outgoing arcs for each respective node. The encoding for each token is based on the edges associated with the preceding tokens and the direction of those edges. We use the biaffine implementation described below and edit it to be a simple sequence-labelling system, i.e. an embedding layer, followed by a number of BiLSTM layers, and two MLPs: one for predicting the bracket tags and one for predicting the edge labels. We use the same hyperparameters as used for the biaffine parser.
7 https://github.com/danifg/SyntacticPointer
8 We use refactored encoding/decoding functions from https://github.com/mstrise/dep2label.
Deep biaffine (BIAFFINE) is a graph-based parser that creates two representations of each token from the hidden representations of BiLSTMs, hypothesised to represent each token as a dependent and as a head (Dozat and Manning, 2017). 9 An affine transformation is applied to the head representation, and this and the dependent representation are then combined via a second affine transformation (hence biaffine) to give a matrix of scores, which gives a probability distribution for each node representing the probability that any other node is that node's head. A well-formed tree is then enforced using the Chu-Liu/Edmonds algorithm (Chu and Liu, 1965; Edmonds, 1967). The edge labels are then predicted based on the predicted edges. We use the standard hyperparameters for this system, except where we match them to better correspond to the L2R parser, and then only alter the hyperparameters associated with the BiLSTMs.

Appendix B Network hyperparameters
The network hyperparameters are shown in Table 2.

Appendix C Treebanks
The treebanks cover 3 language families and 4 sub-families and represent different syntactic systems, covering analytic, fusional, and agglutinative languages, all written in different scripts. We offer a brief description of the treebanks used and some of the salient features of their respective languages. The treebanks were chosen to represent varying syntactic features, but also for their high quality, being either manually annotated or manually corrected. We also chose relatively large treebanks. The statistics for each treebank are shown in Table 3.
UD Hindi-HDTB (Hindi) is a UD treebank for Hindi based on manually annotated news data (Palmer et al., 2009; Bhat et al., 2017). Hindi is a lightly fusional language with some degree of verbal inflection and noun declension, but it also makes extensive use of postpositions (McGregor, 1977). It is a split-ergative language, meaning that in certain cases it uses a nominative-accusative alignment but in others an ergative-absolutive alignment, where the subject of an intransitive verb behaves like the object of a transitive one (Comrie, 1978). It also exhibits tripartite behaviour in certain clauses, where the subject of intransitive verbs, the object of transitive verbs, and the subject of transitive verbs all have different case markings (Comrie, 1978). It is an SOV language, but it has a fairly free word order (Snell and Weightman, 1989). It is Indo-Iranian and is written in the Devanagari script.
UD Polish-PDB (Polish) is a UD treebank manually annotated on fiction, non-fiction, and news data (Wróblewska, 2018). Polish is a highly fusional language with a high degree of verbal inflection (Feldstein, 2001) and 7 case markings (Wiese, 2011). It is a null-subject language (Cognola and Casalicchio, 2018) with a nominal SVO order but relatively free word order (Siewierska, 1993). Like most Slavic languages it doesn't make use of articles (Bielec, 1998), but it does have a complex system of numerals and quantifiers that results in agreement mismatches (Klockmann, 2012). It is a Balto-Slavic language written in the Latin script.
UD Korean-Kaist (Korean) is a large treebank generated from a constituency treebank which was semi-automatically annotated with manual corrections, based on academic, fiction, and news data (Choi et al., 1994; Chun et al., 2018). Korean is a strongly suffixing agglutinative language (Ramstedt, 1968; Sohn, 1999). This results in a large number of cases and a high degree of verbal inflection (Chang, 1996; Song, 1988; Lee and Ramsey, 2000). It is technically an SOV-ordered language, but it has a highly flexible word order (Ramstedt, 1968; Sohn, 1999). Korean also uses honorifics and speech levels, the former encoding the social relationship between the speaker and the referents in a discussion, and the latter that between the speaker and the person or people being spoken to (Brown, 2015). It is a Koreanic language written in the Hangul script.
Chinese Penn Treebank (Chinese) is a large manually annotated treebank for Mandarin based on news data (Xue et al., 2002, 2005). Chinese is an analytic, isolating language with an SVO-dominant word order and is a pro-drop language (Li and Thompson, 1981). It has no grammatical tense markers, so it relies on context or temporal expressions, but aspect is expressed via the use of particles (Liu, 2015). Classifiers and measure words, particles that appear between a qualifier and its noun, must be used when a noun is preceded by a number, a demonstrative pronoun, or certain quantifiers (Her and Hsieh, 2010). Chinese is said to be a verb-stacking language, where more than one verb or verb phrase is stacked together in the same clause, but there is some disagreement over whether the way verbs are combined actually constitutes verb stacking (Li and Thompson, 1981; Paul, 2008). It is a Sino-Tibetan language written in simplified Hanzi. We re-split the data because the standard split has tiny development and test sets. The resulting sizes are shown in Table 3.

Figure 6 shows the average training time (across treebanks) for each parser against the BiLSTM structure. There is a clear linear relation as the complexity of the BiLSTM increases (considering a BiLSTM with 2 layers and 1000 nodes to be less complex than one with 3 layers and 400 nodes). We also show a similar plot in Figure 5, but against the total number of parameters in the network, which shows a similar but less clear trend.

Table 4 shows the full LAS scores for each system on each treebank with different BiLSTM configurations on the development data. Similarly, Table 5 shows the results for the test data. Figure 7 shows LAS against inference speed for the test data and echoes what was observed for the development data in Figure 1. Table 6 shows the total training energy cost, total training time, and the parameters for each parser and for each BiLSTM configuration.