Error Analysis and the Role of Morphology

We evaluate two common conjectures in error analysis of NLP models: (i) Morphology is predictive of errors; and (ii) the importance of morphology increases with the morphological complexity of a language. We show across four different tasks and up to 57 languages that of these conjectures, somewhat surprisingly, only (i) is true. Using morphological features does improve error prediction across tasks; however, this effect is less pronounced with morphologically complex languages. We speculate this is because morphology is more discriminative in morphologically simple languages. Across all four tasks, case and gender are the morphological features most predictive of error.


Introduction
In error analysis, we often blame morphology (Nivre, 2007; Bender, 2009), i.e., the productive inflection and derivation of new word forms. Morphology has been argued to be a major source of error in syntactic parsing (Tsarfaty et al., 2020), semantic parsing (Şahin and Steedman, 2018), machine translation (Irvine et al., 2013; Burlot and Yvon, 2017) and a range of other tasks, in particular in morphologically complex languages (Bender, 2009; Søgaard et al., 2018; Tsarfaty et al., 2020). This paper presents a large-scale study showing that morphology is, as commonly conjectured, an important source of error across tasks, but somewhat surprisingly, that morphology is less predictive of errors in morphologically complex languages.
English is a morphologically simple language, showing very limited inflection and expressing most concepts through syntactic structure instead; it is also the most-represented language at major natural language processing (NLP) venues and the one with the largest amount of language resources available (Bender, 2011; Joshi et al., 2020). This makes it easy to ignore morphology when designing model architectures. As a consequence, we frequently observe that performance of NLP systems on morphologically more complex languages lags behind that for English (e.g. Czarnowska et al., 2019; Tsarfaty et al., 2020).
Figure 1: Overview of our methodology: We map each token to a set of morphological features and, based on this representation, predict whether some NLP system (e.g., a dependency parser) was correct or made an error on that token.
Complex morphology leads to the occurrence of rare inflected word forms. Polish nouns, for example, can inflect for number and seven different cases; this makes it less likely that all of these inflected word forms appear in the training data for our NLP models. Consequently, a model that correctly handles imię 'name' (NOM.SG) might not have seen the less frequent form imionami (INST.PL), potentially resulting in errors. If the model has generally seen fewer words in instrumental case, this can lead to systematic errors on this class of inflections.
Nowadays, many NLP systems use statistically learned subword units such as byte-pair encodings (Sennrich et al., 2016) or use characters as input representations, which could allow a system to generalize to individual affixes. However, in practice, these approaches are often found to be insufficient at capturing morphological structure (Vania and Lopez, 2017; Bostrom and Durrett, 2020; Klein and Tsarfaty, 2020).
Contributions In this study, we revisit two common conjectures about the role of morphology that are made in error analysis of NLP systems. Specifically, we ask (i) whether morphology is generally predictive of errors across tasks and languages; and (ii) whether the extent to which morphology is predictive depends on the morphological complexity of the language in question. These conjectures are common throughout the literature (Nivre, 2007; Bender, 2009; Manning, 2011).
Looking at data from four shared tasks on semantic role labeling (Hajič et al., 2009), dependency parsing (Zeman et al., 2018), verbal multi-word expression identification (Ramisch et al., 2018), and quality estimation (Fonseca et al., 2019), we map each token in the input data to a set of morphological features. Using only this feature set, and without using any orthographic or distributional representation of the input, we train random forest classifiers to predict whether a system has made an error on an input token. Figure 1 illustrates this approach.
Using this methodology, we find that, somewhat surprisingly, our results only support the first conjecture. In other words, (i) while morphology is helpful in predicting such errors, (ii) the degree to which morphology helps does not increase with the morphological complexity of the language. Moreover, we find and discuss task-specific differences between which morphological features are predictive of error. In general, part of speech, case and gender are most predictive of error.
The code for obtaining the datasets and running the experiments is made publicly available. 1

Background
Morphology is frequently identified as a source of error during qualitative evaluations of NLP systems. Honnibal et al. (2010) observe that inflectional variants cause problems for statistical CCG tagging due to training data sparseness, and explicit morphological analysis helps, even for English. For dependency parsing, Seeker and Kuhn (2013) identify case syncretism as a source of error propagation in data from Czech, German, and Hungarian. Tsarfaty et al. (2020) give a broader overview of the challenges that rich morphological structure presents for dependency parsing, and Şahin and Steedman (2018) discuss the importance of morphology in semantic parsing.
Many observations of the effect of morphology come from evaluating machine translation (MT) systems. Federico et al. (2014) show that morphological errors are common for MT into Arabic and Russian and strongly affect human quality judgement. For English-Romanian MT, Peter et al. (2016) find that tense and verb form on the target side are a common source of error. Klubička et al. (2017) find that errors in English-Croatian MT are more common for some morphological categories, such as case. In a similar vein, Burlot and Yvon (2017) evaluate morphological competence of MT systems using contrast pairs and show that systems have different strengths and weaknesses for different morphological phenomena. Beyond parsing and MT, morphology has also been shown to present a challenge for tasks such as Arabic handwriting recognition (Habash and Roth, 2011) or Russian anaphora resolution (Toldova et al., 2016).
Most of the studies cited above predate contextual embedding models such as BERT (Devlin et al., 2019), which are now considered state-of-the-art for many NLP tasks. So far, few studies have explicitly analysed BERT with regard to morphology. Edmiston (2020) analyses morphological content in BERT-style models for five languages and finds that "[morphological] ambiguity is negatively correlated with performance on classification, and to a significant degree in many cases", suggesting that morphology is still a significant source of error in these models. We go significantly beyond this work by studying a much larger set of morphological variables, across several architectures and tasks, and across up to 57 languages.

Datasets
We collect datasets from shared tasks that (i) publish system outputs along with their gold annotations, (ii) span a variety of languages, and (iii) cover different NLP tasks. Based on these criteria, we pick datasets from the following shared tasks: semantic role labeling (SEM; Hajič et al., 2009), dependency parsing (UDP; Zeman et al., 2018), verbal multi-word expression identification (VMWE; Ramisch et al., 2018), and quality estimation for machine translation (MT; Fonseca et al., 2019). For MT, we are not interested in the system outputs from the shared task; instead, we use the gold annotations for the quality estimation, which give us token-level error labels for the underlying machine translation outputs. Section 4.2 describes in detail how we assign error labels to these datasets.

Methodology
We train a classifier to predict errors made by NLP systems based on morphological features of the input tokens, in order to then analyze which morphological features (if any) are most predictive of such errors. We first describe how we obtain these features (Sec. 4.1) and how we classify when an NLP system has made an error (Sec. 4.2), then describe the classifier itself (Sec. 4.3).

Feature Extraction
We represent each token in the input data using a binary feature set. Each individual feature is named using the convention of {CATEGORY}={VALUE}, where the former is a feature category (such as POS for "part of speech") and the latter is a value within that category (e.g. VERB). We encode these features in a binary manner, i.e., for each feature in our inventory, that feature is either present or not present. Importantly, the classifier itself has no notion of "feature categories" as it only sees a single, binary feature vector. The full feature inventory is summarized in Table 1; what follows is a description of these features and how we derived them.
Morphological features Our morphological feature inventory consists of (i) Universal Dependencies (UD) features, (ii) lexical features, and (iii) string-based features.
UD features include the universal part-of-speech (POS) category and the universal feature set as defined by Universal Dependencies; e.g. U:POS=VERB or U:TENSE=PAST. 2 The UDP shared-task gold data already provides this annotation; for the other tasks, we obtain these features by running UDPipe 3 (Straka and Straková, 2017) with the largest pre-trained model for the language in question. 4 We complement this with the following additional lexical features: (i) SYNCRETIC specifies to what extent a token can be representative for several morphological feature sets: e.g., ask can be either U:MOOD=IND or U:MOOD=IMP, depending on context; (ii) AMBIG POS specifies to what extent the universal part-of-speech tag of the token can differ based on context: e.g., book could be either U:POS=VERB or U:POS=NOUN; and (iii) AMBIG LEX specifies whether or not the token belongs to multiple lexemes: e.g., ruling is a form of both '(to) rule' and '(the) ruling'. To determine these features for a given token, we use UDLexicons 5 (Sagot, 2018); in case a language is not covered by UDLexicons, we fall back to UniMorph 6 (Kirov et al., 2018).
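The UD part of this encoding can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes each token comes with a UPOS tag and a CoNLL-U-style FEATS string (such as Case=Ins|Number=Plur), and the function names ud_features and binarize, as well as the upper-casing of category and value names, are assumptions made for the example. The lexical features (SYNCRETIC, AMBIG POS, AMBIG LEX) are not covered.

```python
def ud_features(upos, feats):
    """Map a UPOS tag and a CoNLL-U FEATS string to {CATEGORY}={VALUE} names."""
    names = {f"U:POS={upos.upper()}"}
    if feats and feats != "_":  # "_" marks an empty FEATS column in CoNLL-U
        for pair in feats.split("|"):
            category, _, values = pair.partition("=")
            # CoNLL-U allows several comma-separated values per category.
            for value in values.split(","):
                names.add(f"U:{category.upper()}={value.upper()}")
    return names


def binarize(token_features, inventory):
    """Encode one token as a 0/1 vector over a fixed, ordered inventory."""
    return [1 if name in token_features else 0 for name in inventory]


# Hypothetical tokens: (UPOS, FEATS)
tokens = [
    ("NOUN", "Case=Ins|Gender=Neut|Number=Plur"),
    ("VERB", "Tense=Past"),
    ("PUNCT", "_"),
]
per_token = [ud_features(upos, feats) for upos, feats in tokens]
inventory = sorted(set().union(*per_token))
vectors = [binarize(features, inventory) for features in per_token]
```

The classifier only ever sees the rows of vectors; the category prefix in each name matters later, when whole categories are removed to measure their importance.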
Finally, we define purely string-based features based on comparing the token with its lemma. We perform character-based string alignment using Edlib (Šošić and Šikić, 2017) and derive the following features: (i) EDIT=PRE and EDIT=SUF when there is an edit at the beginning or the end of the sequence, respectively; (ii) EDIT=IN when there is an edit in the middle of the sequence; and (iii) EDIT=FULL when there is no character alignment between the strings. These features are intended to approximate prefixation, suffixation, infixation or other word-internal processes, and suppletion, respectively.
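The string-based features can be approximated with a short sketch. Note the substitution: the paper uses Edlib for the alignment, whereas this example uses Python's standard difflib.SequenceMatcher as a stand-in, which finds matching blocks rather than a minimal edit script and may therefore align some pairs differently. The function name edit_features and the exact decision rules are assumptions for illustration.

```python
from difflib import SequenceMatcher


def edit_features(token, lemma):
    """Derive EDIT=* features from a character alignment of lemma and token."""
    ops = SequenceMatcher(None, lemma, token, autojunk=False).get_opcodes()
    if not ops:  # both strings empty
        return set()
    if all(tag != "equal" for tag, *_ in ops):
        # No aligned characters at all: approximates suppletion.
        return {"EDIT=FULL"}
    features = set()
    last = len(ops) - 1
    for index, (tag, *_span) in enumerate(ops):
        if tag == "equal":
            continue
        if index == 0:          # edit at the beginning of the sequence
            features.add("EDIT=PRE")
        elif index == last:     # edit at the end of the sequence
            features.add("EDIT=SUF")
        else:                   # edit strictly inside the sequence
            features.add("EDIT=IN")
    return features


suffix_example = edit_features("imionami", "imię")
circumfix_example = edit_features("gemacht", "machen")
suppletion_example = edit_features("went", "go")
infix_example = edit_features("sang", "sing")
```

Because the opcodes cover both strings from start to end, the first and last operations identify edits at the edges, and any other non-matching operation is word-internal.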

Control features
To estimate the relative importance of our morphological features for the error prediction task, we additionally introduce a set of control features that are not morphologically motivated (cf. Tab. 1). These are (i) string length features, where each token is assigned exactly one such feature depending on its length; and (ii) token frequency bins. For the latter, we count token frequencies in the Universal Dependencies treebanks and assign each token a frequency feature. These features are based on frequency bins that we manually curated to provide a roughly balanced distribution of tokens to bins: e.g., FREQ=99 denotes a token that is in the 99th percentile of the frequency distribution of all types, while FREQ=RARE denotes a token occurring fewer than four times overall (see Table 1 for all definitions).
Table 1 (caption, in part): M_t is the set of all observed morphological feature combinations for a token t, and the edit features are based on the sequence of edit alignments between t and its lemma l.
Pruning and statistics Since very rare features are not very informative, for any given dataset, we remove features that occur fewer than 10 times in that dataset. Depending on the task and language, we generate between 17 and 120 unique features this way, with an average of 68.
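The pruning step is simple enough to sketch directly. The threshold of 10 comes from the text; the function name prune_rare_features, its return shape, and the toy data are assumptions for illustration.

```python
from collections import Counter


def prune_rare_features(token_feature_sets, min_count=10):
    """Drop features that occur on fewer than min_count tokens in the dataset."""
    counts = Counter(name for features in token_feature_sets for name in features)
    keep = {name for name, n in counts.items() if n >= min_count}
    pruned = [features & keep for features in token_feature_sets]
    return pruned, sorted(keep)


# Hypothetical dataset: U:POS=NOUN occurs 12 times, U:CASE=INS only 3 times.
dataset = [{"U:POS=NOUN", "U:CASE=INS"}] * 3 + [{"U:POS=NOUN"}] * 9
pruned, inventory = prune_rare_features(dataset)
```

Counting is done per dataset, so the same feature may survive pruning for one language or task and be dropped for another, which is one reason the number of features varies so widely.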

Classifying errors in system outputs
The target variable for our classifier is a binary label corresponding to whether or not the shared-task system has made an error on the input token. This requires comparing the outputs of a system to the gold data and classifying each token as either correct or incorrect. We will also refer to the latter as the error class. This classification follows the original evaluation criteria by the shared tasks to the extent possible.
For SEM, a prediction is classified as "correct" iff the semantic dependencies and label columns are an exact match with the gold data. For UDP, we do the same with the syntactic head and dependency relation columns; this is the same criterion that underlies the labeled attachment score (LAS) commonly used to evaluate dependency parsing. VMWE is a little more challenging since its prediction involves a set of tokens within a sentence. For each sentence, we match up each gold MWE with the predicted MWE that has the same label and the largest token overlap. We then consider a token "incorrectly" predicted if it has an MWE annotation that does not belong to one of these matched MWEs, or if it lacks an MWE annotation that it should have according to the gold data.
As mentioned before, we treat the MT data a little differently: here, the gold data provides binary labels in the form of "OK" and "BAD" tags, corresponding to the correctness of some machine translation system. These tags are provided both for tokens and gaps between tokens (to account for the deletion/insertion of words in machine translation). We use the token-level tags from the gold data directly as our error classification labels. Appendix A gives an example for the error classification approach on VMWE and MT.
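The UDP and VMWE criteria can be sketched as follows. This is one possible reading of the description above, not the authors' implementation: the data layout (one (head, deprel) pair per token, MWEs as (label, token index set) pairs), the function names, the tie-breaking rule, and the label "X" are all assumptions. The VMWE example mirrors the case discussed in Appendix A.

```python
def udp_token_errors(gold, predicted):
    """True = error. Each item is a (head, deprel) pair for one token."""
    return [g != p for g, p in zip(gold, predicted)]


def vmwe_token_errors(n_tokens, gold_mwes, predicted_mwes):
    """True = error. Each MWE is a (label, set_of_token_indices) pair."""
    errors = [False] * n_tokens
    matched = set()
    for label, gold_tokens in gold_mwes:
        # Candidate predictions: same label and at least one shared token.
        candidates = [
            (len(gold_tokens & pred_tokens), index)
            for index, (pred_label, pred_tokens) in enumerate(predicted_mwes)
            if pred_label == label and gold_tokens & pred_tokens
        ]
        if not candidates:
            for i in gold_tokens:      # gold MWE was missed entirely
                errors[i] = True
            continue
        # Largest overlap wins; ties go to the first candidate (an assumption).
        best = max(candidates, key=lambda c: c[0])[1]
        matched.add(best)
        # Tokens in only one of the two matched MWEs are errors.
        for i in gold_tokens ^ predicted_mwes[best][1]:
            errors[i] = True
    for index, (_, pred_tokens) in enumerate(predicted_mwes):
        if index not in matched:       # spurious predicted MWE
            for i in pred_tokens:
                errors[i] = True
    return errors


# UDP: the third token has the right head but the wrong relation.
udp_errors = udp_token_errors(
    [(2, "nsubj"), (0, "root"), (2, "obj")],
    [(2, "nsubj"), (0, "root"), (2, "nmod")],
)

# VMWE: tokens 0..3 stand for 'to', an intervening token, 'postawienie', 'sprawy'.
vmwe_errors = vmwe_token_errors(4, [("X", {2, 3})], [("X", {0, 2})])
```

For MT no such comparison is needed, since the "OK"/"BAD" token tags in the gold data are used as labels directly.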

Training classifiers
With the extracted features (from Sec. 4.1), we can now train classifiers to predict the error variable (from Sec. 4.2). Concretely, we train random forest classifiers (Breiman, 2001) as implemented by Scikit-learn 7 (Pedregosa et al., 2011) on each output file provided by each shared task. Random forests are ensembles of decision trees and are quick to train: the average training time on our datasets was 14 seconds on CPU, with no single run taking longer than five minutes.
As an alternative to random forests, we also experimented with randomized logistic regression classifiers followed by stability selection (Meinshausen and Bühlmann, 2010) to select predictive features. In our trials, this approach showed a worse performance (in terms of F1-score) compared to random forests, while also taking considerably longer to run (averaging 7 minutes per dataset). We therefore only report results with random forest classifiers.

Analysis
For each shared task (Sec. 3), we ran our classification pipeline (Sec. 4) separately for each combination of (i) system submission and (ii) language evaluated on. Since random forests are largely interpretable, our analysis focuses on the important features in our learned models.
First, though, we look at the overall F1-score of the individual classifiers, which we evaluate via stratified 5-fold cross-validation on each data point, i.e., each combination of system submission and language (Sec. 5.1). Additionally, to better estimate the importance of morphology, we run our cross-validation pipeline a second time without the morphological features, i.e., only providing the classifiers with the "control features" shown in Tab. 1. We refer to these two feature sets as "full" and "control" settings, respectively, and analyze their differences in F1-score (Sec. 5.2). 8 Finally, we analyse the importance of individual morphological features (Sec. 5.3).
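Stratification matters here because the error class can be rare. The following sketch shows the idea of stratified fold assignment in plain Python; it is deterministic (no shuffling) and is not the Scikit-learn routine, and the function name stratified_folds and the toy labels are assumptions for illustration.

```python
from collections import defaultdict


def stratified_folds(labels, k=5):
    """Assign example indices to k folds so each fold keeps the label ratio."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_label.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)  # round-robin within each class
    return folds


# Hypothetical output file: 10 erroneous tokens (1) and 40 correct ones (0).
labels = [1] * 10 + [0] * 40
folds = stratified_folds(labels, k=5)
errors_per_fold = [sum(labels[i] for i in fold) for fold in folds]
```

Each fold then serves once as held-out data while the classifier is trained on the other four, and the error-class F1-scores are averaged.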

How well do the classifiers predict errors?
To evaluate how well the full classifiers learned the task, we consider their F1-score for predicting the "error" class. Across all of our datasets, we observe a mean F1 of 0.43 with a standard deviation of ±0.18. Note that our setup is not comparable to most other NLP classification tasks: we evaluate a classifier trained to detect the errors of state-of-the-art systems, which means that (i) the task is inherently hard, as those systems are optimized to fix easily detectable errors, and (ii) there is no reason to assume a priori that this task is well learnable from morphological input features alone. Therefore, we believe an F1-score of 0.43, albeit with considerable variance in performance across tasks and languages, is a strong result.
Error rate There is one important aspect to consider: the frequency of the "error" class depends on the system performance of the data point we look at, and as such our class distribution can be highly imbalanced and varied. Indeed, F1-score and frequency of the error class correlate very strongly with Pearson's r = 0.93. Figure 2 plots this relationship. 9 This suggests that the errors introduced by state-of-the-art NLP systems, unsurprisingly, become harder and harder to predict the better the underlying systems perform. Note that data imbalance is in the nature of the error prediction task, as we expect errors in state-of-the-art systems to be rare. Additionally, different languages have differently-sized morphological tag inventories, affecting the total number of input features for the classifier. We do not attempt to apply data balancing techniques to counteract this, since this would make the task artificially easy and our results overly optimistic.
8 To complement the results and analyses presented here, we also provide a detailed table with the results for all task/language pairs in Appendix B.
9 It might look surprising that many data points have very high error rates, with some even going above 0.95; i.e., more than 95% of all predictions in the respective file are deemed to be "incorrect" according to the criteria in Sec. 4.2. Spot-checking reveals that this is, however, plausible: for example, in UDP, the average labeled attachment score (LAS) on the Thai TH_PUD treebank was only 1.38 (Zeman et al., 2018, Table 15), with 23 systems achieving a LAS of only 0.77 or lower (out of 100; cf. http://universaldependencies.org/conll18/results-las.html), which is reflected by an error rate of ≥99.23% in our data.
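For reference, the correlation reported above is the standard Pearson coefficient computed over one (error rate, F1) pair per data point. A minimal sketch follows; the numbers are invented for illustration and are not taken from the paper's results.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical (error rate, F1) pairs, one per system/language output file.
error_rates = [0.05, 0.10, 0.20, 0.40, 0.80]
f1_scores = [0.10, 0.22, 0.35, 0.55, 0.90]
r = pearson_r(error_rates, f1_scores)
```

A value of r close to 1 means that raw F1-scores mostly reflect how often the underlying system errs, which is why raw scores are hard to compare across data points.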

How important is morphology for predicting errors?
Figure 3 provides an alternative view of the F1-scores presented in Fig. 2, this time as a letter-value plot (Hofmann et al., 2017) showing quantiles of the F1 distribution. Additionally, we compare the classifier with the full feature set to the control set where morphological features were not included. We observe that the classifiers learn best on UDP followed by SEM, while classifier F1 is relatively poor on VMWE data. A probable explanation for this is the generally low error rate in VMWE (cf. Fig. 2). The other important observation is that classifiers in the "control" setting score consistently lower than the classifiers that have access to morphological features.
Importance by language For looking at individual languages, we restrict ourselves to the UDP data. Firstly, UDP covers 57 languages, more than any other task in our comparison, and there are no languages in the other tasks that are not also contained in UDP. Secondly, our classifier performance is generally highest on UDP (cf. Fig. 3), allowing for a more meaningful interpretation of results, particularly of selected features.
Furthermore, to factor out the effect of a data point's error rate (as discussed in Sec. 5.1), we look at the difference between the F1-score of the full classifier and the control classifier trained on the same data point. In other words, we define
$$\Delta F_1 = F_1(g_f) - F_1(g_c) \qquad (1)$$
where $g_f$ and $g_c$ are the classifiers with the full and the control feature set, respectively. This gives us a way to judge the importance of morphological features relative to the non-morphological ones while minimizing the effect of the error rate on the results, since ∆F1 no longer shows a strong correlation with the error rate (r = 0.29). Figure 4 (bottom half) shows the quartiles of ∆F1 scores by language in the UDP dataset. They span a wide range of values, with the median ∆F1 varying gradually between −0.03 (for Turkish, TUR) and 0.24 (for Nigerian Pidgin, PCM). Morphological features appear to be important for some languages while being unhelpful, and sometimes even detrimental, for others.
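The quantity ∆F1 can be illustrated with a small worked example. This sketch computes the error-class F1 from gold labels and two sets of predictions and takes their difference; in the paper the F1-scores are averaged over cross-validation folds, which is omitted here. The function names and the toy predictions are assumptions.

```python
def error_class_f1(gold, predicted):
    """F1 for the positive ("error") class; labels are 1 = error, 0 = correct."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def delta_f1(gold, full_predictions, control_predictions):
    """Difference between the full and the control classifier's error-class F1."""
    return error_class_f1(gold, full_predictions) - error_class_f1(
        gold, control_predictions
    )


# Hypothetical data point with 3 errors among 8 tokens.
gold = [1, 1, 1, 0, 0, 0, 0, 0]
full = [1, 1, 0, 1, 0, 0, 0, 0]      # 2 true positives, 1 false positive, 1 miss
control = [1, 0, 0, 1, 1, 0, 0, 0]   # 1 true positive, 2 false positives, 2 misses
gain = delta_f1(gold, full, control)
```

Here the full classifier reaches F1 = 2/3 and the control classifier F1 = 1/3, so the morphological features contribute a ∆F1 of 1/3 on this toy data point; a negative value would mean they hurt.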
Morphological complexity Are the differences in ∆F1 scores (in Fig. 4) somehow related to the morphological complexity of the languages? To analyze this relationship more systematically, we use the measure of morphological feature entropy (MFE) introduced by Çöltekin and Rama (2018). MFE is sensitive to both the size of a language's morphological feature inventory and its distribution, with a more uniform distribution of features resulting in a higher MFE. Since MFE is a treebank measure that relies on the association between tokens and morphological tags, it is affected by tokenization and annotation choices of the treebank used to calculate it; therefore, it can only be considered a rough approximation of the underlying language's complexity. Like Çöltekin and Rama (2018), we calculate the MFE score for each language on the UD treebanks. 10 The MFE score for each language is shown in the top half of Fig. 4. Surprisingly, we find a slight, negative correlation between MFE and ∆F1 (Pearson's r = −0.24). While languages with high MFE appear across the whole range of the ∆F1 distribution, a number of languages with low MFE (and thus deemed to be more morphologically simple), such as Thai (THA), Japanese (JPN), or Nigerian Pidgin (PCM), are found to profit more from the inclusion of morphological features. One possible explanation is that the control features are already very strong, which we will look at more closely in Sec. 5.3. Another possible factor is that morphologically complex languages introduce a much larger set of morphological features; if, for a given language, most of them are not relevant for predicting errors in the UDP task, they might hurt the overall classifier performance.
Figure 4: Classifier performance on UDP by language, sorted by median ∆F1, where ∆F1 is the difference in F1-scores between training with the full and the control feature set (cf. Eq. 1). Bottom half shows the quartiles of the ∆F1 distribution, top half shows the morphological feature entropy (MFE) for the given language; color shading is also based on MFE (with darker shade = higher MFE). Full names for all language codes as well as exact numeric values can be found in Appendix B.

What morphological features are most predictive of errors?
Morphological features provide a helpful signal to the classifiers, though its overall magnitude differs by language (cf. Sec. 5.2). Now, we ask which of the morphological features are particularly relevant for error prediction. Since plain feature importances of trained random forest classifiers can be misleading (Strobl et al., 2007; Parr et al., 2018), we follow the approach of explicitly removing features and retraining (Parr et al., 2018; Hooker and Mentch, 2019). Unlike the analyses above, we are not concerned with generalization here, but with identifying features that are especially predictive for the error variable on each dataset as a whole. Therefore, we do not use a cross-validation strategy, but rely on the full dataset for both training and obtaining feature importances. Concretely, for each feature category (as introduced in Sec. 4.1), we retrain the model without features from that category and note the drop in error-class F1-score compared to the model with the full feature set. Formally, let Φ be the full feature set and $\varphi_c \subset \Phi$ the subset of features belonging to category c (e.g., c = U:TENSE). The importance of category c is then defined as
$$f(c) = F_1(C_{\Phi}) - F_1(C_{\Phi \setminus \varphi_c}) \qquad (2)$$
where $C_X$ is a random forest classifier trained using feature set X. Higher values for f(c) mean a higher importance of category c, while negative values mean that including c is actually detrimental to the F1-score.
10 We use UD version 2.5 (Zeman et al., 2019).
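The drop-category procedure can be sketched end to end on toy data. To keep the example dependency-free, the random forest is replaced by a deliberately simple stand-in that memorises the majority label for each distinct feature set; this is an assumption made for illustration and behaves differently from a real random forest. As described above, training and scoring use the same data. All function names and the toy dataset are hypothetical.

```python
from collections import Counter, defaultdict


def error_class_f1(gold, predicted):
    """F1 for the positive ("error") class; labels are 1 = error, 0 = correct."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def fit_and_score(token_features, labels):
    """Stand-in for a trained classifier: majority label per feature set,
    scored on the same data it was fitted on (no held-out split)."""
    votes = defaultdict(Counter)
    for features, label in zip(token_features, labels):
        votes[frozenset(features)][label] += 1
    predicted = [votes[frozenset(f)].most_common(1)[0][0] for f in token_features]
    return error_class_f1(labels, predicted)


def category_importance(token_features, labels, category):
    """Drop in error-class F1 when all features of one category are removed."""
    full_score = fit_and_score(token_features, labels)
    reduced = [
        {name for name in features if name.split("=", 1)[0] != category}
        for features in token_features
    ]
    return full_score - fit_and_score(reduced, labels)


# Toy data: instrumental-case nouns are always errors, everything else is correct.
token_features = (
    [{"U:POS=NOUN", "U:CASE=INS"}] * 4
    + [{"U:POS=NOUN", "U:CASE=NOM"}] * 6
    + [{"U:POS=VERB"}] * 4
)
labels = [1] * 4 + [0] * 6 + [0] * 4

importance_case = category_importance(token_features, labels, "U:CASE")
importance_pos = category_importance(token_features, labels, "U:POS")
```

On this toy data, removing U:CASE makes the errors undetectable, so its importance equals the full F1-score, while removing U:POS costs nothing because case alone already separates the classes.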
Average feature importances Table 2 shows the top 10 feature categories for each task, averaged over all languages and datasets. The two control features, FREQ and LEN, always appear among the three most important categories, only trumped by U:POS for the UDP and SEM tasks. Notably, these three are the only feature categories that are guaranteed to appear with every token. It is no surprise that token frequency is strongly related to the likelihood of errors, while Zipf's law tells us that token length is strongly correlated with frequency. Figure 5 shows the distribution of feature importances for the top 10 categories of UDP (cf. Tab. 2b). U:POS spans a much wider range of FI values than any of the other categories, although the outliers at the upper end all come from Nigerian Pidgin (PCM). Moreover, categories with a low average FI (e.g., U:ASPECT or SYNCRETIC) do not show outliers, i.e., are of low importance across languages. This is also true for the remaining feature categories.
Individual part-of-speech tags Since U:POS is an important feature category across tasks (cf. Tab. 2), we also look at feature importances for individual POS tags. For this, we use the same approach as for the feature categories (cf. Eq. 2), except that we now only remove a single U:POS feature from Φ at a time. Table 3 shows the average feature importances for individual U:POS features, though this time we restrict ourselves to the subset of languages in UDP that are also covered in SEM. 11 This way, we can better isolate the task-specific differences in FI scores, without conflating them with the different language-specific distributions of part-of-speech tags that may affect these results. We find that adverbs (ADV) are the most important part-of-speech category for both tasks, while INTJ and PART are found to be important for predicting errors in UDP, but not in SEM. This aligns with our intuitions about what is hard in syntactic and semantic parsing, further supporting the validity of our approach.
11 These are Catalan, Czech, German, English, Japanese, Spanish, and Chinese; cf. Appendix B.

Conclusion
We presented a large-scale error analysis focusing on the role of morphology. Our analysis spans a range of morphological variables, four NLP tasks, and up to 57 languages. We confirm the common conjecture that morphological variables, especially case and gender, are predictive of errors across NLP tasks and languages. Somewhat surprisingly, we found that the usefulness of morphological variables is negatively correlated with the morphological complexity of the language in question. We speculate this is because morphological information is more discriminative in morphologically simple languages.

A Examples for error classification
Table 4a shows an example for how we classify errors (cf. Sec. 4.2) in the VMWE dataset on verbal multi-word expression (MWE) identification.
In the gold data, a single MWE ('postawienie sprawy') is annotated, while the NLP system has incorrectly identified the MWE as being 'to . . . postawienie'. The annotation "1" here is an ID in case there are multiple MWEs within the same sentence. We annotate both 'to', which was mistakenly identified as part of the MWE, as well as 'sprawy', which was mistakenly left out, as an error. All remaining tokens are marked as correct.
Table 4b shows an example from the MT dataset on quality estimation for machine translation (MT). Here, the gold data provides us with "OK" and "BAD" labels for the individual tokens of the machine-generated translation as well as for the gaps between the tokens. The latter is done to be able to annotate missing passages in the machine translation output; i.e., a gap between tokens would be labelled "BAD" if the MT system should have produced more output at a given position in a sentence than it did. Since it is unclear to which (existing) tokens these "gap annotations" should be ascribed, we do not consider them for the error classification, and only consider "OK/BAD" labels for the tokens that do appear in the data.
Table 5 presents statistics and classifier results, corresponding to the analyses in Secs. 5.1 and 5.2, for each task/language pair. The column "Avg. error rate" corresponds to the error rates plotted in Fig. 2, while the "MFE" column shows the morphological feature entropy (cf. Sec. 5.2) for the respective language. "Avg. F1" shows the average F1-score after stratified 5-fold cross-validation (cf. Sec. 5.1), while "Avg. ∆F1" corresponds to the ∆F1-measure defined in Eq. (1).