Findings of the SIGMORPHON 2021 Shared Task on Unsupervised Morphological Paradigm Clustering

We describe the second SIGMORPHON shared task on unsupervised morphology: the goal of the SIGMORPHON 2021 Shared Task on Unsupervised Morphological Paradigm Clustering is to cluster word types from a raw text corpus into paradigms. To this end, we release corpora for 5 development and 9 test languages, as well as gold partial paradigms for evaluation. We receive 14 submissions from 4 teams that follow different strategies, and the best performing system is based on adaptor grammars. Results vary significantly across languages. However, all systems are outperformed by a supervised lemmatizer, implying that there is still room for improvement.


Introduction
In recent years, most research in the area of computational morphology has focused on the application of supervised machine learning methods to word inflection: generating the inflected forms of a word, often a lemma, in order to express certain grammatical properties. For example, a supervised inflection system for Spanish might be provided with a lemma disfrutar (English: to enjoy) and morphological features such as indicative, present tense, and 1st person singular, and generate the corresponding inflected form disfruto as output.
However, a supervised machine learning setup is quite different from a human first language (L1) acquisition setting. Young children must learn to segment a continuous speech signal into discrete words and perform unsupervised classification, decoding, and, eventually, inference with incomplete feedback on this noisy input. The task of unsupervised paradigm clustering aims to replicate one of the steps in this process, namely the grouping of word forms belonging to the same lexeme into inflectional paradigms. In this unsupervised task, a system has no knowledge of lemmas; furthermore, it knows neither (a) the features for which a lemma typically inflects, nor (b) the number of distinct inflected forms which constitute the paradigm.
A successful unsupervised paradigm clustering system leverages common patterns in the language's inflectional morphology while simultaneously ignoring regular circumstantial similarities along with derivational patterns. For example, an accurate unsupervised system must recognize that disfrutamos (English: we enjoy) and disfruta (English: he/she/it enjoys) are inflected variants of the same paradigm, but that the orthographically similar disparamos (English: we shoot) belongs to a separate paradigm. Likewise, a successful system for English will recognize that walk and walked belong to the same verbal paradigm, but that walker is a derived form belonging to a distinct nominal paradigm. Such fine-grained distinctions are difficult to learn in an unsupervised manner. This paper describes the SIGMORPHON 2021 Shared Task on Unsupervised Morphological Paradigm Clustering. Participants are asked to submit systems which cluster words from the Bible into inflectional paradigms. Participants are not allowed to use any external resources. Four teams submit at least one system for the shared task, and all teams also submit a system description paper.
The shared task systems can be grouped into two broad categories. Similarity-based systems experiment with different combinations of orthographic and embedding-based similarity metrics for word forms, combined with clustering methods like k-means or agglomerative clustering. Grammar-based methods instead learn grammars or rules from the data and either apply these to clustering directly, or first segment words into stems and affixes and then cluster forms which share a stem into paradigms. Our official baseline, described in Section 2.3, is based on grouping together word forms sharing a common substring of length ≥ k, where k is a hyperparameter. Grammar-based systems obtain higher average F1 scores (see Section 2.2 for details on evaluation) across the nine test languages than the baseline. The Edinburgh system has the best overall performance: it outperforms the baseline by 34.61% F1 and the second best system by 1.84% F1.
The rest of the paper is organized as follows: Section 2 describes the task of unsupervised morphological paradigm clustering in detail, including the official baseline and all provided datasets. Section 3 gives an overview of the participating systems. Section 4 describes the official results, and Section 5 presents an analysis. Finally, Section 6 contains a discussion of where the task can move in future iterations and concludes the paper.

Task Description
Unsupervised morphological paradigm clustering consists of, given a raw text corpus, grouping words from that corpus into their paradigms without any additional information. Recent work in unsupervised morphology has attempted to induce full paradigms from corpora containing only a subset of all types. Erdmann et al. (2020) explore initial approaches to this task, which is called unsupervised morphological paradigm completion, but find it to be challenging. Building upon the SIGMORPHON 2020 Shared Task on Unsupervised Morphological Paradigm Completion, our shared task focuses on a subset of the overall problem: sorting words into paradigms. This can be seen as an initial step toward paradigm completion, as unobserved types do not need to be induced, and the inflectional categories of paradigm slots do not need to be considered. Our languages span 4 writing systems and represent fusional, agglutinative, templatic, and polysynthetic morphologies. The languages in the development set are mostly suffixing, except for Maltese, which is a templatic language. While most of the test languages are also predominantly suffixing, Navajo employs prefixes and Basque uses both prefixes and suffixes.
Text Corpora We provide corpora from the Johns Hopkins University Bible Corpus (JHUBC) (McCarthy et al., 2020b) for all development and test languages. This is the only resource that systems are allowed to use.
Gold Partial Paradigms Along with the Bibles, we also release a set of gold partial paradigms for the development languages to be used for system development. Gold data sets are also compiled for the test languages, but these test sets are withheld until the completion of the shared task.
In order to produce gold partial paradigms, we first take the set of all paradigms Π for each language from UniMorph (McCarthy et al., 2020a). We then obtain gold partial paradigms Π̂_G = Π ∩ Σ, where Σ is the set of types attested in the Bible corpus. Finally, we sample up to 1000 of the resulting gold partial paradigms for each language, resulting in the set Π_G, according to the following steps:

1. Group the gold paradigms in Π̂_G by size, resulting in the set G, where g_k ∈ G is the group of paradigms with k forms each.

2. Continually loop over all g_k ∈ G and randomly sample one paradigm from g_k until we have 1000 paradigms.
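The two-step sampling procedure above can be sketched as follows (a minimal illustration; function and variable names are our own, and the official script may differ in details):

```python
import random
from collections import defaultdict

def sample_paradigms(gold_paradigms, n=1000, seed=0):
    """Round-robin sample up to n paradigms, stratified by paradigm size."""
    rng = random.Random(seed)
    # Step 1: group paradigms by the number of forms they contain.
    by_size = defaultdict(list)
    for paradigm in gold_paradigms:
        by_size[len(paradigm)].append(paradigm)
    groups = list(by_size.values())
    for g in groups:
        rng.shuffle(g)
    # Step 2: loop over the size groups, drawing one paradigm from each
    # in turn, until n paradigms are sampled or none remain.
    sampled = []
    while len(sampled) < n and any(groups):
        for g in groups:
            if g and len(sampled) < n:
                sampled.append(g.pop())
    return sampled
```

The round-robin loop ensures that rare paradigm sizes are represented in the sample even when one size group dominates the data.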
Because not every token in the Bible corpora is in UniMorph, we can only evaluate on the subset of paradigms that exist in the UniMorph database. In practice, this means that for several languages, we are not able to sample 1000 paradigms, cf. Tables 1 and 2. Notably, for Basque, we can only provide 12 paradigms.

Figure 2: An example matching of predicted paradigms in blue and a gold paradigm in green. Words in red do not exist in the gold set, and thus cannot be evaluated.

Evaluation
As our task is entirely unsupervised, evaluation is not straightforward: as in prior work, our evaluation requires a mapping from predicted paradigms to gold paradigms. Because our set of gold partial paradigms does not cover all words in the corpus, in practice we only evaluate against a subset of the clusters predicted by systems.
For these reasons, we want an evaluation that assesses the best matching paradigms, ignoring predicted forms that do not occur in the gold set, but still punishing spurious predictions that are in the gold set. For example, Figure 2 shows two candidate matches for a gold partial paradigm. Each one contains a word that does not exist in the set of gold paradigms and thus cannot be judged; these words are ignored and do not affect evaluation. In this example, the predicted P1 is the better match, resulting in a perfect F1 score. However, our evaluation punishes systems for predicting a second paradigm, P2, with words from G1, reducing the overall precision score of this submission.
Building upon BMAcc (Jin et al., 2020), we use best-match F1 score for evaluation. We define a paradigm π as a set of word forms f ∈ π; duplicate forms within π (syncretism) are discarded. Given a set of gold partial paradigms π_g ∈ Π_G, a set of predicted paradigms π_p ∈ Π_P, a gold vocabulary Σ_g = ∪ π_g, and a predicted vocabulary Σ_p = ∪ π_p, evaluation works according to the following steps:

1. Redefine each predicted paradigm by removing the words that we cannot evaluate, π_p = π_p ∩ Σ_g, to form a set of pruned paradigms Π_P.

2. Build a complete bipartite graph over Π_P and Π_G, where the edge weight between π_g^i and π_p^j is the number of true positives |π_g^i ∩ π_p^j|.

3. Compute the maximum-weight full matching using the algorithm of Karp (1980) in order to find the optimal alignment between Π_P and Π_G.

4. Assign all predicted words in Σ_p and all gold words in Σ_g a label corresponding to the gold paradigm, according to the matching found in step 3. Any unmatched w_p^i ∈ Σ_p is assigned a label corresponding to a spurious paradigm.

5. Compute the F1 score between the sets of labeled words in Σ_p and Σ_g.
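A minimal sketch of this evaluation, assuming SciPy's `linear_sum_assignment` as a stand-in for the maximum-weight matching step (the official implementation uses Karp's algorithm; all names here are our own):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def best_match_f1(predicted, gold):
    """Best-match F1 between predicted and gold partial paradigms.

    `predicted` and `gold` are lists of sets of word forms."""
    gold_vocab = set().union(*gold)
    # Step 1: prune predicted paradigms to words we can evaluate.
    pruned = [p & gold_vocab for p in predicted]
    # Step 2: edge weights are true-positive counts between paradigm pairs.
    weights = np.array([[len(p & g) for g in gold] for p in pruned])
    # Step 3: maximum-weight matching over the bipartite graph.
    rows, cols = linear_sum_assignment(weights, maximize=True)
    tp = weights[rows, cols].sum()
    # Steps 4-5: words left out of the matching count against
    # precision (pruned predicted words) and recall (gold words).
    n_pred = sum(len(p) for p in pruned)
    n_gold = sum(len(g) for g in gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0
```

On the Figure 2 example, a predicted paradigm that covers the gold paradigm plus one out-of-gold word scores a perfect F1, while splitting the gold forms over two predicted paradigms is penalized.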

Baseline System
We provide a straightforward baseline that constructs paradigms based on substring overlap between words: words that share a substring of length ≥ k are grouped into a paradigm. Since words can share multiple substrings, it is possible that multiple identical, redundant paradigms are created; we reduce these to a single paradigm. Words that do not belong to any cluster are assigned a singleton paradigm, that is, a paradigm consisting of only that word. We tune k on the development sets and find that k = 5 works best on average. This means that a word of fewer than 5 characters can only ever form a singleton paradigm.
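The baseline can be sketched as follows. This reading merges words transitively through shared k-grams via union-find; the official implementation may differ in details, and all names are our own:

```python
def baseline_paradigms(words, k=5):
    """Cluster words that share any substring of length >= k."""
    # Map each k-gram to the set of words containing it; a word shorter
    # than k contributes no k-grams and stays a singleton.
    index = {}
    for w in words:
        for i in range(len(w) - k + 1):
            index.setdefault(w[i:i + k], set()).add(w)
    # Union-find: merge all words connected through a shared k-gram.
    parent = {w: w for w in words}
    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path halving
            w = parent[w]
        return w
    for group in index.values():
        group = sorted(group)
        for other in group[1:]:
            parent[find(other)] = find(group[0])
    # Collect clusters; unmerged words become singleton paradigms.
    clusters = {}
    for w in words:
        clusters.setdefault(find(w), set()).add(w)
    return list(clusters.values())
```

With k = 5, disfrutamos and disfruta share the 5-gram "disfr" and are grouped together, while disparamos shares no 5-gram with either and forms its own paradigm.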

Submitted Systems
The Boulder-Perkoff-Daniels-Palmer team (Boulder-PDP; Perkoff et al., 2021) participates with four submissions, resulting from experiments with two different systems. Both systems apply k-means clustering to vector representations of input words. They differ in the type of vector representations used: either orthographic or semantic representations. Semantic skip-gram representations are generated using word2vec (Mikolov et al., 2013). For the orthographic representations, each word is encoded into a vector of fixed dimensionality equal to the length |w_max| of the longest word w_max in the input corpus. They associate each character c ∈ Σ in the alphabet of the input corpus with a real number r ∈ [0, 1] and assign v_i := r if the i-th character of the input word w is c. If |w| < |w_max|, the remaining entries are set to 0.
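This orthographic encoding can be sketched as follows. The specific choice of character-to-real mapping is our own assumption (evenly spaced values in (0, 1]); the description only requires that each character receive some real number in [0, 1]:

```python
def orthographic_vectors(words):
    """Encode each word as a fixed-length vector of per-character codes."""
    max_len = max(len(w) for w in words)
    alphabet = sorted({c for w in words for c in w})
    # Assumed mapping: assign each character an evenly spaced real in (0, 1].
    code = {c: (i + 1) / len(alphabet) for i, c in enumerate(alphabet)}
    vectors = []
    for w in words:
        v = [code[c] for c in w]
        v += [0.0] * (max_len - len(w))  # pad shorter words with zeros
        vectors.append(v)
    return vectors
```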
The number of clusters is a hyperparameter of the k-means clustering algorithm. In order to set this hyperparameter, Perkoff et al. (2021) experiment with a graph-based method. The word types in the corpus form the nodes of a graph, where the neighborhood of a word w consists of all words sharing a maximal substring with w. The graph is split into highly connected subgraphs (HCS) containing n nodes, where the number of edges that need to be cut in order to split a subgraph into two disconnected components is > n/2 (Hartuv and Shamir, 2000). The number of HCSs is then taken to be the number of clusters. In practice, however, the graph-clustering step proves to be prohibitively slow, and results for the test languages are submitted using fixed cluster counts of 500, 1000, 1500, and 1900. In experiments on the development languages, they find that the orthographic representations outperform the semantic representations for all languages, and thus submit four systems utilizing orthographic representations.
The Boulder-Gerlach-Wiemerslage-Kann team (Boulder-GWK; Gerlach et al., 2021) submits two systems based on an unsupervised lemmatization system originally proposed by Rosa and Zabokrtský (2019). Their approach is based on agglomerative hierarchical clustering of word types, where the distance between word types is computed as a combination of a string distance metric and the cosine distance of fastText embeddings (Bojanowski et al., 2017); their choice of fastText embeddings is due to the limited size of the shared task datasets. Two variants of edit distance are compared to quantify string distance: (1) Jaro-Winkler distance (Winkler, 1990), which resembles regular edit distance but emphasizes similarity at the start of strings, likely biasing the system toward languages expressing inflection via suffixation; and (2) a weighted variant of edit distance, where the costs for insertions, deletions, and substitutions are derived from a character-based language model trained on the shared task data.
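The general shape of such a combined distance can be sketched as follows. For brevity, this uses plain Levenshtein distance rather than Jaro-Winkler or the weighted variant, and the mixing weight `alpha` and the normalization are our own assumptions:

```python
import math

def edit_distance(a, b):
    """Standard Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def combined_distance(a, b, vec_a, vec_b, alpha=0.5):
    """Mix normalized edit distance with embedding cosine distance."""
    string_d = edit_distance(a, b) / max(len(a), len(b))
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norm = math.sqrt(sum(x * x for x in vec_a)) * math.sqrt(sum(x * x for x in vec_b))
    cosine_d = 1.0 - dot / norm
    return alpha * string_d + (1 - alpha) * cosine_d
```

A distance matrix built from `combined_distance` over all word pairs can then be fed to an off-the-shelf agglomerative clustering routine.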
The CU-UBC team (Yang et al., 2021) provides systems that build upon the official shared task baseline: given the pseudo-paradigms found by the baseline, they extract inflection rules of multiple types. Comparing pairs of words in each paradigm, they learn both continuous and discontinuous character sequences that transform the first word into the second, following work on supervised inflectional morphology such as Durrett and DeNero (2013) and Hulden et al. (2014). Rules are sorted by frequency to separate genuine inflectional patterns from noise. Starting from a random seed word, paradigms are constructed by iteratively applying the most frequent rules. Generated paradigms are further tested for paradigm coherence using metrics such as graph degree calculation and fastText embedding similarity.

Table 3: Results on all test languages for all systems in %; the official shared task metric is best-match F1. To provide a more complete picture, we also show precision and recall. Stanza is a supervised system.

The Edinburgh team (McCurdy et al., 2021) submits a system based on adaptor grammars (Johnson et al., 2007) modeling word structure. Their work draws on parallels between the unsupervised paradigm clustering task and unsupervised morphological segmentation. Their grammars segment word forms in the shared task corpora into a sequence of zero or more prefixes and a single stem followed by zero or more suffixes. Based on the segmented words from the raw text data, they then determine whether the language uses prefixes or suffixes for inflection. The final stem for words in a predominantly suffixing language then consists of the prefixes and stem identified by the adaptor grammar. For a predominantly prefixing language, the final stem instead contains all suffixes of the word form. The team notes that this approach is unsuitable for languages which make extensive use of both prefixes and suffixes, such as Basque.
Finally, they group all words which share the same stem into paradigms. However, because sampling from an adaptor grammar is nondeterministic (i.e., the system may return multiple possible segmentations for a single word form), they construct preliminary clusters by including all forms which might share a given stem. They then select the cluster that maximizes a score based on the frequency of occurrence of the induced segment across all segmentations.
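A simplified sketch of this stem-based grouping, assuming each word comes with a list of candidate stems from the segmenter and using corpus-wide stem frequency as the selection score (the actual scoring in the Edinburgh system is more involved; all names here are our own):

```python
from collections import Counter, defaultdict

def cluster_by_stem(segmentations):
    """Group words into paradigms by their best-scoring candidate stem.

    `segmentations` maps each word to a list of candidate stems (the
    possibly multiple analyses returned by a nondeterministic segmenter)."""
    # Score each stem by how often it appears across all analyses.
    stem_freq = Counter(s for stems in segmentations.values() for s in stems)
    clusters = defaultdict(set)
    for word, stems in segmentations.items():
        # Pick the candidate stem with the highest corpus-wide frequency
        # (ties broken alphabetically for determinism).
        best = max(stems, key=lambda s: (stem_freq[s], s))
        clusters[best].add(word)
    return dict(clusters)
```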

Results and Discussion
The official results obtained by all submitted systems on the test sets are shown in Table 3.
The Edinburgh system performs best overall with an average best-match F1 of 67.96%. In general, grammar-based systems attain the best results, with all of the CU-UBC systems and the Edinburgh system outperforming the baseline by at least 23.06% F1. The Boulder-GWK and Boulder-PDP systems, both of which perform clustering over word representations, approach but do not exceed baseline performance. Perkoff et al. (2021) find that clustering over word2vec embeddings performs poorly on the development languages, and their scores on the test set reflect clusters found with vectors based purely on orthography. The Boulder-GWK systems contain incomplete results, and partial evidence suggests that their clustering method, which combines fastText embeddings trained on the provided Bible corpora with edit distance, can indeed outperform the baseline. However, it likely cannot outperform the grammar-based submissions.
For comparison, we also evaluate a supervised lemmatizer from the Stanza toolkit (Qi et al., 2020). The Stanza lemmatizer is a neural network model trained on Universal Dependencies (UD) treebanks (Nivre et al., 2020), which first tags for part of speech and then uses these tags to generate lemmas for a given word. Because the current UD version contains no corpus for Navajo or Kannada, we do not have scores for those languages. Stanza's accuracy on our task is far lower than that reported for lemmatization on UD data. We note, however, that 1) our data is from a different domain, 2) Biblical language in particular can differ strongly from contemporary text, and 3) we evaluate on only a partial set of types in the corpus, which could represent a particularly challenging set of paradigms for some languages. The Stanza lemmatizer outperforms all systems for all languages except German. This is unsurprising, as it is a supervised system, though it is interesting that its German score falls short of that of the Edinburgh system.

Overgeneralization/Underspecification When acquiring language, children often overgeneralize morphological analogies to new, ungrammatical forms. For example, the past tense of the English verb to know might be expressed as knowed, rather than the irregular knew. The same behavior can also be observed in learning algorithms at some point during the learning process. This is reflected to some extent in Table 3 by trade-offs between precision and recall. Low precision but high recall indicates that a system is overgeneralizing: some surface forms are erroneously assigned to too many paradigms. In effect, these systems are hypothesizing that a substring is productive, and thus proposing a paradigmatic relationship between two words. For example, the English words approach and approve share the stem appro- with unproductive segments as suffixes.
The baseline tends to overgeneralize due to its creation of large paradigms via a naive grouping of words by shared n-grams.
On the other hand, several systems seem to underspecify, as indicated by their low recall. Low recall but high precision indicates that a system does not attribute inflected forms to the paradigm that the form does in fact belong to. This can be caused by suppletion in systems based purely on orthography, for example, generating a paradigm with go and goes but attributing went to a separate paradigm. Underspecification is apparent in the CU-UBC submissions that rely on discontinuous rules (CU-UBC 5, 6, and 7). This is likely because these systems are filtered down to far fewer rules than the prefix/suffix systems, in order to avoid the severe overgeneralization that can result from spurious morphemes based on discontinuous substrings. Similarly, the Boulder-GWK systems both have reasonable precision but very low recall. The team reports that, due to time constraints, they ignore any word below a certain corpus frequency, thus creating small paradigms and ignoring many words completely.
Language and Typology In general, we find that Basque and Navajo are the two most difficult test languages. Both languages have relatively small corpora and are typologically agglutinative: that is, they express inflection via the concatenation of potentially many morpheme segments, which can result in a large number of unique surface forms. Both languages thus have relatively high type-token ratios (TTR), especially Navajo, which has the highest TTR, cf. Table 2. It is also important to note that both Basque and Navajo have comparatively small sets of paradigms against which we evaluate. This leaves the possibility that the subset of paradigms in the gold set is particularly challenging. However, the differences between system scores indicate that these two languages do offer challenges related to their morphology.

Figure 3: Singleton paradigm counts for the best performing system on all test languages. Languages for which we have more than 100 paradigms are on the left, and those for which we have fewer than 100 paradigms are on the right. Predicted singleton paradigms are in red and blue; gold singleton paradigms are in grey.

Figure 4: The F1 score across paradigm sizes for the best performing system on all test languages. From left to right, the graphs represent the groups of languages in increasing order of how well systems typically performed on them. F1 scores are interpolated for paradigm sizes that do not exist in a given language.
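For reference, the type-token ratio used in this comparison is simply the number of distinct word types divided by the total number of tokens:

```python
def type_token_ratio(tokens):
    """Type-token ratio: distinct word types over total token count."""
    return len(set(tokens)) / len(tokens)
```

A high TTR, as in Navajo, means most surface forms occur only a handful of times, leaving little distributional evidence for clustering.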
Navajo is a predominantly prefixing language, the only one in the development and test sets, and Basque also inflects using prefixes, though to a lesser extent. The top two performing systems both obtain low scores for Navajo. The CU-UBC-2 system considers only suffix rules, which results in it being the lowest performing CU-UBC system on Navajo. The Edinburgh submission should be able to identify prefixes and consider the suffix to be part of the stem in Navajo. However, the large number of types for a relatively small Navajo corpus may cause difficulties for their algorithm, which builds clusters based on affix frequency. Notably, the CU-UBC-7 system, which learns discontinuous rules rather than rules that model strictly concatenative morphology, outperforms the overall best system, which relies on strictly concatenative grammars, on Navajo by a large margin. It also performs best on Basque, though by a smaller margin. Another difficulty in Navajo morphology is that it exhibits verbal stem alternation for expressing mood, tense, and aspect, which creates challenges for systems that rely on rewrite rules or string similarity based on continuous substrings. For instance, our evaluation algorithm aligns a singleton predicted paradigm to the gold paradigm in Table 4 for nearly all systems.
On Basque, most systems perform poorly. McCurdy et al. (2021), the best performing system overall, obtains a low score for Basque, which may be due to their system's assumption that a language inflects either via prefixation or via suffixation, but not via both, as Basque does. Other systems, however, attain similarly low scores for Basque.
The next tier of difficulty seems to comprise Finnish, Kannada, and Turkish, on which most systems obtain low scores. All of these languages are suffixing, but also have an agglutinative morphology. The largest paradigm of each of these three languages is among the four largest paradigms in Table 2. This implies that large paradigm sizes and large numbers of distinct inflectional morphemes (two properties often assumed to correlate with agglutinative morphology), coupled with sparse corpora to learn from, offer challenges for paradigm clustering. Though agglutinative morphology, with its relatively unchanged morphemes across words, might be simpler for automatic segmentation systems than fusional morphology, our sparse data sets likely complicate this.
Finally, systems obtain the best results for English, followed by Spanish, and then Bulgarian. These three languages are also strongly suffixing, but typically express inflection with a single morpheme. German appears to be a bit of an outlier, generally exhibiting scores that lie somewhere between the highest scoring languages, and the more difficult agglutinative languages. McCurdy et al. (2021) hypothesize that this may be due to nonconcatenative morphology from German verbal circumfixes. This hypothesis could explain why the Boulder-GWK system performs better on German than other languages: it incorporates semantic information. However, the CU-UBC systems that use discontinuous rules (systems 5, 6, and 7), and thus should better model circumfixation, do not produce higher German scores than the continuous rules, including the suffix-only system.

Analysis: Partial Paradigm Sizes
The effect of the size of the gold partial paradigms on F1 score for the best system is illustrated in Figure 4. For Basque and Navajo, the F1 score tends to drop as paradigm size increases. We see the same trend for Finnish, Kannada, and German, with a few exceptions, but this trend does not hold for all languages. The curve for English resembles a bell shape, apart from the low-scoring outlier for the largest paradigms, of size 7. Interestingly, Spanish and Turkish attain both very high and very low scores for larger paradigms.
An artifact of a sparse corpus is that many singleton paradigms arise: for theoretically larger paradigms, only a single inflected form might occur in such a small corpus. Of course, this also happens naturally for certain word classes. However, nouns, verbs, and occasionally adjectives typically form paradigms comprising several inflected forms. Figure 3 demonstrates that the best system tends to overgenerate singleton paradigms. We see this to some extent for all agglutinative languages, which may be due to the high number of typically long, unique forms. This is especially true for Navajo, which has a small corpus and an extremely high type-token ratio. On the other hand, for the languages for which the highest scores are obtained, Spanish and English, the system does not overgenerate singleton paradigms. Of the large number of singleton paradigms predicted for both languages, the vast majority are correct. For other systems, not pictured in the figure, singleton paradigms are typically undergenerated for Spanish and English. In the case of English, this could be due to words that share a derivational relationship. For example, the word accomplishment might be assigned to the paradigm of the verb accomplish, when, in fact, their relationship is not inflectional.

Conclusion and Future Shared Tasks
We presented the SIGMORPHON 2021 Shared Task on Unsupervised Morphological Paradigm Clustering. Submissions roughly fell into two categories: similarity-based methods and grammar-based methods, with the latter proving more successful at the task of clustering inflectional paradigms. The best systems significantly improved over the provided n-gram baseline, roughly doubling the F1 score, mostly through much improved precision. A comparison against a supervised lemmatizer demonstrated that we have not yet reached the ceiling for paradigm clustering: many words are still either incorrectly left in singleton paradigms or incorrectly clustered with circumstantially (and often derivationally) related words. Regardless of the ground still to be covered, the submitted results were a successful first step toward automatically inducing the morphology of a language without access to expert-annotated data.
Unsupervised morphological paradigm clustering is only the first step in a morphological learning process that more closely models human L1 acquisition. We envision future tasks expanding on this task to include other important aspects of morphological acquisition. Paradigm slot categorization is a natural next step. To correctly categorize paradigm slots, cross-paradigmatic similarities must be considered: for example, the German words liest and schreibt are both 3rd person singular present indicative inflections of two different verbs. This can occasionally be identified via string similarity, but more often requires syntactic information. Syncretism (the collapsing of multiple paradigm slots into a single representation) further complicates the task. A similar subtask involves lemma identification, where a canonical form (Cotterell et al., 2016b) is identified within the paradigm. Likewise, another important task involves filling unrealized slots in paradigms by generating the correct surface form. This can be approached similarly to previous SIGMORPHON shared tasks on inflection (Cotterell et al., 2016a, 2017; McCarthy et al., 2019; Vylomova et al., 2020), but will likely have to rely on noisy information from the slot categorization step; all previous tasks have assumed that the morphosyntactic information provided to an inflector is correct. Currently, investigations into the robustness of these systems to noise are sparse.
Another direction for this task is the expansion to more under-resourced languages. The submitted results demonstrate that the task becomes particularly difficult when the provided raw text corpus is small, but under-documented languages are often the ones most in need of morphological corpora. The JHUBC contains Bible data for more than 1500 languages, which can potentially be augmented with other raw text corpora, because morphology is relatively stable across domains. Future tasks may thus enable the construction of inflectional paradigms for languages that need them in order to build further computational tools.