MorphyNet: a Large Multilingual Database of Derivational and Inflectional Morphology

Large-scale morphological databases provide essential input to a wide range of NLP applications. Inflectional data is of particular importance for morphologically rich (agglutinative and highly inflecting) languages, and derivations can be used, e.g. to infer the semantics of out-of-vocabulary words. Extending the scope of state-of-the-art multilingual morphological databases, we announce the release of MorphyNet, a high-quality resource with 15 languages, 519k derivational and 10.1M inflectional entries, and a rich set of morphological features. MorphyNet was extracted from Wiktionary using both hand-crafted and automated methods, and was manually evaluated to be of a precision higher than 98%. Both the resource generation logic and the resulting database are made freely available and are reusable as stand-alone tools or in combination with existing resources.


Introduction
Despite repeated paradigm shifts in computational linguistics and natural language processing, morphological analysis and its related tasks, such as lemmatization, stemming, or compound splitting, have always remained essential components within language processing systems. Recently, in the context of language models based on subword embeddings, a morphologically meaningful splitting of words has been shown to improve the efficiency of downstream tasks (Devlin et al., 2019; Sennrich et al., 2016; Bojanowski et al., 2017; Provilkov et al., 2020). In particular, the reintroduction of linguistically motivated approaches and high-quality linguistic resources into deep learning architectures has been crucial for dealing more efficiently with morphologically rich (highly inflecting, agglutinative) languages (Pinnis et al., 2017; Ataman and Federico, 2018; Gerz et al., 2018).
In response to such needs, and as simple and convenient substitutes for monolingual morphological analyzers, multilingual morphological databases have been developed, indicating for each word form entry one or more corresponding root or dictionary entries, as well as an analysis (features) (Metheniti and Neumann, 2020; Vidra et al., 2019). The precision and recall of these resources vary widely, and there is still a lot of ground to cover with respect to the support of new languages, the modelling of the inflectional and derivational complexity of each language, as well as the richness of the information (features, affixes, parts of speech, etc.) provided.
As a further step towards extending online morphological data, we introduce MorphyNet, a new database that addresses both derivational and inflectional morphology. Its current version covers 15 languages and has 519k derivational and 10.1M inflectional entries, as well as a rich set of features (lemma, parts of speech, morphological tags, affixes, etc.). Similarly to certain existing databases, MorphyNet was built from Wiktionary data; however, our extraction logic allows for a more exhaustive coverage of both derivational and inflectional cases.
The contributions of this paper are the freely available MorphyNet resource, the description of the data extraction logic and tool, also made freely accessible, as well as its evaluation and comparison to state-of-the-art multilingual morphological databases. Due to the limited overlap between the contents of these resources and MorphyNet, we consider it as complementary and therefore usable in combination with them.
Section 2 of the paper presents the state of the art. Section 3 gives details on our method for generating MorphyNet data. Section 4 presents the resulting resource, and Section 5 evaluates it. Section 6 concludes the paper.

State of the Art
Ever since the early days of computational linguistics, morphological analysis and its related tasks, such as stemming and lemmatization, have been part of NLP systems. Earlier grammar-based systems used finite-state transducers or affix stripping techniques, and some of them were already multilingual and capable of tackling morphologically complex languages (Beesley and Karttunen, 2003; Trón et al., 2005; Inxight, 2005). However, due to the cost of producing the grammar rules that drove them, many of these systems were only commercially available.
More recently, several projects have followed the approach of formalizing and/or integrating existing morphological data for multiple languages. UDer (Universal Derivations) (Kyjánek et al., 2020) integrates 27 derivational morphology resources in 20 languages. UniMorph (Kirov et al., 2016) and the Wikinflection Corpus (Metheniti and Neumann, 2020) rely mostly on Wiktionary, from which they extract inflectional information. Beyond the data source, however, the last two projects have little in common: UniMorph is by far the more precise and complete of the two and is used as a gold standard by the NLP community (Cotterell et al., 2017), recently covering 133 languages (McCarthy et al., 2020), while Wikinflection follows a naïve, linguistically uninformed approach of merely concatenating affixes, generating an abundance of ungrammatical word forms (e.g. for Hungarian or Finnish).
MorphyNet is also based on extracting morphological information from Wiktionary, extending the scope of UniMorph by new extraction rules and logic. The first version of MorphyNet covers 15 languages, and it is distinct from other resources in three aspects: (1) it includes both inflectional and derivational data; (2) it extracts a significantly higher number of inflections from Wiktionary; and (3) it provides a wider range of morphological information. While for the languages it covers MorphyNet can be considered a superset of UniMorph, the latter supports more languages. With UDer, as we show in Section 4, the overlap is minor on all languages. For these reasons, we consider MorphyNet as complementary to these databases, considerably enriching their coverage on the 15 supported languages but not replacing them.

MorphyNet Generation
MorphyNet is generated mainly from Wiktionary, through the following steps.
1. Filtering returns XML-based Wiktionary content from specific sections of relevant lexical entries: headword lines, etymology sections, and inflectional tables are returned for nouns, verbs, and adjectives.
2. Extraction obtains raw morphological data by parsing the sections above.
3. Enrichment algorithmically extends the coverage of derivations and inflections obtained from Wiktionary, through entirely distinct methods for inflection and derivation.
Below we explain the non-trivial Wiktionary extraction and enrichment steps, while Section 4 provides details on the generated resource itself.

Wiktionary Extraction
We extract inflectional and derivational data through hand-crafted extraction rules that target recurrent patterns in Wiktionary content, both in source markup and in HTML-rendered form.
Compared to UniMorph, which takes a similar approach and scrapes tables that provide inflectional paradigms, the scope of extraction is considerably extended, also including headword lines and etymology sections. This allows us to obtain new derivations, inflections, and features not covered by UniMorph, such as gender information or noun and adjective declensions for Catalan, French, Italian, Spanish, Russian, English, or Serbo-Croatian. Our rules target nouns, adjectives, and verbs in all languages covered.
Inflection extraction rules target two types of Wiktionary content: inflectional tables and headword lines. Inflectional tables provide conjugation and declension paradigms for a subset of verbs, nouns, and adjectives in Wiktionary. On tables, our extraction method was similar to that of UniMorph as described in (Kirov et al., 2016), with one major difference. UniMorph also extracted a large number of separate entries with modifier and auxiliary words, such as Spanish negative imperatives (no comas, no coma, no comamos etc.) or Finnish negative indicatives (en puhu, et puhu, eivät puhu etc.). MorphyNet, on the other hand, has a single entry for each distinct word form, regardless of the modifier word used. This policy had a particular impact on the size of the Finnish vocabulary.
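The single-entry policy can be illustrated with a minimal sketch. It assumes, purely for illustration, that the word form is the last whitespace-separated token of a table cell and that everything before it is a modifier or auxiliary; the function name `collapse_modifier_forms` is hypothetical and is not part of the MorphyNet tool.

```python
def collapse_modifier_forms(cells):
    """Return the distinct word forms found in inflection-table cells.

    Assumption of this sketch: the last token of a cell is the word form,
    and any preceding tokens (e.g. Spanish "no", Finnish "en"/"et") are
    modifier or auxiliary words that should not yield separate entries.
    """
    seen = set()
    forms = []
    for cell in cells:
        tokens = cell.split()
        if not tokens:
            continue  # skip empty cells
        form = tokens[-1]
        if form not in seen:
            seen.add(form)
            forms.append(form)
    return forms


spanish = collapse_modifier_forms(["no comas", "no coma", "no comamos", "comas"])
finnish = collapse_modifier_forms(["en puhu", "et puhu", "eivät puhu"])
print(spanish)  # ['comas', 'coma', 'comamos']
print(finnish)  # ['puhu']
```

The Finnish cells collapse to a single form, which is consistent with the effect on vocabulary size noted above.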
As inflectional tables are only provided by Wiktionary for 62.5% of nouns, verbs, and adjectives (computed over the 15 languages covered by MorphyNet), we extended the scope of extraction to headword lines, such as banca f (plural banche). From this headword line, we extract two entries: one stating that banca is feminine singular and a second stating that banche is feminine plural. We created specific parsing rules for nouns, verbs, and adjectives because each part of speech is described through a different set of morphological features. For example, valency (transitive or reflexive) and aspect (perfective or imperfective) are essential for verbs, while gender (masculine or feminine) and number (singular or plural) pertain to nouns and adjectives.
Derivation extraction rules were applied to the etymology sections of Wiktionary entries, where we find morphology entries such as {{suffix|en|accuse|-ation}} in the Wiktionary XML dump. After collecting all morphology entries, we applied the enrichment method to increase their coverage.
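As a rough illustration of the two kinds of rules described above, the sketch below parses the example headword line and the example suffix template with regular expressions. The patterns, function names, and output tuples are assumptions made for this example; the actual MorphyNet rules are hand-crafted per language and part of speech and are certainly more elaborate.

```python
import re

# Hypothetical pattern for noun headword lines of the form
# "<lemma> <gender> (plural <form>)", e.g. "banca f (plural banche)".
HEADWORD_RE = re.compile(
    r"^(?P<lemma>\S+)\s+(?P<gender>[mfn])\s+\(plural\s+(?P<plural>\S+)\)$"
)
GENDERS = {"m": "masculine", "f": "feminine", "n": "neuter"}

# Pattern for the three-argument form of the suffix template only.
SUFFIX_RE = re.compile(
    r"\{\{suffix\|(?P<lang>[^|}]+)\|(?P<base>[^|}]+)\|(?P<affix>[^|}]+)\}\}"
)


def parse_noun_headword(line):
    """Return (lemma, form, features) entries for one headword line."""
    m = HEADWORD_RE.match(line.strip())
    if m is None:
        return []
    gender = GENDERS[m["gender"]]
    return [
        (m["lemma"], m["lemma"], {gender, "singular"}),
        (m["lemma"], m["plural"], {gender, "plural"}),
    ]


def parse_suffix_template(text):
    """Return (language, base word, affix) or None if no template is found."""
    m = SUFFIX_RE.search(text)
    if m is None:
        return None
    return m["lang"], m["base"], m["affix"]


print(parse_noun_headword("banca f (plural banche)"))
print(parse_suffix_template("{{suffix|en|accuse|-ation}}"))
```

In this sketch the derived word itself is not returned by `parse_suffix_template`; it is assumed to be the title of the entry whose etymology section contains the template.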

Derivation Enrichment
Derivation enrichment is based on a linguistically informed cross-lingual generalization of derivational patterns observed in Wiktionary data, in order to extend the coverage of derivational data.
In the example shown in Figure 2, Wiktionary contains the Portuguese derivation competir (to compete) → competição (competition) but not acusar (to accuse) → acusação (accusation). An indiscriminate application of the suffix -ção to all verbs would, of course, generate many false positives, such as chegar (to arrive) ↛ *chegação. Even when the target word does exist, the inferred derivation is often false, as in the case of corar (to blush) ↛ coração (heart). A counter-example from English could be jewel + -ery → jewellery but gal + -ery ↛ gallery.
For this reason, we use stronger cross-lingual derivational evidence to induce the applicability of the affix. In the example above, the existence of the English derivation accuse → accusation, where the meanings of the English and the corresponding Portuguese words are the same, serves as a strong hint for the applicability of the Portuguese pattern.
This intuition is formalized in MorphyNet as follows: if in language $A$ a derivation from source word $w^A_s$ to target word $w^A_t$ through the affix $a^A$ is not explicitly asserted (e.g. by Wiktionary) but it is asserted for the corresponding cognates in at least one language $B$, then we infer its existence:

$$\mathrm{cog}(w^A_s, w^B_s) \land \mathrm{cog}(w^A_t, w^B_t) \land \mathrm{cog}(a^A, a^B) \land \mathrm{der}(w^B_s, a^B) = w^B_t \;\Rightarrow\; \mathrm{der}(w^A_s, a^A) = w^A_t$$

where $\mathrm{cog}(x, y)$ means that the words $x$ and $y$ are cognates and $\mathrm{der}(b, a) = d$ that word $d$ is derived from base word $b$ and affix $a$. In our example, $A$ = Portuguese, $B$ = English, $w^A_s$ = acusar, $w^B_s$ = accuse, $w^A_t$ = acusação, $w^B_t$ = accusation, $a^A$ = -ção, and $a^B$ = -tion.
As shown in Figure 1, we exploited a cognate database, CogNet (Batsuren et al., 2019, 2021; see footnote 4), that has 8.1M cognate pairs, for evidence on cognacy: $\mathrm{cog}(w^A, w^B) = \mathrm{True}$ is asserted by the presence of the word pair in CogNet.
The result of enrichment was a total increase of 25.6% of the number of derivations in MorphyNet. Efficiency varies among languages, essentially depending on the completeness of the Wiktionary coverage: it was the lowest for English with 3% and the highest for Spanish with 57%.
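The inference rule can be sketched on toy data as follows. This is a minimal illustration under stated assumptions, not the MorphyNet implementation: derivations are represented as (language, source, affix, target) tuples, the cognate database is replaced by a small in-memory set of pairs, the correspondence between affixes is looked up in that same set, and the candidate derivations to test are supplied by hand. All names are hypothetical.

```python
def is_cognate(cognates, x, y):
    """True if the unordered pair {x, y} is present in the toy cognate set."""
    return frozenset((x, y)) in cognates


def infer_derivations(asserted, cognates, candidates):
    """Return candidates licensed by a cognate derivation in another language.

    asserted:   set of (lang, source, affix, target) derivations already known
    cognates:   set of frozenset pairs of (lang, item) tuples (toy stand-in
                for a cognate database)
    candidates: iterable of (lang, source, affix, target) to be tested
    """
    inferred = []
    for lang_a, src_a, aff_a, tgt_a in candidates:
        if (lang_a, src_a, aff_a, tgt_a) in asserted:
            continue  # already explicitly asserted, nothing to infer
        for lang_b, src_b, aff_b, tgt_b in asserted:
            if lang_b == lang_a:
                continue  # evidence must come from a different language
            if (is_cognate(cognates, (lang_a, src_a), (lang_b, src_b))
                    and is_cognate(cognates, (lang_a, tgt_a), (lang_b, tgt_b))
                    and is_cognate(cognates, (lang_a, aff_a), (lang_b, aff_b))):
                inferred.append((lang_a, src_a, aff_a, tgt_a))
                break
    return inferred


asserted = {
    ("en", "accuse", "-tion", "accusation"),
    ("pt", "competir", "-ção", "competição"),
}
cognates = {
    frozenset({("pt", "acusar"), ("en", "accuse")}),
    frozenset({("pt", "acusação"), ("en", "accusation")}),
    frozenset({("pt", "-ção"), ("en", "-tion")}),
}
candidates = [
    ("pt", "acusar", "-ção", "acusação"),
    ("pt", "chegar", "-ção", "chegação"),
    ("pt", "corar", "-ção", "coração"),
]
inferred = infer_derivations(asserted, cognates, candidates)
print(inferred)  # [('pt', 'acusar', '-ção', 'acusação')]
```

Only acusar → acusação is accepted; chegar and corar are rejected because no cognate derivation supports them in the toy data.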

Inflection Enrichment
The enrichment of inflectional data is based on the simple observation that Wiktionary does not provide the root word for all inflected forms. For example, for the Hungarian múltjával (with his/her/its past), Wiktionary provides the inflection múltja → múltjával (his/her/its past + instrumental). For múltja, in turn, it provides múlt → múltja (past + possessive). It does not, however, directly provide the combination of the two inflections: múlt → múltjával (past + possessive + instrumental). Inflection enrichment consists of inferring such missing rules from the existing data.
4 http://github.com/kbatsuren/CogNet
The case above is formalized as follows: if, after the Wiktionary extraction phase, the MorphyNet data contains the inflections $w_r \rightarrow w_1$ (with feature set $F_1$) as well as $w_1 \rightarrow w_2$ (with feature set $F_2$), then we create the new inflection $w_r \rightarrow w_2$ with feature set $F_1 \cup F_2$.
The application of this logic increased the inflectional coverage of MorphyNet by 10.8% and its recall (with respect to ground truth data presented in section 5) by 8.2% on average.
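The composition rule can be sketched in a few lines. The dictionary representation, the feature names, and the function name below are assumptions made for illustration; the sketch performs a single composition pass, whereas longer inflection chains would require repeating it until no new entries appear.

```python
def enrich_inflections(inflections):
    """Compose chained inflections.

    inflections maps (source, target) pairs to frozensets of features.
    For every pair of entries w_r -> w_1 (features F1) and w_1 -> w_2
    (features F2), a new entry w_r -> w_2 with features F1 | F2 is
    returned, unless w_r -> w_2 is already present.
    """
    new_entries = {}
    for (w_r, w_1), f1 in inflections.items():
        for (source, w_2), f2 in inflections.items():
            if source == w_1 and (w_r, w_2) not in inflections:
                new_entries[(w_r, w_2)] = f1 | f2
    return new_entries


extracted = {
    ("múlt", "múltja"): frozenset({"possessive"}),
    ("múltja", "múltjával"): frozenset({"instrumental"}),
}
added = enrich_inflections(extracted)
print(added)
```

For the Hungarian example, the only new entry is múlt → múltjával, carrying both the possessive and the instrumental feature.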

The MorphyNet Resource
MorphyNet is freely available for download, both as text files containing the data and as the source code of the Wiktionary extractor. Two text files are provided per language: one for inflections and one for derivations. The structure of the two types of files is illustrated in Tables 1 and 2, respectively. As shown, MorphyNet covers all data fields provided by UniMorph for inflections and by UDer for derivations. In addition, it extends UniMorph by indicating the affix and the immediate source word that produced the inflection. Such information is useful, for example, to NLP applications that rely on subword information for understanding out-of-vocabulary words. MorphyNet also extends the UDer structure by indicating the affix and the semantic category for the target word when it can be inferred from the morpheme. Such information is again useful for subword regularization of derivationally rich languages, such as English.
Table 4 provides per-language statistics on MorphyNet data. The present version of the resource contains 10.6 million entries, of which 95% are inflections. Highly inflecting and agglutinative languages dominate the resource, as 55% of all entries belong to Finnish, Hungarian, Russian, and Serbo-Croatian. Language coverage depends above all on the completeness of Wiktionary, the main source of our data.

Evaluation
We evaluated MorphyNet through two different methods: (1) through comparison to ground truth and (2) through manual validation by experts.
Comparison to ground truth. Evaluating the quality of a morphological database is a challenging task because of the many idiosyncratic morphological phenomena of the languages evaluated (Gorman et al., 2019). As ground truth on inflections we used the Universal Dependencies dataset (Nivre et al., 2016, 2017), which (among others) provides morphological analysis of inflected words over a multilingual corpus of hand-annotated sentences. A Python tool has been built to convert these treebanks into the UniMorph schema (Sylak-Glassman, 2016). We evaluated both UniMorph 2.0 and MorphyNet against this data (performing the necessary mapping of feature tags beforehand) over the 11 languages in the intersection of the two resources: Hungarian (Vincze et al., 2010), Catalan, Spanish (Taulé et al., 2008), Czech (Bejček et al., 2013), Finnish (Pyysalo et al., 2015), Russian (Lyashevskaya et al., 2016), Serbo-Croatian, French (Guillaume et al., 2019), Italian (Bosco et al., 2013), Swedish (Nivre and Megyesi, 2007), and English (Silveira et al., 2014).
Table 5 contains evaluation results over nouns, verbs, and adjectives separately, as well as totals per language. Missing data points (e.g. for Catalan nouns) indicate that UniMorph did not have any corresponding inflections. For languages and parts of speech where both resources provide data, MorphyNet provides higher recall, with the exception of Finnish, which is explained by our policy of not extracting conjugations with auxiliary and modifier words as separate entries (see Section 3.1). Overall, as shown in Table 4, MorphyNet contains about 47% more entries over the 11 languages where it overlaps with UniMorph. In terms of precision, the two resources are comparable, except for Finnish (adjectives) and Swedish (adjectives and verbs), where MorphyNet appears to be significantly more precise.
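One plausible way to compute such figures is sketched below. The exact protocol is not spelled out in the text, so the following is an assumption: entries are (lemma, form, features) triples, recall is the share of gold triples found in the resource, and precision is measured only over resource entries whose word form occurs in the gold data, since a corpus cannot judge forms it does not contain. The data and tag names are invented for the example.

```python
def evaluate(resource, gold):
    """Return (precision, recall) of a resource against gold triples.

    Both arguments are sets of (lemma, form, frozenset_of_features).
    Precision is restricted to resource entries whose word form occurs
    in the gold data (an assumption of this sketch).
    """
    gold_forms = {form for _, form, _ in gold}
    comparable = {entry for entry in resource if entry[1] in gold_forms}
    correct = comparable & gold
    precision = len(correct) / len(comparable) if comparable else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall


gold = {
    ("banca", "banca", frozenset({"N", "SG"})),
    ("banca", "banche", frozenset({"N", "PL"})),
    ("alto", "alta", frozenset({"ADJ", "FEM", "SG"})),
    ("parlare", "parlo", frozenset({"V", "PRS", "1", "SG"})),
}
resource = {
    ("banca", "banche", frozenset({"N", "PL"})),  # matches the gold analysis
    ("banca", "banca", frozenset({"N", "PL"})),   # wrong number feature
    ("casa", "case", frozenset({"N", "PL"})),     # form absent from gold
}
precision, recall = evaluate(resource, gold)
print(precision, recall)  # 0.5 0.25
```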
UDer (Kyjánek et al., 2020) is a collection of individual monolingual resources of derivational morphology. Most of them have been carefully evaluated against their own datasets and offer high quality. We evaluated MorphyNet derivational data against UDer over the nine languages covered by both resources: French (Hathout and Namer, 2014), Portuguese (de Paiva et al., 2014), Czech (Vidra et al., 2019), German (Zeller et al., 2013), Russian (Vodolazsky, 2020), Italian (Talamo et al., 2016), Finnish (Lindén and Carlson, 2010;Lindén et al., 2012), Latin (Litta et al., 2016), and English (Habash and Dorr, 2003). Statistics and results are shown in Table 6. First of all, the overlap between MorphyNet and UDer is small, which is visible from our recall values relative to UDer that vary between 0.6% (Czech) and 59.5% (Italian). Among the languages evaluated, six were better covered by MorphyNet and the remaining three (Czech, German, and Russian) by UDer. The agreement between the two resources, computed as Cohen's Kappa, was 0.85 overall, varying between 0.74 (Finnish) and 0.97 (Portuguese). If we consider UDer as gold standard, we obtain precision figures between 87% and 99%.
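For reference, Cohen's Kappa corrects the observed agreement $p_o$ between two raters for the agreement $p_e$ expected by chance: $\kappa = (p_o - p_e) / (1 - p_e)$. The sketch below implements this standard formula; how the two resources were turned into paired judgments is not detailed here, so the toy labels (1 if a resource accepts a candidate derivation, 0 otherwise) are an assumption.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two equally long sequences of categorical labels."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("label sequences must be non-empty and of equal length")
    # Observed agreement: share of items on which both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both raters always use the same single label
    return (observed - expected) / (1 - expected)


# Toy judgments over four candidate derivations (hypothetical data).
resource_a = [1, 1, 0, 0]
resource_b = [1, 1, 0, 1]
kappa = cohens_kappa(resource_a, resource_b)
print(kappa)  # 0.5
```

Here the raters agree on three of four items ($p_o = 0.75$) while chance agreement is $p_e = 0.5$, giving $\kappa = 0.5$.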
Manual evaluation was carried out by language experts over sample data from five languages: English, Italian, French, Hungarian, and Mongolian. The sample consisted of 1,000 randomly selected entries per language, half of them inflectional and the other half derivational. The experts were asked to validate the correctness of source-target word pairs, of morphemes, as well as inflectional features and parts of speech (the latter for derivations). Table 7 shows detailed results. The overall precision is 98.9%, with per-language values varying between 98.2% (Hungarian) and 99.5% (English). These results indicate both the high quality of Wiktionary data and the general correctness of the data extraction and enrichment logic of MorphyNet. A manual check of the incorrect entries revealed that most of them were caused by extraction rules failing on occasional deviations of Wiktionary from its own conventions.

Conclusions and Future Work
We consider the resource released and described here as an initial work-in-progress version that we plan to extend and improve. We are currently working on increasing the coverage to 20 languages. We also plan to extend MorphyNet data with additional features and the semantic categories of words (e.g. animate or inanimate object, action) inferred from derivations. We are planning to conduct a more in-depth study of our evaluation results, especially with respect to UDer, where it is not yet clear whether the occasional lower precision figures (87% for Finnish, 88% for Russian) are due to mistakes in MorphyNet, in the UDer resources, or are caused by other factors.
A major piece of ongoing work concerns the representation of MorphyNet derivational data as a lexico-semantic graph, as it is done in wordnets (Miller, 1998;Giunchiglia et al., 2017) where derivationally related word senses are interconnected by associative relationships. This effort, justifying the -Net in the name of our resource, will allow us to address completeness issues in existing wordnets by extending them by morphological relations and derived words.
We are happy to offer the MorphyNet extraction logic to be reused on a community basis. As extending the tool with new Wiktionary extraction rules is straightforward, we hope that the availability of the tool will allow language coverage to grow even further. We also hope that the MorphyNet data and the extraction logic can serve existing high-quality projects such as UniMorph and UDer.