In Universal Dependencies, compounds, which we understand as words containing two or more roots, are represented according to tokenization, which reflects the orthographic conventions of the language. A closed compound (e.g. waterfall) corresponds to a single word in Universal Dependencies while a hyphenated compound (father-in-law) and an open compound (apple pie) to multiple words. The aim of this paper is to open a discussion on how to move towards a more consistent annotation of compounds.The solution we argue for is to represent the internal structure of all compound types analogously to syntactic phrases, which would not only increase the comparability of compounding within and across languages, but also allow comparisons of compounds and syntactic phrases.
We present PaReNT (Parent Retrieval Neural Tool), a deep-learning-based multilingual tool performing retrieval and word formation classification in English, German, Dutch, Spanish, French, Russian, and Czech. Parent retrieval refers to determining the lexeme or lexemes the input lexeme was based on (e.g. “darkness” is traced back to “dark”; “waterfall” decomposes into “water” and “fall”). Additionally, PaReNT performs word formation classification, which determines the input lexeme as a compound e.g. “proofread”, a derivative (e.g. “deescalate”) or as an unmotivated word (e.g. “dog”). These seven languages are selected from three major branches of the Indo-European language family (Germanic, Romance, Slavic). Data is aggregated from a range of word-formation resources, as well as Wiktionary, to train and test the tool. The tool is based on a custom-architecture hybrid transformer block-enriched sequence-to-sequence neural network utilizing both a character-based and semantic representation of the input lexemes, with two output modules - one decoder-based dedicated to parent retrieval, and one classifier-based for word formation classification. PaReNT achieves a mean accuracy of 0.62 in parent retrieval and a mean balanced accuracy of 0.74 in word formation classification.
The SIGMORPHON 2022 shared task on morpheme segmentation challenged systems to decompose a word into a sequence of morphemes and covered most types of morphology: compounds, derivations, and inflections. Subtask 1, word-level morpheme segmentation, covered 5 million words in 9 languages (Czech, English, Spanish, Hungarian, French, Italian, Russian, Latin, Mongolian) and received 13 system submissions from 7 teams and the best system averaged 97.29% F1 score across all languages, ranging English (93.84%) to Latin (99.38%). Subtask 2, sentence-level morpheme segmentation, covered 18,735 sentences in 3 languages (Czech, English, Mongolian), received 10 system submissions from 3 teams, and the best systems outperformed all three state-of-the-art subword tokenization methods (BPE, ULM, Morfessor2) by 30.71% absolute. To facilitate error analysis and support any type of future studies, we released all system predictions, the evaluation script, and all gold standard datasets.
Our work aims at developing a multilingual data resource for morphological segmentation. We present a survey of 17 existing data resources relevant for segmentation in 32 languages, and analyze diversity of how individual linguistic phenomena are captured across them. Inspired by the success of Universal Dependencies, we propose a harmonized scheme for segmentation representation, and convert the data from the studied resources into this common scheme. Harmonized versions of resources available under free licenses are published as a collection called UniSegments 1.0.
This article gives an overview of how sentence meaning is represented in eleven deep-syntactic frameworks, ranging from those based on linguistic theories elaborated for decades to rather lightweight NLP-motivated approaches. We outline the most important characteristics of each framework and then discuss how particular language phenomena are treated across those frameworks, while trying to shed light on commonalities as well as differences.
The paper deals with merging two complementary resources of morphological data previously existing for Czech, namely the inflectional dictionary MorfFlex CZ and the recently developed lexical network DeriNet. The MorfFlex CZ dictionary has been used by a morphological analyzer capable of analyzing/generating several million Czech word forms according to the rules of Czech inflection. The DeriNet network contains several hundred thousand Czech lemmas interconnected with links corresponding to derivational relations (relations between base words and words derived from them). After summarizing basic characteristics of both resources, the process of merging is described, focusing on both rather technical aspects (growth of the data, measuring the quality of newly added derivational relations) and linguistic issues (treating lexical homonymy and vowel/consonant alternations). The resulting resource contains 970 thousand lemmas connected with 715 thousand derivational relations and is publicly available on the web under the CC-BY-NC-SA license. The data were incorporated in the MorphoDiTa library version 2.0 (which provides morphological analysis, generation, tagging and lemmatization for Czech) and can be browsed and searched by two web tools (DeriNet Viewer and DeriNet Search tool).
In the present paper, we describe the development of the lexical network DeriNet, which captures core word-formation relations on the set of around 266 thousand Czech lexemes. The network is currently limited to derivational relations because derivation is the most frequent and most productive word-formation process in Czech. This limitation is reflected in the architecture of the network: each lexeme is allowed to be linked up with just a single base word; composition as well as combined processes (composition with derivation) are thus not included. After a brief summarization of theoretical descriptions of Czech derivation and the state of the art of NLP approaches to Czech derivation, we discuss the linguistic background of the network and introduce the formal structure of the network and the semi-automatic annotation procedure. The network was initialized with a set of lexemes whose existence was supported by corpus evidence. Derivational links were created using three sources of information: links delivered by a tool for morphological analysis, links based on an automatically discovered set of derivation rules, and on a grammar-based set of rules. Finally, we propose some research topics which could profit from the existence of such lexical network.
Meanings of morphological categories are an indispensable component of representation of sentence semantics. In the Prague Dependency Treebank 2.0, sentence semantics is represented as a dependency tree consisting of labeled nodes and edges. Meanings of morphological categories are captured as attributes of tree nodes; these attributes are called grammatemes. The present paper focuses on morphological meanings of verbs, i.e. on meanings of the morphological category of tense, mood, aspect etc. After several introductory remarks, seven verbal grammatemes used in the PDT 2.0 annotation scenario are briefly introduced. After that, each of the grammatemes is examined. Three verbal grammatemes of the original set were included in the new set without changes, one of the grammatemes was extended, and three of them were substituted for three new ones. The revised grammateme set is to be included in the forthcoming version of PDT (tentatively called PDT 3.0). Rules for automatic and manual assignment of the revised grammatemes are further discussed in the paper.