Jonáš Vidra

2022

Our work aims at developing a multilingual data resource for morphological segmentation. We present a survey of 17 existing data resources relevant for segmentation in 32 languages, and analyze how differently individual linguistic phenomena are captured across them. Inspired by the success of Universal Dependencies, we propose a harmonized scheme for representing segmentation, and convert the data from the studied resources into this common scheme. Harmonized versions of resources available under free licenses are published as a collection called UniSegments 1.0.

2019

Supervised Morphological Segmentation Using Rich Annotated Lexicon
Ebrahim Ansari | Zdeněk Žabokrtský | Mohammad Mahmoudi | Hamid Haghdoost | Jonáš Vidra
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

Morphological segmentation of words is the process of dividing a word into smaller units called morphemes; it is especially difficult for morphologically rich or polysynthetic languages. In this work, we designed and evaluated several Recurrent Neural Network (RNN) based models as well as various other machine learning based approaches for the morphological segmentation task. We trained our models using annotated segmentation lexicons. To evaluate the effect of the training data size on our models, we created a large hand-annotated morphologically segmented corpus of Persian words, which is, to the best of our knowledge, the first and only segmentation lexicon for the Persian language. In the experimental phase, using the hand-annotated Persian lexicon and two smaller similar lexicons for Czech and Finnish, we evaluated the effect of the training data size, different hyper-parameter settings, and different RNN-based models.
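
As an illustration of what a supervised segmenter can learn from an annotated lexicon, the sketch below converts a segmented entry into per-character boundary labels and back. This framing is an assumption made here for illustration; the abstract does not state how the models encode their input or output, and the helper names `to_labels` and `from_labels`, the `-` separator, and the English example word are all hypothetical.

```python
def to_labels(segmented, sep="-"):
    """Turn a segmented entry into (word, labels).

    labels[i] is 1 if a morpheme boundary follows character i, else 0.
    Assumes every morpheme is non-empty.
    """
    morphs = segmented.split(sep)
    word = "".join(morphs)
    labels = []
    for morph in morphs:
        # Every character but the last of a morpheme is boundary-free.
        labels.extend([0] * (len(morph) - 1) + [1])
    labels[-1] = 0  # the end of the word is not an internal boundary
    return word, labels


def from_labels(word, labels, sep="-"):
    """Rebuild the segmented form from a word and its boundary labels."""
    pieces = []
    for char, boundary in zip(word, labels):
        pieces.append(char)
        if boundary:
            pieces.append(sep)
    return "".join(pieces)


# Hypothetical lexicon entry, not taken from the Persian, Czech, or Finnish data.
word, labels = to_labels("un-break-able")
restored = from_labels(word, labels)
```

With this encoding, segmentation becomes a per-character binary labeling problem, which is one of several ways a sequence model could be trained on such a lexicon.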

Derivational Morphological Relations in Word Embeddings
Tomáš Musil | Jonáš Vidra | David Mareček
Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP

Derivation is a type of word-formation process which creates new words from existing ones by adding, changing or deleting affixes. In this paper, we explore the potential of word embeddings to identify properties of word derivations in the morphologically rich Czech language. We extract derivational relations between pairs of words from DeriNet, a Czech lexical network, which organizes almost one million Czech lemmas into derivational trees. For each such pair, we compute the difference of the embeddings of the two words, and perform unsupervised clustering of the resulting vectors. Our results show that these clusters largely match manually annotated semantic categories of the derivational relations (e.g. the relation ‘bake–baker’ belongs to category ‘actor’, and a correct clustering puts it into the same cluster as ‘govern–governor’).
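
The pipeline described above (embedding difference per derivational pair, then unsupervised clustering) can be sketched on toy data. Everything in this sketch is an assumption for illustration: the two-dimensional vectors are invented, the English word pairs stand in for Czech DeriNet pairs, and plain k-means is used only because the abstract does not name the clustering algorithm.

```python
def offset(emb, base, derived):
    """Difference vector: embedding(derived) - embedding(base)."""
    return [d - b for b, d in zip(emb[base], emb[derived])]


def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def kmeans(points, k, iterations=10):
    """Plain k-means; the first k points serve as initial centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = mean(members)
    return labels


# Invented toy embeddings; real ones would come from a trained model.
EMB = {
    "bake": [1.0, 0.0], "baker": [1.0, 1.0],
    "govern": [3.0, 0.2], "governor": [3.0, 1.1],
    "happy": [0.0, 2.0], "unhappy": [-1.0, 2.0],
    "kind": [0.5, 3.0], "unkind": [-0.6, 3.0],
}
PAIRS = [
    ("bake", "baker"), ("happy", "unhappy"),
    ("govern", "governor"), ("kind", "unkind"),
]

offsets = [offset(EMB, base, derived) for base, derived in PAIRS]
labels = kmeans(offsets, k=2)
```

On this toy data the two agent-noun pairs share one cluster and the two negation pairs share the other, which mirrors the kind of agreement with semantic categories the abstract reports.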

DeriNet 2.0: Towards an All-in-One Word-Formation Resource
Jonáš Vidra | Zdeněk Žabokrtský | Magda Ševčíková | Lukáš Kyjánek
Proceedings of the Second International Workshop on Resources and Tools for Derivational Morphology

Universal Derivations Kickoff: A Collection of Harmonized Derivational Resources for Eleven Languages
Lukáš Kyjánek | Zdeněk Žabokrtský | Magda Ševčíková | Jonáš Vidra
Proceedings of the Second International Workshop on Resources and Tools for Derivational Morphology

2016

Merging Data Resources for Inflectional and Derivational Morphology in Czech
Zdeněk Žabokrtský | Magda Ševčíková | Milan Straka | Jonáš Vidra | Adéla Limburská
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The paper deals with merging two complementary resources of morphological data previously existing for Czech, namely the inflectional dictionary MorfFlex CZ and the recently developed lexical network DeriNet. The MorfFlex CZ dictionary has been used by a morphological analyzer capable of analyzing/generating several million Czech word forms according to the rules of Czech inflection. The DeriNet network contains several hundred thousand Czech lemmas interconnected with links corresponding to derivational relations (relations between base words and words derived from them). After summarizing basic characteristics of both resources, the process of merging is described, focusing on both rather technical aspects (growth of the data, measuring the quality of newly added derivational relations) and linguistic issues (treating lexical homonymy and vowel/consonant alternations). The resulting resource contains 970 thousand lemmas connected with 715 thousand derivational relations and is publicly available on the web under the CC-BY-NC-SA license. The data were incorporated in the MorphoDiTa library version 2.0 (which provides morphological analysis, generation, tagging and lemmatization for Czech) and can be browsed and searched by two web tools (DeriNet Viewer and DeriNet Search tool).
