Maxim Ionov

2025

A Lightweight String Based Method of Encoding Etymologies in Linked Data Lexical Resources
Anas Fahad Khan | Maxim Ionov | Paola Marongiu | Ana Salgado
Proceedings of the 5th Conference on Language, Data and Knowledge: The 5th OntoLex Workshop

In this submission we propose an approach to encoding etymological information as strings (“etymology strings”). We begin by discussing the advantages of such an approach with respect to one in which etymologies and etymons are explicitly represented as RDF individuals. Next we give a formal description of the regular language underlying our approach as an Extended Backus-Naur Form grammar (EBNF). We use the Chamuça Hindi lexicon as a test case for our approach and show some of the kinds of SPARQL queries which can be made using etymological strings.

Towards Multilingual Haikus: Representing Accentuation to Build Poems
Fernando Bobillo | Maxim Ionov | Eduardo Mena | Carlos Bobed
Proceedings of the 5th Conference on Language, Data and Knowledge

The paradigm of neuro-symbolic Artificial Intelligence has received increasing attention in recent years as a way to improve the results of intelligent systems by combining symbolic and subsymbolic methods. For example, existing Large Language Models (LLMs) could be enriched by taking into account background knowledge encoded using semantic technologies, such as Linguistic Linked Data (LLD). In this paper, we claim that LLD can aid LLMs by providing the information necessary to compute the number of poetic syllables, which would help LLMs to correctly generate poems with a valid metre. To do so, we propose an encoding for syllabic structure based on an extension of RDF vocabularies widely used in the field: POSTDATA and OntoLex-Lemon.

Addressing Variability in Interlinear Glossed Texts with Linguistic Linked Data
Maxim Ionov | Natalia Patiño Mazzotti
Proceedings of the 19th Linguistic Annotation Workshop (LAW-XIX-2025)

In this paper, we identify types of uncertainty in interlinear glossed text (IGT) annotation, a common notation for language data in linguistic research.

Ligt: Towards an Ecosystem for Managing Interlinear Glossed Texts with Linguistic Linked Data
Maxim Ionov
Proceedings of the 5th Conference on Language, Data and Knowledge

Ligt is an RDF vocabulary developed for representing Interlinear Glossed Text, a common representation of language material used in particular in field linguistics and linguistic typology. In this paper, we look at its current status and different aspects of its adoption. More specifically, we explore the questions of data conversion, storage, and exploitation. We present ligttools, a set of newly developed converters, report on a series of experiments regarding querying Ligt datasets, and analyse the performance with various infrastructure configurations.

Proceedings of the 5th Conference on Language, Data and Knowledge: Fifth Workshop on Language Technology for Equality, Diversity, Inclusion
Katerina Gkirtzou | Slavko Žitnik | Jorge Gracia | Dagmar Gromann | Maria Pia di Buono | Johanna Monti | Maxim Ionov
Proceedings of the 5th Conference on Language, Data and Knowledge: Fifth Workshop on Language Technology for Equality, Diversity, Inclusion

2024

Bridging Computational Lexicography and Corpus Linguistics: A Query Extension for OntoLex-FrAC
Christian Chiarcos | Ranka Stanković | Maxim Ionov | Gilles Sérasset
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

OntoLex, the dominant community standard for machine-readable lexical resources in the context of RDF, Linked Data and Semantic Web technologies, is currently being extended with a designated module for Frequency, Attestations and Corpus-based Information (OntoLex-FrAC). We propose a novel component for OntoLex-FrAC, addressing the incorporation of corpus queries for (a) linking dictionaries with corpus engines, (b) enabling RDF-based web services to exchange corpus queries and response data dynamically, and (c) using conventional query languages to formalize the internal structure of collocations, word sketches, and colligations. The primary field of application of the query extension is digital lexicography and corpus linguistics, and we present a proof-of-principle implementation in backend components of a novel platform designed to support digital lexicography for the Serbian language.

OntoLex Publication Made Easy: A Dataset of Verbal Aspectual Pairs for Bosnian, Croatian and Serbian
Ranka Stanković | Maxim Ionov | Medina Bajtarević | Lorena Ninčević
Proceedings of the 9th Workshop on Linked Data in Linguistics @ LREC-COLING 2024

This paper introduces a novel language resource for retrieving and researching verbal aspectual pairs in BCS (Bosnian, Croatian, and Serbian) created using Linguistic Linked Open Data (LLOD) principles. As there is no resource to help learners of Bosnian, Croatian, and Serbian as foreign languages to recognize the aspect of a verb or its pairs, we have created a new resource that will provide users with information about the aspect, as well as the link to a verb’s aspectual counterparts. This resource also contains external links to monolingual dictionaries, Wordnet, and BabelNet. As this is a work in progress, our resource only includes verbs and their perfective pairs formed with prefixes “pro”, “od”, “ot”, “iz”, “is” and “na”. The goal of this project is to have a complete dataset of all the aspectual pairs in these three languages. We believe it will be useful for research in the field of aspectology, as well as machine translation and other NLP tasks. Using this resource as an example, we also propose a sustainable approach to publishing small to moderate LLOD resources on the Web, both in a user-friendly way and according to the Linked Data principles.

On Modelling Corpus Citations in Computational Lexical Resources
Fahad Khan | Maxim Ionov | Christian Chiarcos | Laurent Romary | Gilles Sérasset | Besim Kabashi
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

In this article we look at how two different standards for lexical resources, TEI and OntoLex, deal with corpus citations in lexicons. We will focus on how corpus citations in retrodigitised dictionaries can be modelled using each of the two standards since this provides us with a suitably challenging use case. After looking at the structure of an example entry from a legacy dictionary, we examine the two approaches offered by the two different standards by outlining an encoding for the example entry using both of them (note that this article features the first extended discussion of how the Frequency Attestation and Corpus (FrAC) module of OntoLex deals with citations). After comparing the two approaches and looking at the advantages and disadvantages of both, we argue for a combination of both. In the last part of the article we discuss different ways of doing this, giving our preference for a strategy which makes use of RDFa.

Linguistic LOD for Interoperable Morphological Description
Michael Rosner | Maxim Ionov
Proceedings of the 9th Workshop on Linked Data in Linguistics @ LREC-COLING 2024

Interoperability is a characteristic of a product or system that works seamlessly with another product or system, and it implies a certain level of independence from the context of use. Turning to language resources, interoperability is frequently cited as one important rationale underlying the use of LLOD representations and is generally regarded as highly desirable. In this paper we elaborate on this theme, distinguishing three different kinds of interoperability and providing practical implementations with examples from morphology.

2023

Beyond Concatenative Morphology: Applying OntoLex-Morph to Maltese
Maxim Ionov | Mike Rosner
Proceedings of the 4th Conference on Language, Data and Knowledge

2022

Unifying Morphology Resources with OntoLex-Morph. A Case Study in German
Christian Chiarcos | Christian Fäth | Maxim Ionov
Proceedings of the Thirteenth Language Resources and Evaluation Conference

The OntoLex vocabulary has become a widely used community standard for machine-readable lexical resources on the web. The primary motivation to use OntoLex rather than tool- or application-specific formalisms is to facilitate interoperability and information integration across different resources. One of its extensions currently under development is a module for representing morphology, OntoLex-Morph. In this paper, we show how OntoLex-Morph can be used for the encoding and integration of different types of morphological resources on a unified basis. With German as the example, we demonstrate it for (a) a full-form dictionary with inflection information (Unimorph), (b) a dictionary of base forms and their derivations (UDer), (c) a dictionary of compounds (from GermaNet), and (d) lexicon and inflection rules of a finite-state parser/generator (SMOR/Morphisto). These data are converted to OntoLex-Morph, their linguistic information is consolidated and corresponding lexical entries are linked with each other.

Querying a Dozen Corpora and a Thousand Years with Fintan
Christian Chiarcos | Christian Fäth | Maxim Ionov
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Large-scale diachronic corpus studies covering longer time periods are difficult if more than one corpus is to be consulted and, as a result, different formats and annotation schemas need to be processed and queried in a uniform, comparable and replicable manner. We describe the application of the Flexible Integrated Transformation and Annotation eNgineering (Fintan) platform for studying word order in German using syntactically annotated corpora that represent its entire written history. Focusing on nominal dative and accusative arguments, this study hints at two major phases in the development of scrambling in modern German. Against more recent assumptions, it supports the traditional view that word order flexibility decreased over time, but it also indicates that this was a relatively sharp transition in Early New High German. The successful case study demonstrates the potential of Fintan and the underlying LLOD technology for historical linguistics, linguistic typology and corpus linguistics. The technological contribution of this paper is to demonstrate the applicability of Fintan for querying across heterogeneously annotated corpora, as it had previously been applied only to transformation tasks. With its focus on quantitative analysis, Fintan is a natural complement to existing multi-layer technologies that focus on query and exploration.

Modelling Collocations in OntoLex-FrAC
Christian Chiarcos | Katerina Gkirtzou | Maxim Ionov | Besim Kabashi | Fahad Khan | Ciprian-Octavian Truică
Proceedings of Globalex Workshop on Linked Lexicography within the 13th Language Resources and Evaluation Conference

Following presentations of frequency and attestations, and embeddings and distributional similarity, this paper introduces the third cornerstone of the emerging OntoLex module for Frequency, Attestation and Corpus-based Information, OntoLex-FrAC. We provide an RDF vocabulary for collocations, established as a consensus over contributions from five different institutions and numerous data sets, with the goal of eliciting feedback from reviewers, workshop audience and the scientific community in preparation for the final consolidation of the OntoLex-FrAC module, whose publication as a W3C community report is foreseen for the end of this year. The novel collocation component of OntoLex-FrAC is described in application to a lexicographic resource and corpus-based collocation scores available from the web. Finally, we demonstrate the capability and genericity of the model by showing how to retrieve and aggregate collocation information by means of SPARQL and how to export it to a tabular format, so that it can be easily processed in downstream applications.

Proceedings of the 8th Workshop on Linked Data in Linguistics within the 13th Language Resources and Evaluation Conference
Thierry Declerck | John P. McCrae | Elena Montiel | Christian Chiarcos | Maxim Ionov
Proceedings of the 8th Workshop on Linked Data in Linguistics within the 13th Language Resources and Evaluation Conference

2021

Embeddings for the Lexicon: Modelling and Representation
Christian Chiarcos | Thierry Declerck | Maxim Ionov
Proceedings of the 6th Workshop on Semantic Deep Learning (SemDeep-6)

2020

The ACoLi Dictionary Graph
Christian Chiarcos | Christian Fäth | Maxim Ionov
Proceedings of the Twelfth Language Resources and Evaluation Conference

In this paper, we report the release of the ACoLi Dictionary Graph, a large-scale collection of multilingual open source dictionaries available in two machine-readable formats, a graph representation in RDF, using the OntoLex-Lemon vocabulary, and a simple tabular data format to facilitate their use in NLP tasks, such as translation inference across dictionaries. We describe the mapping and harmonization of the underlying data structures into a unified representation, its serialization in RDF and TSV, and the release of a massive and coherent amount of lexical data under open licenses.

The OntoLex vocabulary enjoys increasing popularity as a means of publishing lexical resources with RDF and as Linked Data. The recent publication of a new OntoLex module for lexicography, lexicog, reflects its increasing importance for digital lexicography. However, not all aspects of digital lexicography have been covered to the same extent. In particular, supplementary information drawn from corpora such as frequency information, links to attestations, and collocation data were considered to be beyond the scope of lexicog. Therefore, the OntoLex community has put forward the proposal for a novel module for frequency, attestation and corpus information (FrAC), that not only covers the requirements of digital lexicography, but also accommodates essential data structures for lexical information in natural language processing. This paper introduces the current state of the OntoLex-FrAC vocabulary, describes its structure, some selected use cases, elementary concepts and fundamental definitions, with a focus on frequency and attestations.

Fintan - Flexible, Integrated Transformation and Annotation eNgineering
Christian Fäth | Christian Chiarcos | Björn Ebbrecht | Maxim Ionov
Proceedings of the Twelfth Language Resources and Evaluation Conference

We introduce the Flexible and Integrated Transformation and Annotation eNgineering (Fintan) platform for converting heterogeneous linguistic resources to RDF. With its modular architecture, workflow management and visualization features, Fintan facilitates the development of complex transformation pipelines by integrating generic RDF converters and augmenting them with extended graph processing capabilities: existing converters can be easily deployed to the system by means of an ontological data structure which renders their properties and the dependencies between transformation steps. Development of subsequent graph transformation steps for resource transformation, annotation engineering or entity linking is further facilitated by a novel visual rendering of SPARQL queries. A graphical workflow manager makes it easy to manage the converter modules and combine them into new transformation pipelines. Employing the stream-based graph processing approach first implemented with CoNLL-RDF, we address common challenges and scalability issues when transforming resources and showcase the performance of Fintan by means of a purely graph-based transformation of the Universal Morphology data to RDF.
