2024
The MOLOR Lemma Bank: a New LLOD Resource for Old Irish
Theodorus Fransen | Cormac Anderson | Sacha Beniamine | Marco Passarotti
Proceedings of the 9th Workshop on Linked Data in Linguistics @ LREC-COLING 2024
This paper describes the first steps in creating a Lemma Bank for Old Irish (600–900 CE) within the Linked Data paradigm, taking inspiration from a similar resource for Latin built as part of the LiLa project (2018–2023). The focus is on the extraction and RDF conversion of nouns from Goidelex, a novel and highly structured morphological resource for Old Irish. The aim is to balance a representative level of morphological granularity against keeping the number of lemma variants within workable limits, so as to facilitate straightforward resource interlinking for Old Irish, which is planned as future work.
Goidelex: A Lexical Resource for Old Irish
Cormac Anderson | Sacha Beniamine | Theodorus Fransen
Proceedings of the Third Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA) @ LREC-COLING-2024
We introduce Goidelex, a new lexical database resource for Old Irish. Goidelex is an openly accessible relational database in CSV format, linked by formal relationships. The launch version documents 695 headwords with extensive linguistic annotations, including orthographic forms using a normalised orthography, automatically generated phonemic transcriptions, and information about morphosyntactic features, such as gender, inflectional class, etc. Metadata in JSON format, following the Frictionless standard, provides detailed descriptions of the tables and dataset. The database is designed to be fully compatible with the Paralex and CLDF standards and is interoperable with existing lexical resources for Old Irish such as CorPH and eDIL. It is suited to both qualitative and quantitative investigation into Old Irish morphology and lexicon, as well as to comparative research. This paper outlines the creation process, rationale, and resulting structure of the database.
Eesthetic: A Paralex Lexicon of Estonian Paradigms
Sacha Beniamine | Mari Aigro | Matthew Baerman | Jules Bouton | Maria Copot
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
We introduce Eesthetic, a comprehensive Estonian noun and verb lexicon sourced from the Ekilex database. It documents 5475 nouns inflecting for 28 paradigm cells and 5076 verbs inflecting for 51 cells, and comprises a total of 452885 inflected forms. Our openly accessible machine-readable dataset adheres to the Paralex standard. It comprises CSV tables linked by formal relationships. Metadata in JSON format, following the Frictionless standard, provides detailed descriptions of the tables and dataset. The lexicon offers extensive linguistic annotations, including orthographic forms, automatically generated phonemic transcriptions, non-canonical morphological phenomena such as overabundance and defectiveness, rich mapping of the paradigm cells and feature-values to other notation schemes, a decomposition of phonemes into distinctive features, and annotation of inflection classes. It is suited to both monolingual and comparative research, enabling qualitative and quantitative analysis. This paper outlines the creation process, rationale, and resulting structure, along with our set of rules for automatic conversion from orthography to phonemic transcription.
2021
Multiple alignments of inflectional paradigms
Sacha Beniamine | Matías Guzmán Naranjo
Proceedings of the Society for Computation in Linguistics 2021
2020
Automated Parsing of Interlinear Glossed Text from Page Images of Grammatical Descriptions
Erich Round | Mark Ellison | Jayden Macklin-Cordes | Sacha Beniamine
Proceedings of the Twelfth Language Resources and Evaluation Conference
Linguists seek insight from all human languages; however, accessing information from most of the full store of extant global linguistic descriptions is not easy. One of the most common kinds of information that linguists have documented is vernacular sentences, as recorded in descriptive grammars. Typically these sentences are formatted as interlinear glossed text (IGT). Most descriptive grammars, however, exist only as hardcopy or scanned PDF documents. Consequently, parsing IGTs in scanned grammars is a priority, in order to significantly increase the volume of documented linguistic information that is readily accessible. Here we demonstrate fundamental viability for a technology that can assist in making a large number of linguistic data sources machine readable: the automated identification and parsing of interlinear glossed text from scanned page images. For example, we attain high median precision and recall (>0.95) in the identification of example sentences in IGT format. Our results will be of interest to those who are keen to see more of the existing documentation of human language, especially for less-resourced and endangered languages, become more readily accessible.
Opening the Romance Verbal Inflection Dataset 2.0: A CLDF lexicon
Sacha Beniamine | Martin Maiden | Erich Round
Proceedings of the Twelfth Language Resources and Evaluation Conference
We introduce the Romance Verbal Inflection Dataset 2.0, a multilingual lexicon of Romance inflection covering 74 varieties. The lexicon provides verbal paradigm forms in broad IPA phonemic notation. Both lexemes and paradigm cells are organized to reflect cognacy. Such multilingual inflected lexicons annotated for two dimensions of cognacy are necessary to study the evolution of inflectional paradigms and to test linguistic hypotheses systematically. However, these resources seldom exist, and when they do, they are not usually encoded in computationally usable ways. The Oxford Online Database of Romance Verb Morphology provides this kind of information, but it is no longer maintained and is only available as a web service without interfaces for machine-readability. We collect its data, then clean and correct it for consistency using both heuristics and expert annotator judgements. Most resources used to study language evolution computationally rely strictly on multilingual contemporary information, and lack information about prior stages of the languages. To provide such information, we augment the database with Latin paradigms from the LatInFlexi lexicon. Finally, to make it widely available, the resource is released under a GPLv3 license in CLDF format.
2017
Une approche universelle pour l’abstraction automatique d’alternances morphophonologiques (A universal algorithm for the automatical abstraction of morphophonological alternations)
Sacha Beniamine
Actes des 24ème Conférence sur le Traitement Automatique des Langues Naturelles. Volume 2 - Articles courts
This article presents an implemented algorithm for inferring patterns of morphophonological alternations between word-forms. It is universal in the sense that it yields classifications that are comparable from one language to another without prejudging the types of alternations. The patterns constitute a first step for quantitative work in the Word and Paradigm approach to morphology.