Maria Copot
2024
Stranger than Paradigms: Word Embedding Benchmarks Don’t Align With Morphology
Timothee Mickus | Maria Copot
Proceedings of the Society for Computation in Linguistics 2024
Eesthetic: A Paralex Lexicon of Estonian Paradigms
Sacha Beniamine | Mari Aigro | Matthew Baerman | Jules Bouton | Maria Copot
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
We introduce Eesthetic, a comprehensive Estonian noun and verb lexicon sourced from the Ekilex database. It documents 5475 nouns inflecting for 28 paradigm cells and 5076 verbs inflecting for 51 cells, and comprises a total of 452885 inflected forms. Our openly accessible, machine-readable dataset adheres to the Paralex standard. It consists of CSV tables linked by formal relationships. Metadata in JSON format, following the Frictionless standard, provides detailed descriptions of the tables and the dataset. The lexicon offers extensive linguistic annotations, including orthographic forms, automatically generated phonemic transcriptions, non-canonical morphological phenomena such as overabundance and defectiveness, rich mappings of the paradigm cells and feature values to other notation schemes, a decomposition of phonemes into distinctive features, and annotation of inflection classes. It is suited for both monolingual and comparative research, enabling qualitative and quantitative analysis. This paper outlines the creation process, rationale, and resulting structure, along with our set of rules for automatic conversion from orthography to phonemic transcription.
2022
A Word-and-Paradigm Workflow for Fieldwork Annotation
Maria Copot | Sara Court | Noah Diewald | Stephanie Antetomaso | Micha Elsner
Proceedings of the Fifth Workshop on the Use of Computational Methods in the Study of Endangered Languages
There are many challenges in morphological fieldwork annotation: it relies heavily on segmentation and feature labeling (both of which have practical and theoretical drawbacks), it is time-intensive, and the annotator needs to be linguistically trained and may still annotate inconsistently. We propose a workflow that relies on unsupervised and active learning grounded in Word-and-Paradigm morphology (WP). Machine learning has the potential to greatly accelerate the annotation process and allow a human annotator to focus on problematic cases. The WP approach, in turn, makes for an annotation system that is word-based and relational. This removes the need to make decisions about feature labeling and segmentation early in the process and allows speakers of the language of interest to participate more actively, since linguistic training is not necessary. We present a proof of concept for the first step of the workflow: in a realistic fieldwork setting, annotators can process hundreds of forms per hour.