Antti Arppe


2022

An Expanded Finite-State Transducer for Tsuut’ina Verbs
Joshua Holden | Christopher Cox | Antti Arppe
Proceedings of the Thirteenth Language Resources and Evaluation Conference

This paper describes the expansion of a finite state transducer (FST) for the transitive verb system of Tsuut’ina (ISO 639-3: srs), a Dene (Athabaskan) language spoken in Alberta, Canada. Dene languages have unique templatic morphology, in which lexical, inflectional and derivational tiers are interlaced. Drawing on data from over 12,000 verb forms, the expanded model can handle a great range of common and rare argument structure types, including ditransitive and uniquely Dene object experiencer verbs. While challenges of speed remain, this expansion shows the ability of FST modelling to handle morphology of this type, and the expanded FST shows great promise for community language applications such as a morphologically informed online dictionary and word predictor, and for further FST development.

Proceedings of the Fifth Workshop on the Use of Computational Methods in the Study of Endangered Languages
Sarah Moeller | Antonios Anastasopoulos | Antti Arppe | Aditi Chaudhary | Atticus Harrigan | Josh Holden | Jordan Lachler | Alexis Palmer | Shruti Rijhwani | Lane Schwartz
Proceedings of the Fifth Workshop on the Use of Computational Methods in the Study of Endangered Languages

Interactive Word Completion for Plains Cree
William Lane | Atticus Harrigan | Antti Arppe
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The composition of richly-inflected words in morphologically complex languages can be a challenge for language learners developing literacy. Accordingly, Lane and Bird (2020) proposed a finite state approach which maps prefixes in a language to a set of possible completions up to the next morpheme boundary, for the incremental building of complex words. In this work, we develop an approach to morph-based auto-completion based on a finite state morphological analyzer of Plains Cree (nêhiyawêwin), showing the portability of the concept to a much larger, more complete morphological transducer. Additionally, we propose and compare various novel ranking strategies on the morph auto-complete output. The best weighting scheme ranks the target completion in the top 10 results in 64.9% of queries, and in the top 50 in 73.9% of queries.
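
The core idea in the abstract above, mapping a typed prefix to completions that stop at the next morpheme boundary, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the toy lexicon is segmented English rather than Plains Cree, a plain list stands in for the finite state analyzer, and the frequency-based ordering is a placeholder, not one of the ranking strategies compared in the paper.

```python
from collections import Counter

# Hypothetical toy lexicon: English words segmented into morphs with "-".
# A real system would query a finite-state morphological analyzer instead.
SEGMENTED = ["un-lock-ed", "un-lock-ing", "un-load-ed", "un-do", "re-load-ed"]


def next_morph_completions(prefix, segmented=SEGMENTED):
    """Return completions of `prefix` up to the next morpheme boundary.

    Completions are ordered by how many lexicon entries share them
    (an assumed, simplistic ranking).
    """
    counts = Counter()
    for word in segmented:
        morphs = word.split("-")
        surface = "".join(morphs)
        # Skip words that do not extend the prefix.
        if not surface.startswith(prefix) or surface == prefix:
            continue
        # Walk the morph boundaries and stop at the first one past the prefix.
        end = 0
        for morph in morphs:
            end += len(morph)
            if end > len(prefix):
                counts[surface[:end]] += 1
                break
    return [completion for completion, _ in counts.most_common()]


print(next_morph_completions("un"))
print(next_morph_completions("unlock"))
```

Each call offers only the next morph-sized step, so a word is built incrementally by repeated selection rather than by choosing among full word forms.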

2021

Proceedings of the 4th Workshop on the Use of Computational Methods in the Study of Endangered Languages Volume 1 (Papers)
Antti Arppe | Jeff Good | Atticus Harrigan | Mans Hulden | Jordan Lachler | Sarah Moeller | Alexis Palmer | Miikka Silfverberg | Lane Schwartz
Proceedings of the 4th Workshop on the Use of Computational Methods in the Study of Endangered Languages Volume 1 (Papers)

Computational Analysis versus Human Intuition: A Critical Comparison of Vector Semantics with Manual Semantic Classification in the Context of Plains Cree
Daniel Dacanay | Atticus Harrigan | Antti Arppe
Proceedings of the 4th Workshop on the Use of Computational Methods in the Study of Endangered Languages Volume 1 (Papers)

Leveraging English Word Embeddings for Semi-Automatic Semantic Classification in Nêhiyawêwin (Plains Cree)
Atticus Harrigan | Antti Arppe
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas

This paper details a semi-automatic method of word clustering for the Algonquian language Nêhiyawêwin (Plains Cree). Although this method worked well, particularly for nouns, it required some amount of manual postprocessing. The main benefit of this approach over implementing an existing classification ontology is that this method approaches the language from an endogenous point of view, while performing classification more quickly than in a fully manual context.

The More Detail, the Better? – Investigating the Effects of Semantic Ontology Specificity on Vector Semantic Classification with a Plains Cree / nêhiyawêwin Dictionary
Daniel Dacanay | Atticus Harrigan | Arok Wolvengrey | Antti Arppe
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas

One problem in the task of automatic semantic classification is determining the level on which to group lexical items. This is often accomplished using pre-made, hierarchical semantic ontologies. The following investigation explores the computational assignment of semantic classifications to the contents of a dictionary of nêhiyawêwin / Plains Cree (ISO: crk, Algonquian, Western Canada and United States), using a semantic vector space model and following two semantic ontologies, WordNet and SIL’s Rapid Words, and compares these computational results with manual classifications made with the same two ontologies.

2020

Automated Phonological Transcription of Akkadian Cuneiform Text
Aleksi Sahala | Miikka Silfverberg | Antti Arppe | Krister Lindén
Proceedings of the Twelfth Language Resources and Evaluation Conference

Akkadian was an East-Semitic language spoken in ancient Mesopotamia. The language is attested on hundreds of thousands of cuneiform clay tablets. Several Akkadian text corpora contain only the transliterated text. In this paper, we investigate automated phonological transcription of the transliterated corpora. The phonological transcription provides a linguistically appealing form to represent Akkadian, because the transcription is normalized according to the grammatical description of a given dialect and explicitly shows the Akkadian renderings for Sumerian logograms. Because cuneiform text does not mark the inflection for logograms, the inflected form needs to be inferred from the sentence context. To the best of our knowledge, this is the first documented attempt to automatically transcribe Akkadian. Using a context-aware neural network model, we are able to automatically transcribe syllabic tokens at near human performance with 96% recall @ 3, while the logogram transcription remains more challenging at 82% recall @ 3.
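
The "recall @ 3" figures in the abstract above can be read as the share of tokens whose correct transcription appears among the model's top three candidates. The sketch below illustrates that reading only; the function name and the placeholder data are hypothetical, and the paper's exact evaluation protocol may differ.

```python
def recall_at_k(ranked_predictions, gold, k=3):
    """Fraction of items whose gold answer is among the top-k ranked candidates."""
    if not gold or len(ranked_predictions) != len(gold):
        raise ValueError("need equally sized, non-empty prediction and gold lists")
    hits = sum(
        1 for candidates, answer in zip(ranked_predictions, gold)
        if answer in candidates[:k]
    )
    return hits / len(gold)


# Placeholder candidate lists (not real transcriptions), best candidate first.
predictions = [["a", "b", "c", "d"], ["x", "y", "z"], ["q", "r", "s", "t"]]
gold = ["c", "w", "t"]

print(recall_at_k(predictions, gold, k=3))
print(recall_at_k(predictions, gold, k=4))
```

Raising k can only keep or increase the score, which is why recall at a small k is a stricter measure than recall over the full candidate list.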

BabyFST - Towards a Finite-State Based Computational Model of Ancient Babylonian
Aleksi Sahala | Miikka Silfverberg | Antti Arppe | Krister Lindén
Proceedings of the Twelfth Language Resources and Evaluation Conference

Akkadian is a fairly well-resourced extinct language that does not yet have a comprehensive morphological analyzer available. In this paper we describe a general finite-state based morphological model for Babylonian, a southern dialect of the Akkadian language, that can achieve a coverage of up to 97.3% and a recall of up to 93.7% on a token-level lemmatization and POS-tagging task from transcribed input. Since Akkadian word forms exhibit a high degree of morphological ambiguity, in that only 20.1% of running word tokens receive a single unambiguous analysis, we attempt a first pass at weighting our finite-state transducer, using existing extensive Akkadian corpora which have been partially validated for their lemmas and parts-of-speech but not the entire morphological analyses. The resultant weighted finite-state transducer yields a moderate improvement so that for 57.4% of the word tokens the highest ranked analysis is the correct one. We conclude with a short discussion on how morphological ambiguity in the analysis of Akkadian could be further reduced with improvements in the training data used in weighting the finite-state transducer as well as through other, context-based techniques.

2019

Proceedings of the 3rd Workshop on the Use of Computational Methods in the Study of Endangered Languages Volume 1 (Papers)
Antti Arppe | Jeff Good | Mans Hulden | Jordan Lachler | Alexis Palmer | Lane Schwartz | Miikka Silfverberg
Proceedings of the 3rd Workshop on the Use of Computational Methods in the Study of Endangered Languages Volume 1 (Papers)

A Preliminary Plains Cree Speech Synthesizer
Atticus Harrigan | Antti Arppe | Timothy Mills
Proceedings of the 3rd Workshop on the Use of Computational Methods in the Study of Endangered Languages Volume 1 (Papers)

2018

A Computational Architecture for the Morphology of Upper Tanana
Olga Lovick | Christopher Cox | Miikka Silfverberg | Antti Arppe | Mans Hulden
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Modeling Northern Haida Verb Morphology
Jordan Lachler | Lene Antonsen | Trond Trosterud | Sjur Moshagen | Antti Arppe
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Building a Constraint Grammar Parser for Plains Cree Verbs and Arguments
Katherine Schmirler | Antti Arppe | Trond Trosterud | Lene Antonsen
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Indigenous language technologies in Canada: Assessment, challenges, and successes
Patrick Littell | Anna Kazantseva | Roland Kuhn | Aidan Pine | Antti Arppe | Christopher Cox | Marie-Odile Junker
Proceedings of the 27th International Conference on Computational Linguistics

In this article, we discuss which text, speech, and image technologies have been developed, and would be feasible to develop, for the approximately 60 Indigenous languages spoken in Canada. In particular, we concentrate on technologies that may be feasible to develop for most or all of these languages, not just those that may be feasible for the few most-resourced of these. We assess past achievements and consider future horizons for Indigenous language transliteration, text prediction, spell-checking, approximate search, machine translation, speech recognition, speaker diarization, speech synthesis, optical character recognition, and computer-aided language learning.

2017

Proceedings of the 2nd Workshop on the Use of Computational Methods in the Study of Endangered Languages
Antti Arppe | Jeff Good | Mans Hulden | Jordan Lachler | Alexis Palmer | Lane Schwartz
Proceedings of the 2nd Workshop on the Use of Computational Methods in the Study of Endangered Languages

A Morphological Parser for Odawa
Dustin Bowers | Antti Arppe | Jordan Lachler | Sjur Moshagen | Trond Trosterud
Proceedings of the 2nd Workshop on the Use of Computational Methods in the Study of Endangered Languages

Converting a comprehensive lexical database into a computational model: The case of East Cree verb inflection
Antti Arppe | Marie-Odile Junker | Delasie Torkornoo
Proceedings of the 2nd Workshop on the Use of Computational Methods in the Study of Endangered Languages

2016

Training & Quality Assessment of an Optical Character Recognition Model for Northern Haida
Isabell Hubert | Antti Arppe | Jordan Lachler | Eddie Antonio Santos
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We are presenting our work on the creation of the first optical character recognition (OCR) model for Northern Haida, also known as Masset or Xaad Kil, a nearly extinct First Nations language spoken in the Haida Gwaii archipelago in British Columbia, Canada. We are addressing the challenges of training an OCR model for a language with an extensive, non-standard Latin character set as follows: (1) We have compared various training approaches and present the results of practical analyses to maximize recognition accuracy and minimize manual labor. An approach using just one or two pages of Source Images directly performed better than the Image Generation approach, and better than models based on three or more pages. Analyses also suggest that a character’s frequency is directly correlated with its recognition accuracy. (2) We present an overview of current OCR accuracy analysis tools available. (3) We have ported the once de-facto standardized OCR accuracy tools to be able to cope with Unicode input. Our work adds to a growing body of research on OCR for particularly challenging character sets, and contributes to creating the largest electronic corpus for this severely endangered language.

2014

Modeling the Noun Morphology of Plains Cree
Conor Snoek | Dorothy Thunder | Kaidi Lõo | Antti Arppe | Jordan Lachler | Sjur Moshagen | Trond Trosterud
Proceedings of the 2014 Workshop on the Use of Computational Methods in the Study of Endangered Languages

2000

Developing a grammar checker for Swedish
Antti Arppe
Proceedings of the 12th Nordic Conference of Computational Linguistics (NODALIDA 1999)