Antti Arppe - ACL Anthology

Antti Arppe

2025

Analyzing and generating English phrases with finite-state methods to match and translate inflected Plains Cree word-forms
Antti Arppe
Proceedings of the Fifth Workshop on NLP for Indigenous Languages of the Americas (AmericasNLP)

This paper presents two finite-state transducer tools that analyze or generate simple English verb and noun phrases, which can in turn be mapped to inflected Plains Cree (nêhiyawêwin) verb and noun forms. These tools support fetching an inflected Cree word-form directly with an appropriate plain English phrase, and conversely providing a rough translation of an inflected Cree word-form. Such functionalities can be used to improve the user-friendliness of on-line dictionaries. The tools are extendable to other similarly morphologically complex languages.

Proceedings of the Eight Workshop on the Use of Computational Methods in the Study of Endangered Languages
Jordan Lachler | Godfred Agyapong | Antti Arppe | Sarah Moeller | Aditi Chaudhary | Shruti Rijhwani | Daisy Rosenblum
Proceedings of the Eight Workshop on the Use of Computational Methods in the Study of Endangered Languages

Creating an intelligent dictionary of Tsuut’ina one verb at a time
Christopher Cox | Bruce Starlight | Janelle Crane-Starlight | Hanna Big Crow | Antti Arppe
Proceedings of the Eight Workshop on the Use of Computational Methods in the Study of Endangered Languages

In this paper, we discuss the development of a long-term partnership between community and university-based language workers to create supportive language technologies for Tsuut’ina, a critically endangered Dene language spoken in southern Alberta, Canada. Initial development activities in this partnership sought to rapidly integrate existing language materials, with the aim of arriving at tools that would be effective and impactful for community use by virtue of their extensive lexical coverage. We describe how, as this partnership developed, this approach was gradually superseded by one that involved a more targeted, lexical-item-by-lexical-item review process that was directly informed by other community language priorities and connected to the work of a local language authority. We describe how this shift in processes correlated with other changes in local language programs and priorities, noting how ongoing communication allowed this partnership to adapt to the evolving needs of local organizations.

AI for Interlinearization and POS-tagging: Teaching Linguists to Fish
Olga Kriukova | Katherine Schmirler | Sarah Moeller | Olga Lovick | Inge Genee | Antti Arppe | Alexandra Smith
Proceedings of the Eight Workshop on the Use of Computational Methods in the Study of Endangered Languages

This paper describes the process and learning outcomes of a three-day workshop on machine learning basics for documentary linguists. During this workshop, two groups of linguists working with two Indigenous languages of North America, Blackfoot and Dënë Sųłıné, became acquainted with machine learning principles, explored how machine learning can be used in data processing for under-resourced languages and then applied different machine learning methods for automatic morphological interlinearization and parts-of-speech tagging. As a result, participants discovered paths to greater collaboration between computer science and documentary linguistics and reflected on how linguists might be enabled to apply machine learning with less dependence on experts.

2024

Proceedings of the Seventh Workshop on the Use of Computational Methods in the Study of Endangered Languages
Sarah Moeller | Godfred Agyapong | Antti Arppe | Aditi Chaudhary | Shruti Rijhwani | Christopher Cox | Ryan Henke | Alexis Palmer | Daisy Rosenblum | Lane Schwartz
Proceedings of the Seventh Workshop on the Use of Computational Methods in the Study of Endangered Languages

Are modern neural ASR architectures robust for polysynthetic languages?
Eric Le Ferrand | Zoey Liu | Antti Arppe | Emily Prud’hommeaux
Findings of the Association for Computational Linguistics: EMNLP 2024

Automatic speech recognition (ASR) technology is frequently proposed as a means of preservation and documentation of endangered languages, with promising results thus far. Among the endangered languages spoken today, a significant number exhibit complex morphology. The models employed in contemporary language documentation pipelines that utilize ASR, however, are predominantly based on isolating or inflectional languages, often from the Indo-European family. This raises a critical concern: building models exclusively on such languages may introduce a bias, resulting in better performance with simpler morphological structures. In this paper, we investigate the performance of modern ASR architectures on morphologically complex languages. Results indicate that modern ASR architectures appear less robust in managing high OOV rates for morphologically complex languages in terms of word error rate, while character error rates are consistently higher for isolating languages.

Word-level prediction in Plains Cree: First steps
Olga Kriukova | Antti Arppe
Proceedings of the 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP 2024)

Plains Cree (nêhiyawêwin) is a morphologically complex and predominantly prefixing language. The combinatory potential of inflectional and derivational/lexical prefixes and verb stems in Plains Cree makes it challenging for traditional auto-completion (or word suggestion) approaches to handle. The lack of a large corpus of Plains Cree also complicates the situation. This study attempts to investigate how well a BiLSTM model trained on a small Cree corpus can handle a word suggestion task. Moreover, this study evaluates whether the use of semantically and morphosyntactically refined Word2Vec embeddings can improve the overall accuracy and quality of BiLSTM suggestions. The results show that some models trained with the refined vectors provide semantically and morphosyntactically better suggestions. They are also more accurate in predictions of content words. The model trained with the non-refined vectors, in contrast, was better at predicting conjunctions, particles, and other non-inflecting words. The models trained with different refined vector combinations provide the expected next word among top-10 predictions in 36.73 to 37.88% of cases (depending on the model).

Machine-in-the-Loop with Documentary and Descriptive Linguists
Sarah Moeller | Antti Arppe
Proceedings of the Seventh Workshop on the Use of Computational Methods in the Study of Endangered Languages

This paper describes a curriculum for teaching linguists how to apply a machine-in-the-loop (MitL) approach to documentary and descriptive tasks. It also shares observations about the learning participants, who are primarily non-computational linguists, and how they interact with the MitL approach. We found that they prefer cleaning over increasing the training data and then proceed to reanalyze their analytical decisions, before finally undertaking small actions that emphasize analytical strategies. Overall, participants display an understanding of the curriculum, which covers fundamental concepts of machine learning and statistical modeling.

2023

Finding words that aren’t there: Using word embeddings to improve dictionary search for low-resource languages
Antti Arppe | Andrew Neitsch | Daniel Dacanay | Jolene Poulin | Daniel Hieber | Atticus Harrigan
Proceedings of the Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP)

Modern machine learning techniques have produced many impressive results in language technology, but these techniques generally require an amount of training data that is many orders of magnitude greater than what exists for low-resource languages in general, and endangered ones in particular. However, dictionary definitions in a comparatively much more well-resourced majority language can provide a link between low-resource languages and machine learning models trained on massive amounts of majority-language data. By leveraging a pre-trained English word embedding to compute sentence embeddings for definitions in bilingual dictionaries for four Indigenous languages spoken in North America, Plains Cree (nêhiyawêwin), Arapaho (Hinno’itit), Northern Haida (Xaad Kíl), and Tsuut’ina (Tsúùt’ínà), we have obtained promising results for dictionary search. Not only are the search results in the majority language of the definitions more relevant, but they can be semantically relevant in ways not achievable with classic information retrieval techniques: users can perform successful searches for words that do not occur at all in the dictionary. These techniques are directly applicable to any bilingual dictionary providing translations between a high- and low-resource language.
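
The search idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the three-dimensional vectors, the placeholder headwords, and the helper names `embed`, `cosine`, and `search` are all assumptions made for illustration, and averaging word vectors is only one simple way to obtain a sentence embedding for a definition.

```python
import math

# Toy 3-dimensional "word embeddings" (hypothetical values, illustration only;
# a real system would load a pre-trained English embedding instead).
TOY_VECTORS = {
    "dog": [0.9, 0.1, 0.0],
    "puppy": [0.85, 0.2, 0.0],
    "animal": [0.7, 0.3, 0.1],
    "water": [0.0, 0.1, 0.9],
    "river": [0.1, 0.0, 0.95],
    "flowing": [0.1, 0.2, 0.8],
}

# Placeholder headwords with English definitions (hypothetical dictionary).
DEFINITIONS = {
    "headword_1": "dog animal",
    "headword_2": "flowing water river",
}

def embed(text):
    """Average the vectors of the known words; None if no word is known."""
    vecs = [TOY_VECTORS[w] for w in text.lower().split() if w in TOY_VECTORS]
    if not vecs:
        return None
    return [sum(column) / len(vecs) for column in zip(*vecs)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(query, definitions=DEFINITIONS):
    """Rank headwords by similarity between the query and each definition."""
    q = embed(query)
    if q is None:
        return []
    scored = []
    for headword, definition in definitions.items():
        d = embed(definition)
        if d is not None:
            scored.append((cosine(q, d), headword))
    return [headword for _, headword in sorted(scored, reverse=True)]
```

In this toy setup the query "puppy" appears in no definition, yet `search("puppy")` ranks `headword_1` first because its definition lies close to the query in the vector space.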

Speech Database (Speech-DB) – An on-line platform for storing, validating, searching, and recording spoken language data
Jolene Poulin | Daniel Dacanay | Antti Arppe
Proceedings of the Second Workshop on NLP Applications to Field Linguistics

The Speech Database (Speech-DB; URL: https://speech-db.altlab.app) is an on-line platform for language documentation, written and spoken language validation, and speech exploration; its code-base is available as open source. In its current state, Speech-DB has expanded to contain content for several Indigenous languages spoken in Western Canada, having started with audio for the dialect of Plains Cree spoken in Maskwacîs, Alberta, Canada. Currently, it is used primarily for validation and storage. It can be accessed by anyone with an internet connection, with six levels of access rights. What follows is the rationale for the development of Speech-DB, an exploration of its features, and a description of usage scenarios, as well as initial user feedback on the application.

Proceedings of the Sixth Workshop on the Use of Computational Methods in the Study of Endangered Languages
Atticus Harrigan | Aditi Chaudhary | Shruti Rijhwani | Sarah Moeller | Antti Arppe | Alexis Palmer | Ryan Henke | Daisy Rosenblum
Proceedings of the Sixth Workshop on the Use of Computational Methods in the Study of Endangered Languages

2022

Interactive Word Completion for Plains Cree
William Lane | Atticus Harrigan | Antti Arppe
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The composition of richly-inflected words in morphologically complex languages can be a challenge for language learners developing literacy. Accordingly, Lane and Bird (2020) proposed a finite state approach which maps prefixes in a language to a set of possible completions up to the next morpheme boundary, for the incremental building of complex words. In this work, we develop an approach to morph-based auto-completion based on a finite state morphological analyzer of Plains Cree (nêhiyawêwin), showing the portability of the concept to a much larger, more complete morphological transducer. Additionally, we propose and compare various novel ranking strategies on the morph auto-complete output. The best weighting scheme ranks the target completion in the top 10 results in 64.9% of queries, and in the top 50 in 73.9% of queries.
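
The notion of completing a typed prefix only up to the next morpheme boundary can be sketched as follows. This is a plain-Python stand-in rather than a finite-state transducer, and it does not reproduce the paper's analyzer or ranking strategies; the English-like segmented word list and the function name `next_morphs` are hypothetical and used only for illustration.

```python
# Hypothetical word list, each word pre-segmented into morphs.
SEGMENTED = [
    ["un", "break", "able"],
    ["un", "bend", "ing"],
    ["un", "bend", "able"],
    ["re", "build", "ing"],
]

def next_morphs(prefix, words=SEGMENTED):
    """Return the completions of `prefix` that end at the next morpheme boundary."""
    completions = set()
    for morphs in words:
        surface = ""
        for morph in morphs:
            surface += morph
            # The first boundary strictly beyond the typed prefix is the target.
            if surface.startswith(prefix) and len(prefix) < len(surface):
                completions.add(surface)
                break
    return sorted(completions)
```

With this list, `next_morphs("un")` offers `unbend` and `unbreak`, and choosing `unbend` leads to `unbendable` and `unbending` at the next step, so the word is built one morph at a time.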

An Expanded Finite-State Transducer for Tsuut’ina Verbs
Joshua Holden | Christopher Cox | Antti Arppe
Proceedings of the Thirteenth Language Resources and Evaluation Conference

This paper describes the expansion of a finite state transducer (FST) for the transitive verb system of Tsuut’ina (ISO 639-3: srs), a Dene (Athabaskan) language spoken in Alberta, Canada. Dene languages have unique templatic morphology, in which lexical, inflectional and derivational tiers are interlaced. Drawing on data from over 12,000 verb forms, the expanded model can handle a great range of common and rare argument structure types, including ditransitive and uniquely Dene object experiencer verbs. While challenges of speed remain, this expansion shows the ability of FST modelling to handle morphology of this type, and the expanded FST shows great promise for community language applications such as a morphologically informed online dictionary and word predictor, and for further FST development.

Proceedings of the Fifth Workshop on the Use of Computational Methods in the Study of Endangered Languages
Sarah Moeller | Antonios Anastasopoulos | Antti Arppe | Aditi Chaudhary | Atticus Harrigan | Josh Holden | Jordan Lachler | Alexis Palmer | Shruti Rijhwani | Lane Schwartz
Proceedings of the Fifth Workshop on the Use of Computational Methods in the Study of Endangered Languages

2021

Leveraging English Word Embeddings for Semi-Automatic Semantic Classification in Nêhiyawêwin (Plains Cree)
Atticus Harrigan | Antti Arppe
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas

This paper details a semi-automatic method of word clustering for the Algonquian language, Nêhiyawêwin (Plains Cree). Although this method worked well, particularly for nouns, it required some amount of manual postprocessing. The main benefit of this approach over implementing an existing classification ontology is that this method approaches the language from an endogenous point of view, while performing classification quicker than in a fully manual context.

The More Detail, the Better? – Investigating the Effects of Semantic Ontology Specificity on Vector Semantic Classification with a Plains Cree / nêhiyawêwin Dictionary
Daniel Dacanay | Atticus Harrigan | Arok Wolvengrey | Antti Arppe
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas

One problem in the task of automatic semantic classification is determining the level on which to group lexical items. This is often accomplished using pre-made, hierarchical semantic ontologies. The following investigation explores the computational assignment of semantic classifications on the contents of a dictionary of nêhiyawêwin / Plains Cree (ISO: crk, Algonquian, Western Canada and United States), using a semantic vector space model, and following two semantic ontologies, WordNet and SIL’s Rapid Words, and examines how these computational results compare to manual classifications with the same two ontologies.

Proceedings of the 4th Workshop on the Use of Computational Methods in the Study of Endangered Languages Volume 1 (Papers)
Antti Arppe | Jeff Good | Atticus Harrigan | Mans Hulden | Jordan Lachler | Sarah Moeller | Alexis Palmer | Miikka Silfverberg | Lane Schwartz
Proceedings of the 4th Workshop on the Use of Computational Methods in the Study of Endangered Languages Volume 1 (Papers)

Computational Analysis versus Human Intuition: A Critical Comparison of Vector Semantics with Manual Semantic Classification in the Context of Plains Cree
Daniel Dacanay | Atticus Harrigan | Antti Arppe
Proceedings of the 4th Workshop on the Use of Computational Methods in the Study of Endangered Languages Volume 1 (Papers)

2020

Automated Phonological Transcription of Akkadian Cuneiform Text
Aleksi Sahala | Miikka Silfverberg | Antti Arppe | Krister Lindén
Proceedings of the Twelfth Language Resources and Evaluation Conference

Akkadian was an East-Semitic language spoken in ancient Mesopotamia. The language is attested on hundreds of thousands of cuneiform clay tablets. Several Akkadian text corpora contain only the transliterated text. In this paper, we investigate automated phonological transcription of the transliterated corpora. The phonological transcription provides a linguistically appealing form to represent Akkadian, because the transcription is normalized according to the grammatical description of a given dialect and explicitly shows the Akkadian renderings for Sumerian logograms. Because cuneiform text does not mark the inflection for logograms, the inflected form needs to be inferred from the sentence context. To the best of our knowledge, this is the first documented attempt to automatically transcribe Akkadian. Using a context-aware neural network model, we are able to automatically transcribe syllabic tokens at near human performance with 96% recall @ 3, while the logogram transcription remains more challenging at 82% recall @ 3.

BabyFST - Towards a Finite-State Based Computational Model of Ancient Babylonian
Aleksi Sahala | Miikka Silfverberg | Antti Arppe | Krister Lindén
Proceedings of the Twelfth Language Resources and Evaluation Conference

Akkadian is a fairly well resourced extinct language that does not yet have a comprehensive morphological analyzer available. In this paper we describe a general finite-state based morphological model for Babylonian, a southern dialect of the Akkadian language, that can achieve a coverage up to 97.3% and recall up to 93.7% on lemmatization and POS-tagging task on token level from a transcribed input. Since Akkadian word forms exhibit a high degree of morphological ambiguity, in that only 20.1% of running word tokens receive a single unambiguous analysis, we attempt a first pass at weighting our finite-state transducer, using existing extensive Akkadian corpora which have been partially validated for their lemmas and parts-of-speech but not the entire morphological analyses. The resultant weighted finite-state transducer yields a moderate improvement so that for 57.4% of the word tokens the highest ranked analysis is the correct one. We conclude with a short discussion on how morphological ambiguity in the analysis of Akkadian could be further reduced with improvements in the training data used in weighting the finite-state transducer as well as through other, context-based techniques.

2019

Proceedings of the 3rd Workshop on the Use of Computational Methods in the Study of Endangered Languages Volume 1 (Papers)
Antti Arppe | Jeff Good | Mans Hulden | Jordan Lachler | Alexis Palmer | Lane Schwartz | Miikka Silfverberg
Proceedings of the 3rd Workshop on the Use of Computational Methods in the Study of Endangered Languages Volume 1 (Papers)

A Preliminary Plains Cree Speech Synthesizer
Atticus Harrigan | Antti Arppe | Timothy Mills
Proceedings of the 3rd Workshop on the Use of Computational Methods in the Study of Endangered Languages Volume 1 (Papers)

2018

A Computational Architecture for the Morphology of Upper Tanana
Olga Lovick | Christopher Cox | Miikka Silfverberg | Antti Arppe | Mans Hulden
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Indigenous language technologies in Canada: Assessment, challenges, and successes
Patrick Littell | Anna Kazantseva | Roland Kuhn | Aidan Pine | Antti Arppe | Christopher Cox | Marie-Odile Junker
Proceedings of the 27th International Conference on Computational Linguistics

In this article, we discuss which text, speech, and image technologies have been developed, and would be feasible to develop, for the approximately 60 Indigenous languages spoken in Canada. In particular, we concentrate on technologies that may be feasible to develop for most or all of these languages, not just those that may be feasible for the few most-resourced of these. We assess past achievements and consider future horizons for Indigenous language transliteration, text prediction, spell-checking, approximate search, machine translation, speech recognition, speaker diarization, speech synthesis, optical character recognition, and computer-aided language learning.

Modeling Northern Haida Verb Morphology
Jordan Lachler | Lene Antonsen | Trond Trosterud | Sjur Moshagen | Antti Arppe
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Building a Constraint Grammar Parser for Plains Cree Verbs and Arguments
Katherine Schmirler | Antti Arppe | Trond Trosterud | Lene Antonsen
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

Converting a comprehensive lexical database into a computational model: The case of East Cree verb inflection
Antti Arppe | Marie-Odile Junker | Delasie Torkornoo
Proceedings of the 2nd Workshop on the Use of Computational Methods in the Study of Endangered Languages

A Morphological Parser for Odawa
Dustin Bowers | Antti Arppe | Jordan Lachler | Sjur Moshagen | Trond Trosterud
Proceedings of the 2nd Workshop on the Use of Computational Methods in the Study of Endangered Languages

Proceedings of the 2nd Workshop on the Use of Computational Methods in the Study of Endangered Languages
Antti Arppe | Jeff Good | Mans Hulden | Jordan Lachler | Alexis Palmer | Lane Schwartz
Proceedings of the 2nd Workshop on the Use of Computational Methods in the Study of Endangered Languages

2016

Training & Quality Assessment of an Optical Character Recognition Model for Northern Haida
Isabell Hubert | Antti Arppe | Jordan Lachler | Eddie A. Santos
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We are presenting our work on the creation of the first optical character recognition (OCR) model for Northern Haida, also known as Masset or Xaad Kil, a nearly extinct First Nations language spoken in the Haida Gwaii archipelago in British Columbia, Canada. We are addressing the challenges of training an OCR model for a language with an extensive, non-standard Latin character set as follows: (1) We have compared various training approaches and present the results of practical analyses to maximize recognition accuracy and minimize manual labor. An approach using just one or two pages of Source Images directly performed better than the Image Generation approach, and better than models based on three or more pages. Analyses also suggest that a character’s frequency is directly correlated with its recognition accuracy. (2) We present an overview of current OCR accuracy analysis tools available. (3) We have ported the once de-facto standardized OCR accuracy tools to be able to cope with Unicode input. Our work adds to a growing body of research on OCR for particularly challenging character sets, and contributes to creating the largest electronic corpus for this severely endangered language.

2014

Modeling the Noun Morphology of Plains Cree
Conor Snoek | Dorothy Thunder | Kaidi Lõo | Antti Arppe | Jordan Lachler | Sjur Moshagen | Trond Trosterud
Proceedings of the 2014 Workshop on the Use of Computational Methods in the Study of Endangered Languages

2000

Developing a grammar checker for Swedish
Antti Arppe
Proceedings of the 12th Nordic Conference of Computational Linguistics (NODALIDA 1999)
