2022
Gi2Pi Rule-based, index-preserving grapheme-to-phoneme transformations
Aidan Pine | Patrick William Littell | Eric Joanis | David Huggins-Daines | Christopher Cox | Fineen Davis | Eddie Antonio Santos | Shankhalika Srikanth | Delasie Torkornoo | Sabrina Yu
Proceedings of the Fifth Workshop on the Use of Computational Methods in the Study of Endangered Languages
This paper describes the motivation and implementation details for a rule-based, index-preserving grapheme-to-phoneme engine ‘Gi2Pi’ implemented in pure Python and released under the open-source MIT license. The engine and interface have been designed to prioritize the developer experience of potential contributors without requiring a high level of programming knowledge. ‘Gi2Pi’ already provides mappings for 30 (mostly Indigenous) languages, and the package is accompanied by a web-based interactive development environment, a RESTful API, and extensive documentation to encourage the addition of more mappings in the future. We also present three downstream applications of ‘Gi2Pi’ and show results of a preliminary evaluation.
2021
On the Computational Modelling of Michif Verbal Morphology
Fineen Davis | Eddie A. Santos | Heather Souter
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume
This paper presents a finite-state computational model of the verbal morphology of Michif. Michif, the official language of the Métis peoples, is a uniquely mixed language with Algonquian and French origins. It is spoken across the Métis homelands in what is now called Canada and the United States, but it is highly endangered with fewer than 100 speakers. The verbal morphology is remarkably complex, as the already polysynthetic Algonquian patterns are combined with French elements and unique morpho-phonological interactions. The model presented in this paper, LI VERB KAA-OOSHITAHK DI MICHIF, handles this complexity by using a series of composed finite-state transducers to model the concatenative morphology and phonological rule alternations that are unique to Michif. Such a rule-based approach is necessary as there is insufficient language data for an approach that uses machine learning. A language model such as LI VERB KAA-OOSHITAHK DI MICHIF furthers the goals of Indigenous computational linguistics in Canada while also supporting the creation of tools for documentation, education, and revitalization that are desired by the Métis community.
2020
The Indigenous Languages Technology project at NRC Canada: An empowerment-oriented approach to developing language software
Roland Kuhn | Fineen Davis | Alain Désilets | Eric Joanis | Anna Kazantseva | Rebecca Knowles | Patrick Littell | Delaney Lothian | Aidan Pine | Caroline Running Wolf | Eddie Santos | Darlene Stewart | Gilles Boulianne | Vishwa Gupta | Brian Maracle Owennatékha | Akwiratékha’ Martin | Christopher Cox | Marie-Odile Junker | Olivia Sammons | Delasie Torkornoo | Nathan Thanyehténhas Brinklow | Sara Child | Benoît Farley | David Huggins-Daines | Daisy Rosenblum | Heather Souter
Proceedings of the 28th International Conference on Computational Linguistics
This paper surveys the first, three-year phase of a project at the National Research Council of Canada that is developing software to assist Indigenous communities in Canada in preserving their languages and extending their use. The project aimed to work within the empowerment paradigm, where collaboration with communities and fulfillment of their goals is central. Since many of the technologies we developed were in response to community needs, the project ended up as a collection of diverse subprojects, including the creation of a sophisticated framework for building verb conjugators for highly inflectional polysynthetic languages (such as Kanyen’kéha, in the Iroquoian language family), release of what is probably the largest available corpus of sentences in a polysynthetic language (Inuktut) aligned with English sentences and experiments with machine translation (MT) systems trained on this corpus, free online services based on automatic speech recognition (ASR) for easing the transcription bottleneck for recordings of speech in Indigenous languages (and other languages), software for implementing text prediction and read-along audiobooks for Indigenous languages, and several other subprojects.
Design and evaluation of a smartphone keyboard for Plains Cree syllabics
Eddie Antonio Santos | Atticus Harrigan
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)
Plains Cree is a less-resourced language in Canada. To promote its usage online, we describe previous keyboard layouts for typing Plains Cree syllabics on smartphones. We describe our own solution whose development was guided by ergonomics research and corpus statistics. We then describe a case study in which three participants used a previous layout and our own, and we collected quantitative and qualitative data. We conclude that, despite observing accuracy improvements in user testing, introducing a brand new paradigm for typing Plains Cree syllabics may not be ideal for the community.
2019
OCR evaluation tools for the 21st century
Eddie Antonio Santos
Proceedings of the 3rd Workshop on the Use of Computational Methods in the Study of Endangered Languages Volume 1 (Papers)
2016
Training & Quality Assessment of an Optical Character Recognition Model for Northern Haida
Isabell Hubert | Antti Arppe | Jordan Lachler | Eddie A. Santos
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
We present our work on the creation of the first optical character recognition (OCR) model for Northern Haida, also known as Masset or Xaad Kil, a nearly extinct First Nations language spoken in the Haida Gwaii archipelago in British Columbia, Canada. We address the challenges of training an OCR model for a language with an extensive, non-standard Latin character set as follows: (1) We have compared various training approaches and present the results of practical analyses to maximize recognition accuracy and minimize manual labor. An approach using just one or two pages of Source Images directly performed better than the Image Generation approach, and better than models based on three or more pages. Analyses also suggest that a character’s frequency is directly correlated with its recognition accuracy. (2) We present an overview of the OCR accuracy analysis tools currently available. (3) We have ported the once de facto standard OCR accuracy tools so that they can cope with Unicode input. Our work adds to a growing body of research on OCR for particularly challenging character sets, and contributes to creating the largest electronic corpus for this severely endangered language.