Workshop on the Use of Computational Methods in the Study of Endangered Languages (2024)


Proceedings of the Seventh Workshop on the Use of Computational Methods in the Study of Endangered Languages
Sarah Moeller | Godfred Agyapong | Antti Arppe | Aditi Chaudhary | Shruti Rijhwani | Christopher Cox | Ryan Henke | Alexis Palmer | Daisy Rosenblum | Lane Schwartz

Cloud-based Platform for Indigenous Language Sound Education
Min Chen | Chris Lee | Naatosi Fish | Mizuki Miyashita | James Randall

Blackfoot is challenging for English-speaking instructors and learners to acquire because it exhibits unique pitch patterns. This study presents MeTILDA (Melodic Transcription in Language Documentation and Application) as a solution for teaching pitch patterns distinct from those of English. Specifically, we explore ways to improve data visualization through a visualized pronunciation teaching guide called Pitch Art. The working materials can be downloaded or stored in the cloud for further use and collaboration. These features are intended to help teachers develop curricula for teaching pronunciation and to provide students with an interactive and integrative learning environment in which to better understand the Blackfoot language and its pronunciation.

Technology and Language Revitalization: A Roadmap for the Mvskoke Language
Julia Mainzinger

This paper is a discussion of how NLP can come alongside community efforts to aid in revitalizing the Mvskoke language. Mvskoke is a language indigenous to the southeastern United States that has seen an increase in language revitalization efforts in the last few years. This paper presents an overview of available resources in Mvskoke, an exploration of relevant NLP tasks and related work in endangered language contexts, and applications to language revitalization.

Investigating the productivity of Passamaquoddy medials: A computational approach
James Roberts

Little is known about medials in Passamaquoddy, which appear to be involved in the construction of verb stems in the language. Investigating the productivity of such morphemes using traditional fieldwork methods is a difficult undertaking that can be made easier with computational methods. I first generated a list of possible verb stems using a simple Python script, then compared this list against Passamaquoddy text corpora to see how many of these tokens were attested. If a given medial is productive, we should expect to see it in a large proportion of the possible verb stems that include said medial. If this assumption is correct, the corpus analysis will be a key indicator in determining the productivity of individual medials.
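The generate-then-check procedure described in this abstract can be illustrated with a minimal sketch. It is not the author's script: the function names, the initial + medial + final template, and the substring-based attestation test are assumptions made for illustration, and the strings below are artificial placeholders, not Passamaquoddy morphemes.

```python
from itertools import product


def candidate_stems(initials, medials, finals):
    """Map each medial to the set of initial+medial+final strings built with it.

    Assumption: a stem is a plain concatenation of the three parts,
    with no morphophonological adjustment.
    """
    return {
        medial: {i + medial + f for i, f in product(initials, finals)}
        for medial in medials
    }


def attestation_rates(candidates, corpus_tokens):
    """Return, per medial, the fraction of its candidate stems found in the corpus.

    Assumption: a candidate counts as attested if it occurs as a substring
    of any corpus token, since stems usually surface inside inflected forms.
    """
    tokens = set(corpus_tokens)
    rates = {}
    for medial, stems in candidates.items():
        attested = {s for s in stems if any(s in t for t in tokens)}
        rates[medial] = len(attested) / len(stems) if stems else 0.0
    return rates


# Artificial placeholder strings (hypothetical, for illustration only).
initials = ["aa", "bb"]
medials = ["xx", "yy"]
finals = ["m", "n"]
corpus = ["qaaxxmz", "bbxxn", "aayym", "zzz"]

rates = attestation_rates(candidate_stems(initials, medials, finals), corpus)
print(rates)
```

Under the abstract's assumption, a medial with a higher attestation rate would be a stronger candidate for productivity. A real analysis would also need to handle sound changes at morpheme boundaries, which this sketch ignores.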

T is for Treu, but how do you pronounce that? Using C-LARA to create phonetic texts for Kanak languages
Pauline Welby | Fabrice Wacalie | Manny Rayner | ChatGPT-4 C-LARA-Instance

In Drehu, a language of the indigenous Kanak people of New Caledonia, the word treu ‘moon’ is pronounced [tʃe.u]; but, even if they hear the word, the spelling pulls French speakers to a spurious pronunciation [tʁø]. We implement a strategy to mitigate the influence of such orthographic conflicts, while retaining the benefits of written input on vocabulary learning. We present text in “phonetized” form, where words are broken down into components associated with mnemonically presented phonetic values, adapting features from the “Comment ça se prononce ?” multilingual phonetizer. We present an exploratory project where we used the ChatGPT-based Learning And Reading Assistant (C-LARA) to implement a version of the phonetizer strategy, outlining how the AI-engineered codebase and help from the AI made it easy to add the necessary extensions. We describe two proof-of-concept texts for learners produced using the platform, a Drehu alphabet book and a Drehu version of “The (North) Wind and the Sun”; both texts include native-speaker recorded audio, pronunciation respellings based on French orthography, and AI-generated illustrations.

Machine-in-the-Loop with Documentary and Descriptive Linguists
Sarah Moeller | Antti Arppe

This paper describes a curriculum for teaching linguists how to apply a machine-in-the-loop (MitL) approach to documentary and descriptive tasks. It also shares observations about the learning participants, who are primarily non-computational linguists, and how they interact with the MitL approach. We found that they prefer cleaning the training data over increasing it, then proceed to reanalyze their analytical decisions, before finally undertaking small actions that emphasize analytical strategies. Overall, participants display an understanding of the curriculum, which covers fundamental concepts of machine learning and statistical modeling.

Automatic Transcription of Grammaticality Judgements for Language Documentation
Éric Le Ferrand | Emily Prud’hommeaux

Descriptive linguistics is a sub-field of linguistics that involves the collection and annotation of language resources to describe linguistic phenomena. The transcription of these resources is often described as a tedious task, and Automatic Speech Recognition (ASR) has frequently been employed to support this process. However, the typical research approach to ASR in documentary linguistics often captures only a subset of the field’s diverse reality. In this paper, we focus specifically on one type of data known as grammaticality judgment elicitation in the context of documenting Kréyòl Gwadloupéyen. We show that only a few minutes of speech are enough to fine-tune a model originally trained on French to transcribe segments in Kréyòl.

Fitting a Square Peg into a Round Hole: Creating a UniMorph dataset of Kanien’kéha Verbs
Anna Kazantseva | Akwiratékha Martin | Karin Michelson | Jean-Pierre Koenig

This paper describes efforts to annotate a dataset of verbs in the Iroquoian language Kanien’kéha (a.k.a. Mohawk) using the UniMorph schema (Batsuren et al. 2022a). It is based on the output of a symbolic model, a hand-built verb conjugator. Morphological constituents of each verb are automatically annotated with UniMorph tags. Overall, the process was smooth, but some central features of the language did not fall neatly into the schema, which resulted in a large number of custom tags and a somewhat ad hoc mapping process. We think the same difficulties are likely to arise for other Iroquoian languages and perhaps other North American language families. This paper describes our decision-making process with respect to Kanien’kéha and reports preliminary results of morphological induction experiments using the dataset.

Data-mining and Extraction: the gold rush of AI on Indigenous Languages
Marie-Odile Junker

The goal of this paper is to start a discussion on the topic of Data mining and Extraction of Indigenous Language data, describing recent events that took place within the Algonquian Dictionaries and Language Resources common infrastructure. We raise questions about ethics, social context, vulnerability, responsibility, and societal benefits and concerns in the age of generative AI.

Looking within the self: Investigating the Impact of Data Augmentation with Self-training on Automatic Speech Recognition for Hupa
Nitin Venkateswaran | Zoey Liu

We investigate the performance of state-of-the-art neural ASR systems in transcribing audio recordings for Hupa, a critically endangered language of the Hoopa Valley Tribe. We also explore the impact on ASR performance when augmenting a small dataset of gold-standard high-quality transcriptions with a) a larger dataset with transcriptions of lower quality, and b) model-generated transcriptions in a self-training approach. An evaluation of both data augmentation approaches shows that the self-training approach is competitive, producing better WER scores than models trained with no additional data and not lagging far behind models trained with additional lower-quality manual transcriptions instead: the deterioration in WER score is just 4.85 points when all the additional data is used in experiments with the best performing system, Wav2Vec. These findings have encouraging implications for the use of ASR systems in transcription and language documentation efforts for the Hupa language.
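For readers unfamiliar with the metric, word error rate (WER) is conventionally the word-level edit distance between a reference transcription and a system hypothesis, divided by the number of reference words. The sketch below is a generic illustration of that definition, not the authors' evaluation code; the function name and the whitespace tokenization are assumptions, and the example strings are placeholders rather than Hupa data.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly,
    with no text normalization.
    """
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Placeholder strings: one substitution and one deletion over four words.
wer = word_error_rate("a b c d", "a x c")
print(f"{wer:.2%}")
```

If WER is reported as a percentage, a difference "in points" between two systems is the difference between two such values multiplied by 100.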

Creating Digital Learning and Reference Resources for Southern Michif
Heather Souter | Olivia Sammons | David Huggins Daines

Minority and Indigenous languages are often under-documented and under-resourced. Where such resources do exist, particularly in the form of legacy materials, they are often inaccessible to learners and educators involved in revitalization efforts, whether due to the limitations of their original formats or the structure of their contents. Digitizing such resources and making them available on a variety of platforms is one step in overcoming these barriers. This is a major undertaking which requires significant expertise at the intersection of documentary linguistics, computational linguistics, and software development, and must be done while walking alongside speakers and language specialists in the community. We discuss the particular strategies and challenges involved in the development of one such resource, and make recommendations for future projects with a similar goal of mobilizing legacy language resources.

MunTTS: A Text-to-Speech System for Mundari
Varun Gumma | Rishav Hada | Aditya Yadavalli | Pamir Gogoi | Ishani Mondal | Vivek Seshadri | Kalika Bali

We present MunTTS, an end-to-end text-to-speech (TTS) system specifically for Mundari, a low-resource Indian language of the Austroasiatic family. Our work addresses the gap in linguistic technology for underrepresented languages by collecting and processing data to build a speech synthesis system. We begin our study by gathering a substantial dataset of Mundari text and speech and training end-to-end speech models. We also delve into the methods used for training our models, ensuring they are efficient and effective despite the data constraints. We evaluate our system with native speakers and objective metrics, demonstrating its potential as a tool for preserving and promoting the Mundari language in the digital age.

End-to-End Speech Recognition for Endangered Languages of Nepal
Marieke Meelen | Alexander O’Neill | Rolando Coto-Solano

This paper presents three experiments to test the most effective and efficient ASR pipeline to facilitate the documentation and preservation of endangered languages, which are often extremely low-resourced. With data from two languages in Nepal, Dzardzongke and Newar, we show that model improvements are different for different amounts of data, and that transfer learning as well as a range of modifications (e.g. normalising amplitude and pitch) can be effective, but that a consistently standardised orthography as NLP input and post-training dictionary corrections improve results even more.

Akha, Dara-ang, Karen, Khamu, Mlabri and Urak Lawoi’ language minorities’ subjective perception of their languages and the outlook for development of digital tools
Joanna Dolinska | Shekhar Nayak | Sumittra Suraratdecha

Multilingualism is deeply rooted in the sociopolitical history of Thailand. Some minority language communities entered Thai territory a few decades ago, while the families of other minority speakers have been living in Thailand for at least several generations. The authors of this article address the question of how Akha, Dara-ang, Karen, Khamu, Mlabri and Urak Lawoi’ language speakers perceive the current situation of their languages and whether they see the need for the development of digital tools for documentation, revitalization and daily use of their languages. This objective is complemented by a discussion of the feasibility of developing such tools for some of the above-mentioned languages and of the motivation of their speakers to participate in this process. Furthermore, this article highlights the challenges associated with developing digital tools for these low-resource languages and outlines the standards researchers must adhere to in conceptualizing the development of such tools, collecting data, and engaging with the language communities throughout the collaborative process.