Claire Bowern
2026
Linguistically Informed Tokenization Improves ASR for Underresourced Languages
Massimo Marie Daul | Alessio Tosolini | Claire Bowern
Proceedings of the Fifth Workshop on NLP Applications to Field Linguistics
Automatic speech recognition (ASR) is a crucial tool for linguists aiming to perform a variety of language documentation tasks. However, modern ASR systems rely on data-hungry transformer architectures, rendering them generally unusable for underresourced languages. We fine-tune a wav2vec 2.0 ASR model on Yanyhangu, an Indigenous Australian language, comparing the effects of phonemic and orthographic tokenization strategies on performance. In parallel, we explore ASR’s viability as a tool in a language documentation pipeline. We find that a linguistically informed phonemic tokenization system substantially improves word error rate (WER) and character error rate (CER) compared to a baseline orthographic tokenization scheme. Finally, we show that hand-correcting the output of an ASR model is much faster than hand-transcribing audio from scratch, demonstrating that ASR can provide significant assistance for underresourced language documentation.
2025
Multilingual MFA: Forced Alignment on Low-Resource Related Languages
Alessio Tosolini | Claire Bowern
Proceedings of the Eighth Workshop on the Use of Computational Methods in the Study of Endangered Languages
We compare the outcomes of multilingual and crosslingual training for related and unrelated Australian languages with similar phonological inventories. We use the Montreal Forced Aligner to train acoustic models from scratch and adapt a large English model, evaluating results against seen data, unseen data (seen language), and unseen data and language. Results indicate benefits of adapting the English baseline model for previously unseen languages.
2023
FileLingR: An R Script validation tool for depositors and users of digital language collections
Irene Yi | Claire Bowern
Proceedings of the Sixth Workshop on the Use of Computational Methods in the Study of Endangered Languages
2019
Semantic Change and Semantic Stability: Variation is Key
Claire Bowern
Proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change
I survey some recent approaches to studying change in the lexicon, particularly change in meaning across phylogenies. I briefly sketch an evolutionary approach to language change and point out some issues in recent approaches to studying semantic change that rely on temporally stratified word embeddings. I draw illustrations from lexical cognate models in Pama-Nyungan to identify meaning classes most appropriate for lexical phylogenetic inference, particularly highlighting the importance of variation in studying change over time.