Linguistically Informed Tokenization Improves ASR for Underresourced Languages

Massimo Marie Daul, Alessio Tosolini, Claire Bowern


Abstract
Automatic speech recognition (ASR) is a crucial tool for linguists aiming to perform a variety of language documentation tasks. However, modern ASR systems rely on data-hungry transformer architectures, rendering them generally unusable for underresourced languages. We fine-tune a wav2vec 2.0 ASR model on Yanyhangu, an Indigenous Australian language, comparing the effects of phonemic and orthographic tokenization strategies on performance. In parallel, we explore ASR’s viability as a tool in a language documentation pipeline. We find that a linguistically informed phonemic tokenization system substantially reduces word error rate (WER) and character error rate (CER) relative to a baseline orthographic tokenization scheme. Finally, we show that hand-correcting the output of an ASR model is much faster than hand-transcribing audio from scratch, demonstrating that ASR can provide significant assistance for underresourced language documentation.
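To illustrate the contrast the abstract describes, here is a minimal sketch of the two tokenization strategies: an orthographic baseline that emits one token per character, and a phonemic tokenizer that maps multi-character graphemes to single tokens via greedy longest-match. The digraph inventory below is hypothetical (digraphs common in Australian orthographies), not the paper's actual Yanyhangu phoneme set.

```python
# Hypothetical digraphs standing in for multi-character phoneme spellings;
# the paper's actual inventory is not reproduced here.
DIGRAPHS = {"ny", "ng", "rr", "th", "rl"}

def orthographic_tokenize(word):
    """Baseline: one token per orthographic character."""
    return list(word)

def phonemic_tokenize(word):
    """Greedy longest-match: treat each known digraph as one token."""
    tokens, i = [], 0
    while i < len(word):
        if word[i:i + 2] in DIGRAPHS:
            tokens.append(word[i:i + 2])  # digraph -> single phoneme token
            i += 2
        else:
            tokens.append(word[i])        # plain character token
            i += 1
    return tokens

print(orthographic_tokenize("nganyi"))  # ['n', 'g', 'a', 'n', 'y', 'i']
print(phonemic_tokenize("nganyi"))      # ['ng', 'a', 'ny', 'i']
```

Under a CTC objective (as in wav2vec 2.0 fine-tuning), the phonemic vocabulary gives the model one output unit per sound rather than forcing it to learn that, e.g., "n" + "g" jointly spell a single nasal.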
Anthology ID:
2026.fieldmatters-1.4
Volume:
Proceedings of the Fifth Workshop on NLP Applications to Field Linguistics
Month:
March
Year:
2026
Address:
Rabat, Morocco
Venues:
FieldMatters | WS
Publisher:
Association for Computational Linguistics
Pages:
31–37
URL:
https://aclanthology.org/2026.fieldmatters-1.4/
Cite (ACL):
Massimo Marie Daul, Alessio Tosolini, and Claire Bowern. 2026. Linguistically Informed Tokenization Improves ASR for Underresourced Languages. In Proceedings of the Fifth Workshop on NLP Applications to Field Linguistics, pages 31–37, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal):
Linguistically Informed Tokenization Improves ASR for Underresourced Languages (Daul et al., FieldMatters 2026)
PDF:
https://aclanthology.org/2026.fieldmatters-1.4.pdf