Alessio Tosolini

2026

Linguistically Informed Tokenization Improves ASR for Underresourced Languages
Massimo Marie Daul | Alessio Tosolini | Claire Bowern
Proceedings of the Fifth Workshop on NLP Applications to Field Linguistics

Automatic speech recognition (ASR) is a crucial tool for linguists aiming to perform a variety of language documentation tasks. However, modern ASR systems rely on data-hungry transformer architectures, rendering them generally unusable for underresourced languages. We fine-tune a wav2vec 2.0 ASR model on Yanyhangu, an Indigenous Australian language, comparing the effects of phonemic and orthographic tokenization strategies on performance. In parallel, we explore ASR’s viability as a tool in a language documentation pipeline. We find that a linguistically informed phonemic tokenization system substantially improves word error rate (WER) and character error rate (CER) compared to a baseline orthographic tokenization scheme. Finally, we show that hand-correcting the output of an ASR model is much faster than hand-transcribing audio from scratch, demonstrating that ASR can provide significant assistance for underresourced language documentation.

2025

Multilingual MFA: Forced Alignment on Low-Resource Related Languages
Alessio Tosolini | Claire Bowern
Proceedings of the Eighth Workshop on the Use of Computational Methods in the Study of Endangered Languages

We compare the outcomes of multilingual and crosslingual training for related and unrelated Australian languages with similar phonological inventories. We use the Montreal Forced Aligner to train acoustic models from scratch and adapt a large English model, evaluating results against seen data, unseen data (seen language), and unseen data and language. Results indicate benefits of adapting the English baseline model for previously unseen languages.

Analyzing the Linguistic Priors of Language Models with Synthetic Languages
Alessio Tosolini | Terra Blevins
Proceedings of the 7th Workshop on Research in Computational Linguistic Typology and Multilingual NLP

While modern language model architectures are often assumed to be language-agnostic, there is limited evidence as to whether these models actually process the wide diversity of natural languages equally well. We investigate this question by analyzing how well LMs learn carefully constructed artificial languages with varying degrees of verbal complexity, ranging from simple paradigms to paradigms covering far more verb classes than occur in natural languages. Rather than learning all languages equally efficiently, models trained on these languages show strict preferences for processing simpler languages. Furthermore, while some observed behaviors mimic human linguistic priors, we find that they indicate the model memorizes its training data rather than generalizing from it.
