2024
Perception of Phonological Assimilation by Neural Speech Recognition Models
Charlotte Pouw | Marianne de Heer Kloots | Afra Alishahi | Willem Zuidema
Computational Linguistics, Volume 50, Issue 4 - December 2024
Human listeners effortlessly compensate for phonological changes during speech perception, often unconsciously inferring the intended sounds. For example, listeners infer the underlying /n/ when hearing an utterance such as “clea[m] pan”, where [m] arises from place assimilation to the following labial [p]. This article explores how the neural speech recognition model Wav2Vec2 perceives assimilated sounds, and identifies the linguistic knowledge that is implemented by the model to compensate for assimilation during Automatic Speech Recognition (ASR). Using psycholinguistic stimuli, we systematically analyze how various linguistic context cues influence compensation patterns in the model’s output. Complementing these behavioral experiments, our probing experiments indicate that the model shifts its interpretation of assimilated sounds from their acoustic form to their underlying form in its final layers. Finally, our causal intervention experiments suggest that the model relies on minimal phonological context cues to accomplish this shift. These findings represent a step towards better understanding the similarities and differences in phonological processing between neural ASR models and humans.
GTNC: A Many-To-One Dataset of Google Translations from NewsCrawl
Damiaan Reijnaers | Charlotte Pouw
Proceedings of the 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP
This paper lays the groundwork for initiating research into Source Language Identification: the task of identifying the original language of a machine-translated text. We contribute a dataset of translations from a typologically diverse spectrum of languages into English and use it to set initial baselines for this novel task.
2023
“Geen makkie”: Interpretable Classification and Simplification of Dutch Text Complexity
Eliza Hobo | Charlotte Pouw | Lisa Beinborn
Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023)
An inclusive society needs to facilitate access to information for all of its members, including citizens with low literacy and with non-native language skills. We present an approach to assess Dutch text complexity on the sentence level and conduct an interpretability analysis to explore the link between neural models and linguistic complexity features. Building on these findings, we develop the first contextual lexical simplification model for Dutch and publish a pilot dataset for evaluation. We go beyond previous work, which primarily targeted lexical substitution, and propose strategies for adjusting the model’s linguistic register to generate simpler candidates. Our results indicate that continual pre-training and multi-task learning with conceptually related tasks are promising directions for ensuring the simplicity of the generated substitutions.
ChapGTP, ILLC’s Attempt at Raising a BabyLM: Improving Data Efficiency by Automatic Task Formation
Jaap Jumelet | Michael Hanna | Marianne de Heer Kloots | Anna Langedijk | Charlotte Pouw | Oskar van der Wal
Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning
Cross-Lingual Transfer of Cognitive Processing Complexity
Charlotte Pouw | Nora Hollenstein | Lisa Beinborn
Findings of the Association for Computational Linguistics: EACL 2023
When humans read a text, their eye movements are influenced by the structural complexity of the input sentences. This cognitive phenomenon holds across languages and recent studies indicate that multilingual language models utilize structural similarities between languages to facilitate cross-lingual transfer. We use sentence-level eye-tracking patterns as a cognitive indicator for structural complexity and show that the multilingual model XLM-RoBERTa can successfully predict varied patterns for 13 typologically diverse languages, despite being fine-tuned only on English data. We quantify the sensitivity of the model to structural complexity and distinguish a range of complexity characteristics. Our results indicate that the model develops a meaningful bias towards sentence length but also integrates cross-lingual differences. We conduct a control experiment with randomized word order and find that the model seems to additionally capture more complex structural information.