Self-supervised speech representations display some human-like cross-linguistic perceptual abilities

Joselyn Rodriguez, Kamala Sreepada, Ruolan Leslie Famularo, Sharon Goldwater, Naomi Feldman


Abstract
State-of-the-art models in automatic speech recognition have shown remarkable improvements due to modern self-supervised (SSL) transformer-based architectures such as wav2vec 2.0 (Baevski et al., 2020). However, how these models encode phonetic information is still not well understood. We explore whether SSL speech models display a linguistic property that characterizes human speech perception: language specificity. We show that while wav2vec 2.0 displays an overall language-specificity effect when tested on Hindi vs. English, it does not resemble human speech perception when tested on finer-grained differences among Hindi speech contrasts.
Anthology ID:
2024.conll-1.35
Volume:
Proceedings of the 28th Conference on Computational Natural Language Learning
Month:
November
Year:
2024
Address:
Miami, FL, USA
Editors:
Libby Barak, Malihe Alikhani
Venue:
CoNLL
Publisher:
Association for Computational Linguistics
Pages:
458–463
URL:
https://aclanthology.org/2024.conll-1.35
Cite (ACL):
Joselyn Rodriguez, Kamala Sreepada, Ruolan Leslie Famularo, Sharon Goldwater, and Naomi Feldman. 2024. Self-supervised speech representations display some human-like cross-linguistic perceptual abilities. In Proceedings of the 28th Conference on Computational Natural Language Learning, pages 458–463, Miami, FL, USA. Association for Computational Linguistics.
Cite (Informal):
Self-supervised speech representations display some human-like cross-linguistic perceptual abilities (Rodriguez et al., CoNLL 2024)
PDF:
https://aclanthology.org/2024.conll-1.35.pdf