Tonguescape: Exploring Language Models Understanding of Vowel Articulation

Haruki Sakajo, Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe


Abstract
Vowels are primarily characterized by tongue position. Humans have discovered these features of vowel articulation through their own experience and through explicit, objective observation, for example using MRI. With this knowledge and our experience, we can explain and understand the relationship between tongue positions and vowels, which helps language learners with pronunciation. Language models (LMs) are trained on large amounts of data that include text from linguistic and medical fields, and our preliminary studies indicate that an LM is able to explain the pronunciation mechanisms of vowels. However, it is unclear whether multi-modal LMs, such as vision LMs, align textual information with visual information. One question arises: do LMs associate real tongue positions with vowel articulation? In this study, we created video and image datasets from an existing real-time MRI dataset and investigated whether LMs can understand vowel articulation based on tongue positions using vision-based information. Our findings suggest that LMs show potential for understanding vowels and tongue positions when reference examples are provided, but have difficulty without them. Our code for dataset building is available on GitHub.
Anthology ID: 2025.naacl-long.627
Volume: Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Month: April
Year: 2025
Address: Albuquerque, New Mexico
Editors: Luis Chiruzzo, Alan Ritter, Lu Wang
Venue: NAACL
Publisher: Association for Computational Linguistics
Pages: 12605–12619
URL: https://aclanthology.org/2025.naacl-long.627/
Cite (ACL): Haruki Sakajo, Yusuke Sakai, Hidetaka Kamigaito, and Taro Watanabe. 2025. Tonguescape: Exploring Language Models Understanding of Vowel Articulation. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 12605–12619, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal): Tonguescape: Exploring Language Models Understanding of Vowel Articulation (Sakajo et al., NAACL 2025)
PDF: https://aclanthology.org/2025.naacl-long.627.pdf