Multilingual Self-supervised Visually Grounded Speech Models

Huynh Phuong Thanh Nguyen, Sakriani Sakti


Abstract
Developing a multilingual speech-to-speech translation system poses challenges due to the scarcity of paired speech data across languages, particularly when dealing with unknown and untranscribed languages. However, the semantic representation shared across multiple languages presents an opportunity to build a translation system grounded in images. Recently, researchers have explored methods for aligning bilingual speech as a novel approach to discovering speech pairs from unknown, untranscribed speech using semantic images. These aligned speech pairs can then be used to train speech-to-speech translation systems. Our research builds upon these approaches by expanding to multiple languages and focusing on multimodal, multilingual pair alignment, with multilingual visually grounded speech models as a key component. The objectives of our research are twofold: (1) to create visually grounded speech datasets for English, Japanese, Indonesian, and Vietnamese, and (2) to develop self-supervised visually grounded speech models for these languages. Our experiments demonstrate the feasibility of this approach, showing that associations between speech utterances and images can be retrieved. The results indicate that our multilingual visually grounded speech models yield promising outcomes in representing speech through semantic images across multiple languages.
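The abstract does not detail the model architecture or training objective. As a rough illustration only, the sketch below shows one common way a visually grounded speech model is trained: speech and image features are projected into a shared embedding space and aligned with a contrastive (InfoNCE-style) loss, which then supports speech-to-image and image-to-speech retrieval. All class names, feature dimensions, and hyperparameters here are hypothetical and are not taken from the paper.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SpeechImageAligner(nn.Module):
        # Hypothetical minimal visually grounded speech model: projects speech
        # and image features into a shared embedding space for retrieval.
        def __init__(self, speech_dim=768, image_dim=2048, embed_dim=512):
            super().__init__()
            self.speech_proj = nn.Linear(speech_dim, embed_dim)
            self.image_proj = nn.Linear(image_dim, embed_dim)

        def forward(self, speech_feats, image_feats):
            # Mean-pool speech frames into one utterance vector, then project
            # both modalities and L2-normalize for cosine-similarity retrieval.
            s = F.normalize(self.speech_proj(speech_feats.mean(dim=1)), dim=-1)
            v = F.normalize(self.image_proj(image_feats), dim=-1)
            return s, v

    def contrastive_loss(s, v, temperature=0.07):
        # Similarity of every speech utterance to every image in the batch;
        # matching pairs lie on the diagonal.
        logits = s @ v.t() / temperature
        targets = torch.arange(s.size(0), device=s.device)
        # Symmetric cross-entropy covers both retrieval directions.
        return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

    # Toy usage: 4 utterances (100 frames of 768-d features) paired with 4 images (2048-d).
    model = SpeechImageAligner()
    speech = torch.randn(4, 100, 768)
    images = torch.randn(4, 2048)
    s, v = model(speech, images)
    loss = contrastive_loss(s, v)

At evaluation time, the same normalized embeddings can rank images for a spoken query (and vice versa), which corresponds to the speech-image retrieval the abstract reports.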
Anthology ID:
2024.sigul-1.28
Volume:
Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Maite Melero, Sakriani Sakti, Claudia Soria
Venues:
SIGUL | WS
Publisher:
ELRA and ICCL
Pages:
237–243
URL:
https://aclanthology.org/2024.sigul-1.28
Cite (ACL):
Huynh Phuong Thanh Nguyen and Sakriani Sakti. 2024. Multilingual Self-supervised Visually Grounded Speech Models. In Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024, pages 237–243, Torino, Italia. ELRA and ICCL.
Cite (Informal):
Multilingual Self-supervised Visually Grounded Speech Models (Nguyen & Sakti, SIGUL-WS 2024)
PDF:
https://aclanthology.org/2024.sigul-1.28.pdf