The taste of IPA: Towards open-vocabulary keyword spotting and forced alignment in any language

Jian Zhu, Changbing Yang, Farhan Samir, Jahurul Islam


Abstract
In this project, we demonstrate that phoneme-based models for speech processing can achieve strong crosslinguistic generalizability to unseen languages. We curated the IPAPACK, a massively multilingual speech corpora with phonemic transcriptions, encompassing more than 115 languages from diverse language families, selectively checked by linguists. Based on the IPAPACK, we propose CLAP-IPA, a multi-lingual phoneme-speech contrastive embedding model capable of open-vocabulary matching between arbitrary speech signals and phonemic sequences. The proposed model was tested on 95 unseen languages, showing strong generalizability across languages. Temporal alignments between phonemes and speech signals also emerged from contrastive training, enabling zeroshot forced alignment in unseen languages. We further introduced a neural forced aligner IPA-ALIGNER by finetuning CLAP-IPA with the Forward-Sum loss to learn better phone-to-audio alignment. Evaluation results suggest that IPA-ALIGNER can generalize to unseen languages without adaptation.
Anthology ID:
2024.naacl-long.43
Volume:
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Kevin Duh, Helena Gomez, Steven Bethard
Venue:
NAACL
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
750–772
Language:
URL:
https://aclanthology.org/2024.naacl-long.43
DOI:
Bibkey:
Cite (ACL):
Jian Zhu, Changbing Yang, Farhan Samir, and Jahurul Islam. 2024. The taste of IPA: Towards open-vocabulary keyword spotting and forced alignment in any language. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 750–772, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
The taste of IPA: Towards open-vocabulary keyword spotting and forced alignment in any language (Zhu et al., NAACL 2024)
Copy Citation:
PDF:
https://aclanthology.org/2024.naacl-long.43.pdf
Copyright:
 2024.naacl-long.43.copyright.pdf