Recycle Your Wav2Vec2 Codebook: A Speech Perceiver for Keyword Spotting

Guillermo Cámbara, Jordi Luque, Mireia Farrús


Abstract
Speech information in a pretrained wav2vec2.0 model is usually leveraged through its encoder, which comprises at least 95M parameters and is therefore ill-suited to small-footprint Keyword Spotting. In this work, we show an efficient way of profiting from wav2vec2.0’s linguistic knowledge by recycling the phonetic information encoded in its latent codebook, which is typically discarded after pretraining. We do so by transferring the codebook as weights for the latent bottleneck of a Keyword Spotting Perceiver, thus initializing the model with phonetic embeddings from the start. The Perceiver design relies on cross-attention between these embeddings and the input data to generate better representations. Our method delivers accuracy gains compared to random initialization, at no latency cost. Furthermore, we show that the phonetic embeddings can easily be downsampled with k-means clustering, speeding up inference by 3.5 times at only a slight accuracy penalty.
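The pipeline the abstract describes (reuse a pretrained codebook as Perceiver latents, optionally compress it with k-means, then cross-attend over input speech features) can be sketched as below. This is a minimal illustrative sketch: the codebook shape, the number of clusters k, and the feature dimensions are assumptions for demonstration, not the paper's exact configuration, and the codebook here is random instead of loaded from a real wav2vec2.0 checkpoint.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a pretrained wav2vec2.0 quantizer codebook.
# Shapes are illustrative; real entries would be loaded from a checkpoint.
codebook = rng.standard_normal((640, 384)).astype(np.float32)

def kmeans_downsample(vectors, k, iters=20, seed=0):
    """Compress the codebook to k centroids with plain k-means,
    a sketch of the paper's downsampling step."""
    r = np.random.default_rng(seed)
    centroids = vectors[r.choice(len(vectors), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign each vector to its nearest centroid (squared L2 distance).
        d = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(axis=1)
        # Recompute centroids; keep the old one if a cluster is empty.
        for j in range(k):
            members = vectors[assign == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids

def cross_attention(latents, inputs):
    """Single cross-attention step: codebook-initialized latents act as
    queries attending over input speech features (keys and values)."""
    d = latents.shape[-1]
    scores = latents @ inputs.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ inputs  # one refined vector per latent

# Downsample 640 codebook entries to 128 latents (k is an assumption),
# then cross-attend over ~1 s of hypothetical speech features.
latents = kmeans_downsample(codebook, k=128)
speech = rng.standard_normal((49, 384)).astype(np.float32)
out = cross_attention(latents, speech)
print(latents.shape, out.shape)  # (128, 384) (128, 384)
```

Because the latent array is smaller than the input sequence, cross-attention cost scales with the number of latents, which is why shrinking the codebook with k-means directly speeds up inference.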
Anthology ID:
2022.coling-1.626
Volume:
Proceedings of the 29th International Conference on Computational Linguistics
Month:
October
Year:
2022
Address:
Gyeongju, Republic of Korea
Venue:
COLING
Publisher:
International Committee on Computational Linguistics
Pages:
7166–7170
URL:
https://aclanthology.org/2022.coling-1.626
Cite (ACL):
Guillermo Cámbara, Jordi Luque, and Mireia Farrús. 2022. Recycle Your Wav2Vec2 Codebook: A Speech Perceiver for Keyword Spotting. In Proceedings of the 29th International Conference on Computational Linguistics, pages 7166–7170, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
Cite (Informal):
Recycle Your Wav2Vec2 Codebook: A Speech Perceiver for Keyword Spotting (Cámbara et al., COLING 2022)
PDF:
https://aclanthology.org/2022.coling-1.626.pdf
Data
Speech Commands