RepCodec: A Speech Representation Codec for Speech Tokenization

Zhichao Huang, Chutong Meng, Tom Ko


Abstract
With the recent rapid growth of large language models (LLMs), discrete speech tokenization has played an important role in injecting speech into LLMs. However, this discretization causes a loss of information, which impairs overall performance. To improve the performance of these discrete speech tokens, we present RepCodec, a novel speech representation codec for semantic speech tokenization. In contrast to audio codecs, which reconstruct the raw audio, RepCodec learns a vector quantization codebook by reconstructing speech representations from speech encoders such as HuBERT or data2vec. Together, the speech encoder, the codec encoder, and the vector quantization codebook form a pipeline for converting speech waveforms into semantic tokens. Extensive experiments illustrate that RepCodec, by virtue of its enhanced information retention capacity, significantly outperforms the widely used k-means clustering approach in both speech understanding and generation. Furthermore, this superiority extends across various speech encoders and languages, affirming the robustness of RepCodec. We believe our method can facilitate large language modeling research on speech processing.
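As a rough illustration of the final stage of the pipeline described in the abstract, the sketch below shows the generic nearest-neighbor lookup that turns continuous frame vectors into discrete token IDs with a vector quantization codebook. It is not RepCodec's implementation: the function names, the toy two-dimensional codebook, and the toy frames are all assumptions made for illustration, and the speech encoder and codec encoder are replaced by hand-written vectors.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each frame to the index of its nearest codebook entry.

    Hypothetical helper: in a real system the frames would come from a
    speech encoder followed by a codec encoder, and the codebook would
    be learned during training.
    """
    tokens = []
    for frame in frames:
        distances = [squared_distance(frame, code) for code in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens


# Toy stand-ins (assumptions): three codebook entries, four frames.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
frames = [[0.1, -0.2], [0.9, 1.2], [-0.8, 0.7], [0.4, 0.4]]

tokens = quantize(frames, codebook)
print(tokens)  # one token ID per frame
```

Each frame is replaced by the index of its closest codebook vector, so a sequence of continuous representations becomes a sequence of integer tokens that a language model can consume.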
Anthology ID:
2024.acl-long.314
Volume:
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
5777–5790
URL:
https://aclanthology.org/2024.acl-long.314/
DOI:
10.18653/v1/2024.acl-long.314
Cite (ACL):
Zhichao Huang, Chutong Meng, and Tom Ko. 2024. RepCodec: A Speech Representation Codec for Speech Tokenization. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5777–5790, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
RepCodec: A Speech Representation Codec for Speech Tokenization (Huang et al., ACL 2024)
PDF:
https://aclanthology.org/2024.acl-long.314.pdf