Maxime Poli
2024
Improving Spoken Language Modeling with Phoneme Classification: A Simple Fine-tuning Approach
Maxime Poli
|
Emmanuel Chemla
|
Emmanuel Dupoux
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Recent progress in Spoken Language Modeling has shown that learning language directly from speech is feasible. Generating speech through a pipeline that operates at the text level typically loses nuances, intonations, and non-verbal vocalizations. Modeling directly from speech opens up the path to more natural and expressive systems. On the other hand, speech-only systems require up to three orders of magnitude more data to catch up to their text-based counterparts in terms of their semantic abilities. We show that fine-tuning speech representation models on phoneme classification leads to more context-invariant representations, and language models trained on these units achieve comparable lexical comprehension to ones trained on hundred times more data.