2025
SpiRit-LM: Interleaved Spoken and Written Language Model
Tu Anh Nguyen, Benjamin Muller, Bokai Yu, Marta R. Costa-jussa, Maha Elbayad, Sravya Popuri, Christophe Ropers, Paul-Ambroise Duquenne, Robin Algayres, Ruslan Mavlyutov, Itai Gat, Mary Williamson, Gabriel Synnaeve, Juan Pino, Benoît Sagot, Emmanuel Dupoux
Transactions of the Association for Computational Linguistics, Volume 13
We introduce SpiRit-LM, a foundation multimodal language model that freely mixes text and speech. Our model is based on a 7B pretrained text language model that we extend to the speech modality by continuously training it on text and speech units. Speech and text sequences are concatenated into a single stream of tokens, and the model is trained with a word-level interleaving method using a small, automatically curated speech-text parallel corpus. SpiRit-LM comes in two versions: a Base version that uses speech phonetic units (HuBERT) and an Expressive version that models expressivity using pitch and style units in addition to the phonetic units. For both versions, the text is encoded with subword BPE tokens. The resulting model displays both the semantic abilities of text models and the expressive abilities of speech models. Additionally, we demonstrate that SpiRit-LM can learn new tasks in a few-shot fashion across modalities (i.e., ASR, TTS, Speech Classification). We make available model weights and inference code.
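To make the interleaving idea concrete, here is a minimal Python sketch of how word-aligned speech units and text tokens could be mixed into a single training stream; the modality markers, tokenizer interface, and alignment format are illustrative assumptions, not the released SpiRit-LM code.

```python
import random

def interleave_words(words, speech_units, word_spans, text_tokenizer, p_switch=0.3):
    """Build one training stream that alternates modality at word boundaries.

    words:        transcript words of the utterance
    speech_units: discrete speech-unit ids (e.g. deduplicated HuBERT units)
    word_spans:   (start, end) index pairs into speech_units, one per word
    """
    stream, in_speech = [], random.random() < 0.5
    stream.append("[SPEECH]" if in_speech else "[TEXT]")
    for word, (start, end) in zip(words, word_spans):
        if random.random() < p_switch:  # randomly switch modality at this word boundary
            in_speech = not in_speech
            stream.append("[SPEECH]" if in_speech else "[TEXT]")
        if in_speech:
            stream.extend(f"<unit_{u}>" for u in speech_units[start:end])
        else:
            stream.extend(text_tokenizer(word))  # subword (BPE) tokens for the word
    return stream
```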
2023
UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units
Hirofumi Inaguma, Sravya Popuri, Ilia Kulikov, Peng-Jen Chen, Changhan Wang, Yu-An Chung, Yun Tang, Ann Lee, Shinji Watanabe, Juan Pino
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Direct speech-to-speech translation (S2ST), in which all components can be optimized jointly, is advantageous over cascaded approaches for achieving fast inference with a simplified pipeline. We present a novel two-pass direct S2ST architecture, UnitY, which first generates textual representations and subsequently predicts discrete acoustic units. We enhance model performance with subword prediction in the first-pass decoder, an improved two-pass decoder architecture and search strategy, and better training regularization. To leverage large amounts of unlabeled text data, we pre-train the first-pass text decoder on a self-supervised denoising auto-encoding task. Experimental evaluations on benchmark datasets at various data scales demonstrate that UnitY outperforms a single-pass speech-to-unit translation model by 2.5-4.2 ASR-BLEU with a 2.83x decoding speed-up. We show that the proposed methods boost performance even when predicting spectrograms in the second pass; however, predicting discrete units still achieves a 2.51x decoding speed-up over spectrogram prediction.
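Below is a rough sketch of the two-pass generation flow described in the abstract: a first-pass decoder produces target-language subwords, whose states condition a second-pass decoder that emits discrete acoustic units for a unit vocoder. The module names and interfaces are assumptions for illustration, not the UnitY implementation.

```python
import torch
import torch.nn as nn

class TwoPassS2ST(nn.Module):
    """Two-pass speech-to-speech translation: text first, then discrete units."""

    def __init__(self, encoder, text_decoder, unit_decoder, vocoder):
        super().__init__()
        self.encoder = encoder            # speech encoder over source audio
        self.text_decoder = text_decoder  # first pass: target subword tokens
        self.unit_decoder = unit_decoder  # second pass: discrete acoustic units
        self.vocoder = vocoder            # unit-to-waveform synthesis

    @torch.no_grad()
    def translate(self, source_speech):
        enc_out = self.encoder(source_speech)
        # First pass: autoregressively decode target-language subwords.
        text_tokens, text_states = self.text_decoder.generate(enc_out)
        # Second pass: decode units, attending to both the encoder output
        # and the first-pass decoder states.
        units = self.unit_decoder.generate(enc_out, text_states)
        return text_tokens, self.vocoder(units)
```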
Speech-to-Speech Translation for a Real-world Unwritten Language
Peng-Jen Chen, Kevin Tran, Yilin Yang, Jingfei Du, Justine Kao, Yu-An Chung, Paden Tomasello, Paul-Ambroise Duquenne, Holger Schwenk, Hongyu Gong, Hirofumi Inaguma, Sravya Popuri, Changhan Wang, Juan Pino, Wei-Ning Hsu, Ann Lee
Findings of the Association for Computational Linguistics: ACL 2023
We study speech-to-speech translation (S2ST), which translates speech from one language into another, and focus on building systems to support languages without standard text writing systems. We use English-Taiwanese Hokkien as a case study and present an end-to-end solution, from training data collection and modeling choices to benchmark dataset release. First, we present efforts on creating human-annotated data, automatically mining data from large unlabeled speech datasets, and adopting pseudo-labeling to produce weakly supervised data. On the modeling side, we take advantage of recent advances in applying self-supervised discrete representations as prediction targets in S2ST, and show the effectiveness of leveraging additional text supervision from Mandarin, a language similar to Hokkien, in model training. Finally, we release an S2ST benchmark set to facilitate future research in this field.
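As a loose illustration of the pseudo-labeling step mentioned above, the sketch below shows a generic cascaded pipeline that labels unlabeled source speech with target units so it can serve as weakly supervised S2ST training data. All component names are placeholders; the actual pipeline in the paper, which targets an unwritten language, differs in its details.

```python
def pseudo_label(unlabeled_speech, asr_model, mt_model, tts_model, unit_encoder):
    """Turn unlabeled source audio into (source audio, target units) pairs."""
    pairs = []
    for audio in unlabeled_speech:
        transcript = asr_model.transcribe(audio)          # source-side text (or pivot text)
        translation = mt_model.translate(transcript)      # target-side text
        target_audio = tts_model.synthesize(translation)  # synthetic target speech
        target_units = unit_encoder.encode(target_audio)  # discrete target units
        pairs.append((audio, target_units))
    return pairs
```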
2022
Direct Speech-to-Speech Translation With Discrete Units
Ann Lee, Peng-Jen Chen, Changhan Wang, Jiatao Gu, Sravya Popuri, Xutai Ma, Adam Polyak, Yossi Adi, Qing He, Yun Tang, Juan Pino, Wei-Ning Hsu
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation. We tackle the problem by first applying a self-supervised discrete speech encoder on the target speech and then training a sequence-to-sequence speech-to-unit translation (S2UT) model to predict the discrete representations of the target speech. When target text transcripts are available, we design a joint speech and text training framework that enables the model to generate dual-modality output (speech and text) simultaneously in the same inference pass. Experiments on the Fisher Spanish-English dataset show that the proposed framework yields an improvement of 6.7 BLEU compared with a baseline direct S2ST model that predicts spectrogram features. When trained without any text transcripts, our model performance is comparable to models that predict spectrograms and are trained with text supervision, showing the potential of our system for translation between unwritten languages.
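The sketch below shows one generic way the discrete prediction targets for speech-to-unit translation could be derived, by quantizing self-supervised features of the target speech into unit ids and collapsing consecutive repeats. The feature extractor, clustering setup, and unit count are assumptions, not the paper's exact recipe.

```python
import torch
from sklearn.cluster import KMeans

def build_unit_targets(target_features, n_units=100):
    """target_features: list of [T_i, D] float tensors (e.g. features from a
    self-supervised speech encoder applied to the target-language audio)."""
    stacked = torch.cat(target_features).numpy()
    km = KMeans(n_clusters=n_units, n_init=10).fit(stacked)  # learn the unit inventory
    unit_targets = []
    for feats in target_features:
        units = km.predict(feats.numpy())
        # Collapse consecutive duplicates so the decoder predicts reduced units.
        reduced = [int(units[0])] + [int(u) for prev, u in zip(units, units[1:]) if u != prev]
        unit_targets.append(reduced)
    return unit_targets
```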
Textless Speech-to-Speech Translation on Real Data
Ann Lee, Hongyu Gong, Paul-Ambroise Duquenne, Holger Schwenk, Peng-Jen Chen, Changhan Wang, Sravya Popuri, Yossi Adi, Juan Pino, Jiatao Gu, Wei-Ning Hsu
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another and can be built without the need for any text data. Unlike existing work in the literature, we tackle the challenge of modeling multi-speaker target speech and train the systems with real-world S2ST data. The key to our approach is a self-supervised unit-based speech normalization technique, which finetunes a pre-trained speech encoder with paired audio from multiple speakers and a single reference speaker to reduce the variations due to accents, while preserving the lexical content. With only 10 minutes of paired data for speech normalization, we obtain on average a 3.2 BLEU gain when training the S2ST model on the VoxPopuli S2ST dataset, compared to a baseline trained on an un-normalized speech target. We also incorporate automatically mined S2ST data and show an additional 2.0 BLEU gain. To our knowledge, we are the first to establish a textless S2ST technique that can be trained with real-world data and works for multiple language pairs.
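To illustrate the speech-normalization idea, here is a hedged sketch of a single fine-tuning step in which a pre-trained encoder is pushed, via a CTC-style objective, to predict the reference speaker's unit sequence for another speaker's rendition of the same utterance. The module names and loss formulation are assumptions for illustration, not the paper's released code.

```python
import torch
import torch.nn.functional as F

def normalization_step(encoder, head, optimizer, speaker_audio, reference_units, blank_id=0):
    """One fine-tuning step on a (multi-speaker audio, reference-speaker units) pair."""
    feats = encoder(speaker_audio)                           # [T, D] frame features
    logits = head(feats)                                     # [T, n_units + 1] incl. blank
    log_probs = F.log_softmax(logits, dim=-1).unsqueeze(1)   # [T, 1, C] as CTC expects
    targets = reference_units.unsqueeze(0)                   # [1, U] reference unit ids
    loss = F.ctc_loss(
        log_probs, targets,
        input_lengths=torch.tensor([feats.size(0)]),
        target_lengths=torch.tensor([targets.size(1)]),
        blank=blank_id,
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```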