SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities

Multi-modal large language models are regarded as a crucial step towards Artificial General Intelligence (AGI) and have garnered significant interest with the emergence of ChatGPT. However, current speech-language models typically adopt the cascade paradigm, preventing inter-modal knowledge transfer. In this paper, we propose SpeechGPT, a large language model with intrinsic cross-modal conversational abilities, capable of perceiving and generating multi-modal content. With discrete speech representations, we first construct SpeechInstruct, a large-scale cross-modal speech instruction dataset. Additionally, we employ a three-stage training strategy that includes modality-adaptation pre-training, cross-modal instruction fine-tuning, and chain-of-modality instruction fine-tuning. The experimental results demonstrate that SpeechGPT has an impressive capacity to follow multi-modal human instructions and highlight the potential of handling multiple modalities with one model. Demos are available at https://0nutation.github.io/SpeechGPT.github.io/.


Introduction
Large language models (OpenAI, 2023; Touvron et al., 2023) have performed astonishingly well on various natural language processing tasks. Meanwhile, multi-modal large language models, such as GPT-4, PaLM-E (Driess et al., 2023), and LLaVA (Liu et al., 2023), have explored the ability of LLMs to understand multi-modal information. However, a significant gap exists between current LLMs and artificial general intelligence (AGI). First, most current LLMs can only perceive and understand multi-modal content but cannot spontaneously generate multi-modal content. Second, continuous signals like images and speech cannot be adapted directly to LLMs that receive discrete tokens.
Current speech-language models mainly adopt a cascading paradigm (Huang et al., 2023a), i.e., the LLM is connected with an automatic speech recognition (ASR) model or a text-to-speech (TTS) model in tandem, or the LLM is employed as a control hub with several speech processing models (Cheng et al., 2023a,b,c) integrated to cover multiple audio or speech tasks (Huang et al., 2023a; Shen et al., 2023). Some prior work on generative spoken language models involves encoding the speech signal into a discrete representation (Baevski et al., 2020; Hsu et al., 2021; Zhang et al., 2023a) and modeling it with language models (Lakhotia et al., 2021; Borsos et al., 2022; Zhang et al., 2023d; Wang et al., 2023; Zhang et al., 2023c).
While capable of perceiving and generating speech, the existing cascaded methods or spoken language models still have several limitations. First, the LLM in the cascaded model only functions as a content generator. Since the representations of speech and text are not aligned, the LLM's knowledge cannot be transferred to the speech modality. Second, the cascade approach (Shen et al., 2023; Huang et al., 2023a) suffers from the loss of paralinguistic signals such as emotion and prosody. Third, existing spoken language models (Wang et al., 2023; Zhang et al., 2023d) only synthesize speech but fail to comprehend its semantic information, preventing them from achieving true cross-modal perception and generation.
In this paper, we propose SpeechGPT, a large language model with intrinsic cross-modal conversational abilities, capable of perceiving and generating multi-modal content. We perform speech discretization with a self-supervised trained speech model to unify the modality between speech and text. The discrete speech tokens are then expanded into the vocabulary of the LLM, thus endowing the model with an inherent competence to perceive and generate speech.
To provide the model with the capacity to handle multi-modal instructions, we build the first speech-text cross-modal instruction-following dataset, SpeechInstruct. Specifically, we discretize speech into discrete units (Hsu et al., 2021) and construct cross-modal unit-text pairs based on existing ASR datasets. Meanwhile, we construct hundreds of instructions for diverse tasks with GPT-4 to simulate actual user instructions, as illustrated in Appendix B. In addition, to further enhance the model's cross-modal capability, we design the Chain-of-Modality instruction data, i.e., the model receives the speech command, thinks about the process in text, and then outputs the response in speech.
For better cross-modal transfer and efficient training, SpeechGPT undergoes a three-stage training process: modality-adaptation pre-training, cross-modal instruction fine-tuning, and chain-of-modality instruction fine-tuning. The first stage enables speech comprehension for SpeechGPT with the discrete speech unit continuation task. The second stage employs SpeechInstruct to improve the model's cross-modal capabilities. The third stage utilizes parameter-efficient LoRA (Hu et al., 2021) fine-tuning for further modality alignment.
To evaluate the effectiveness of SpeechGPT, we conduct a wide range of human evaluations and case analyses to estimate the performance of SpeechGPT on textual tasks, speech-text cross-modal tasks, and spoken dialogue tasks. The results demonstrate that SpeechGPT exhibits a strong ability on unimodal and cross-modal instruction-following tasks.
Our contributions include the following: • We build the first multi-modal large language model that can perceive and generate multi-modal content.

Related Work

Multi-modal Large Language Model Recent work has extended LLMs to multi-modal understanding (OpenAI, 2023; Huang et al., 2023b; Zhang et al., 2023b). PaLM-E (Driess et al., 2023) integrates the 540B PaLM (Chowdhery et al., 2022) and the 22B Vision Transformer (Dosovitskiy et al., 2021) into the largest vision-language model. LLaVA (Liu et al., 2023) leverages the pre-trained CLIP (Radford et al., 2021) visual encoder and LLaMA (Touvron et al., 2023) and conducts instruction tuning on GPT-4-assisted visual instruction data. X-LLM (Chen et al., 2023) converts multiple modalities into representations with X2L interfaces as the inputs of the large language model. However, such structures only enable LLMs to process multi-modal input, without the ability to generate multi-modal output. Diverging from prior studies, our approach emphasizes the development of a speech-centric multi-modal LLM, endowing it with the proficiency to accommodate both multi-modal input and output.
Generative Spoken Language Model Discrete self-supervised representation based spoken generative language modeling is making remarkable progress through large-scale speech dataset training (Nguyen et al., 2022). AudioLM (Borsos et al., 2022) proposes to model speech based on audio codecs together with semantic codes, which can synthesize speech in a textless setting. VALL-E (Wang et al., 2023) builds a generative spoken language model on audio codecs and treats text-to-speech as a conditional generation task. However, these models are designed for specific tasks and fail to benefit from LLMs. SpeechGPT is built upon the foundation of an LLM and transfers the LLM's knowledge to the speech modality, consequently obtaining better task generalization and human-instruction-following ability.
Speech-Enabled LLM Interaction Following the emergence of ChatGPT, several studies have concentrated on the integration of expert speech models with LLMs to enable direct speech interaction with LLMs. HuggingGPT (Shen et al., 2023) facilitates task decomposition of human instructions by LLMs and allows the invocation of models from Hugging Face to accomplish specific tasks, encompassing a range of automatic speech recognition (ASR) and text-to-speech (TTS) models. AudioGPT (Huang et al., 2023a) leverages a variety of audio foundation models to process complex audio information and connects LLMs with input/output interfaces (ASR, TTS) for speech conversations. However, these models exhibit increased complexity, demand extensive resources, and are prone to unavoidable error accumulation problems. Our approach enables speech interaction with LLMs without relying on ASR or TTS systems, circumventing the aforementioned drawbacks.

SpeechInstruct Construction
Due to the limitations of publicly available speech data and the lack of variety in speech-text tasks, we construct SpeechInstruct, a speech-text cross-modal instruction-following dataset. This dataset consists of two parts: the first part is called Cross-Modal Instruction, and the second part is called Chain-of-Modality Instruction. The construction process of SpeechInstruct is illustrated in Figure 2.

Cross-modal Instruction
Data Collection We collect several large-scale English ASR datasets to construct Cross-Modal Instruction, including Gigaspeech (Chen et al., 2021), Common Voice (Ardila et al., 2020), and LibriSpeech (Panayotov et al., 2015). We employ mHuBERT as the speech tokenizer to discretize the speech data into discrete units and remove the repetitive units of adjacent frames to obtain reduced units. Ultimately, we obtain 9 million unit-text data pairs.
Task Description Generation We generate ASR and TTS task descriptions that are compatible with the speech-text data pairs. Unlike the Self-Instruct method (Wang et al., 2022), we generate descriptions through a zero-shot approach. Specifically, we directly input the prompts shown in Appendix A into OpenAI GPT-4 to generate task descriptions.
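As a concrete illustration of the discretization step in Data Collection above, the following sketch maps a waveform to reduced units; the mhubert feature extractor and kmeans quantizer are hypothetical stand-ins for the actual mHuBERT checkpoint and its k-means clustering model, not the paper's code.

```python
import torch

def speech_to_reduced_units(waveform, mhubert, kmeans):
    """Discretize speech into cluster indices and merge repeated adjacent
    frames to obtain the "reduced units" described above.
    Assumed interfaces: `mhubert(waveform)` returns frame-level features of
    shape (T, D); `kmeans.predict` assigns each frame to one of K clusters."""
    with torch.no_grad():
        feats = mhubert(waveform)                  # (T, D) continuous representations
    units = kmeans.predict(feats.cpu().numpy())    # (T,) cluster index per frame
    reduced = [int(units[0])]
    for u in units[1:]:                            # drop repetitions of adjacent frames
        if u != reduced[-1]:
            reduced.append(int(u))
    return reduced
```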
Our generation method yields 100 instructions for each task and some examples are shown in Appendix B.
Instruction Formatting For a discrete unit sequence U and its associated transcription T, we determine whether it will be used to construct an ASR task or a TTS task based on a probability p. Subsequently, we randomly select a description D from the corresponding task descriptions. This results in a triplet consisting of the task description, discrete unit sequence, and transcription, denoted as (D, U, T). Following this, the triplet is assembled into an instruction using the template: [Human]: {D}. This is input: {U}<eoh>.
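A minimal sketch of this formatting step is given below; the two description lists are placeholders for the GPT-4 generated descriptions in Appendix B, and the `<{u}>` serialization of discrete units is an assumption about how unit tokens are written out.

```python
import random

ASR_DESCRIPTIONS = ["Transcribe the following speech into text."]  # placeholder descriptions
TTS_DESCRIPTIONS = ["Convert the following text into speech."]     # placeholder descriptions

def format_cross_modal_instruction(units, transcript, p=0.5):
    """Assemble a (D, U, T) triplet into the prompt template above.
    With probability p the pair becomes an ASR sample (speech in, text out);
    otherwise it becomes a TTS sample (text in, speech out)."""
    unit_str = "".join(f"<{u}>" for u in units)        # assumed unit-token serialization
    if random.random() < p:
        description, model_input = random.choice(ASR_DESCRIPTIONS), unit_str
    else:
        description, model_input = random.choice(TTS_DESCRIPTIONS), transcript
    return f"[Human]: {description}. This is input: {model_input}<eoh>."
```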

Chain-of-Modality Instruction
Speech Instruction Generation Due to the lack of instruction data with speech input and speech output, we train a text-to-unit generator to convert text instruction data into speech instruction data. Specifically, the text-to-unit generator adopts a Transformer encoder-decoder architecture. We train it on the LibriSpeech unit-text pairs in Cross-modal Instruction. We select 37,969 samples from the moss-002-sft-data dataset whose response length is shorter than 35 words, and we convert both their instructions and responses into unit sequences through the text-to-unit generator. As a result, we obtain 37,969 quadruplets composed of speech instructions, text instructions, text responses, and speech responses, denoted as (SpeechI, TextI, TextR, SpeechR).
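The sketch below illustrates how such quadruplets could be assembled, assuming a text_to_unit callable standing in for the trained Transformer text-to-unit generator; the field names are illustrative rather than the paper's actual data format.

```python
def build_chain_of_modality_quadruplets(samples, text_to_unit, max_words=35):
    """Keep (instruction, response) pairs with responses shorter than `max_words`
    and synthesize unit sequences for both sides to form
    (SpeechI, TextI, TextR, SpeechR) quadruplets."""
    quadruplets = []
    for instruction, response in samples:
        if len(response.split()) >= max_words:     # only responses shorter than 35 words
            continue
        quadruplets.append({
            "SpeechI": text_to_unit(instruction),  # speech (unit) instruction
            "TextI": instruction,                  # text instruction
            "TextR": response,                     # text response
            "SpeechR": text_to_unit(response),     # speech (unit) response
        })
    return quadruplets
```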
Instruction Formatting Using the above quadruplets, we could construct chain-of-thought style instructions for four input-output formats, namely Speech Instruction-Speech Response, Speech Instruction-Text Response, Text Instruction-Speech Response, and Text Instruction-Text Response.
Their corresponding templates can be found in Appendix C.

SpeechInstruct Evaluation Set
We construct cross-modal dialogue datasets under different scenarios to evaluate whether SpeechGPT can take on various roles. Specifically, these include a talking encyclopedia, personal assistant, chat partner, poet, psychologist, and educational assistant. For each role, we provide 10 manually authored instruction-response pairs written by ourselves. We use a pre-trained text-to-speech model to convert the text into corresponding speech. We then employ mHuBERT to discretize the speech data into discrete units as described in Section 3.1. Ultimately, for each role, we obtain 10 quadruplets composed of speech instructions, text instructions, text responses, and speech responses.

Model Structure
A unified framework is designed to provide architecture compatibility across different modalities. As shown in Figure 2, our model consists of three main components: a discrete unit extractor, a large language model, and a unit vocoder. Under this architecture, the LLM can perceive multi-modal inputs and generate multi-modal outputs.
Discrete Unit Extractor The discrete unit extractor utilizes the Hidden-unit BERT (HuBERT) model (Hsu et al., 2021) to transform continuous speech signals into a sequence of discrete units. HuBERT is a self-supervised model that learns by predicting discrete labels for masked audio segments based on k-means clustering applied to the model's intermediate representations. It features a combination of 1-D convolutional layers and a Transformer encoder to encode speech into continuous intermediate representations, with a k-means model further converting these representations into a sequence of cluster indices. Subsequently, adjacent duplicate indices are removed, resulting in a discrete unit sequence represented as U = (u_1, u_2, ..., u_T), u_i ∈ {0, 1, ..., K−1}, ∀ 1 ≤ i ≤ T, with K denoting the total number of clusters.
Large Language Model We employ the Meta AI LLaMA (Touvron et al., 2023) model as our large language model. LLaMA comprises an embedding layer, multiple transformer blocks, and an LM head layer. The total number of parameters in LLaMA ranges from 7B to 65B. Drawing from an extensive training dataset of 1.0 trillion tokens, LLaMA demonstrates competitive performance compared to the substantially larger 175B GPT-3 across various NLP benchmarks.
Unit Vocoder Due to the limitation of the single-speaker unit vocoder in (Polyak et al., 2021), we train a multi-speaker unit HiFi-GAN to decode the speech signal from the discrete representations. The HiFi-GAN architecture consists of a generator G and multiple discriminators D. The generator uses look-up tables (LUT) to embed discrete representations, and the embedding sequences are up-sampled by a series of blocks composed of transposed convolutions and a residual block with dilated layers. The speaker embedding is concatenated to each frame in the up-sampled sequence. The discriminator features a Multi-Period Discriminator (MPD) and a Multi-Scale Discriminator (MSD), which have the same architecture as (Polyak et al., 2021).
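A minimal PyTorch sketch of the vocoder generator's input stage is shown below: discrete units are embedded through a look-up table and a speaker embedding is concatenated to every frame before up-sampling. All dimensions are illustrative assumptions, not the configuration used in the paper.

```python
import torch
import torch.nn as nn

class UnitVocoderFrontEnd(nn.Module):
    """Embed discrete units via a look-up table (LUT) and concatenate a
    speaker embedding to each frame; the result would feed the HiFi-GAN
    up-sampling blocks (not implemented here)."""
    def __init__(self, num_units=1000, unit_dim=128, num_speakers=200, speaker_dim=64):
        super().__init__()
        self.unit_embedding = nn.Embedding(num_units, unit_dim)          # LUT for unit tokens
        self.speaker_embedding = nn.Embedding(num_speakers, speaker_dim)

    def forward(self, units, speaker_id):
        # units: (batch, T) integer cluster indices; speaker_id: (batch,)
        x = self.unit_embedding(units)                                   # (batch, T, unit_dim)
        spk = self.speaker_embedding(speaker_id).unsqueeze(1)            # (batch, 1, speaker_dim)
        spk = spk.expand(-1, x.size(1), -1)                              # repeat for every frame
        return torch.cat([x, spk], dim=-1)                               # (batch, T, unit_dim + speaker_dim)
```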

Training
To incorporate discrete speech representations into the LLM, we first expand the vocabulary and the corresponding embedding matrix. We divide the training process into three stages. The first stage is Modality-Adaptation Pre-training on unpaired speech data. The second stage is Cross-modal Instruction Fine-Tuning. The third stage is Chain-of-Modality Instruction Fine-Tuning.
Expanding Vocabulary Given the original LLM vocabulary V of size |V|, to integrate speech discrete representations into the LLM, we expand the vocabulary with an additional set of unit tokens V', of size |V'| = K. The expanded vocabulary V'' is the union of the original vocabulary V and the new unit tokens V': V'' = V ∪ V'. We denote the original word embedding matrix as E ∈ R^{|V|×d}, where d is the dimension of the word embeddings. To accommodate the expanded vocabulary, we create a randomly initialized word embedding matrix E' ∈ R^{|V''|×d} and preserve the original word embeddings by copying the values of E into the first |V| rows of E': E'[i, :] = E[i, :] for 1 ≤ i ≤ |V|. Finally, we replace the original vocabulary and word embedding matrix with the new vocabulary V'' and the word embedding matrix E'.
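The vocabulary expansion step can be sketched as follows; this is a hedged PyTorch illustration, and in practice a framework utility such as resize_token_embeddings in Hugging Face Transformers performs an equivalent operation.

```python
import torch
import torch.nn as nn

def expand_vocabulary(embedding: nn.Embedding, num_unit_tokens: int) -> nn.Embedding:
    """Create E' in R^{|V''| x d}: randomly initialized rows for the K new unit
    tokens, with the original embeddings E copied into the first |V| rows."""
    old_size, dim = embedding.weight.shape
    expanded = nn.Embedding(old_size + num_unit_tokens, dim)   # random init for new rows
    with torch.no_grad():
        expanded.weight[:old_size] = embedding.weight          # preserve original E
    return expanded
```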
Stage 1: Modality-Adaptation Pre-training To enable the LLM to handle the discrete unit modality, we utilize an unlabeled speech corpus to train the LLM on a next-token prediction task. This approach aligns with the text pre-training objective of the LLM. Given an unlabeled speech corpus C consisting of speech U_1, U_2, ..., U_m and the LLM denoted as L, the negative log-likelihood loss can be formulated as L(L|C) = − Σ_{j=1}^{m} Σ_{i=1}^{n_j} log p(u_{i,j} | u_{<i,j}; L), where m is the number of speech utterances in dataset C, n_j is the number of discrete unit tokens in speech U_j, and u_{i,j} represents the i-th unit token in the j-th speech.
Stage 2: Cross-modal Instruction Fine-Tuning In this stage, we align the speech and text modalities utilizing paired data. We mix Cross-modal Instruction in SpeechInstruct with the moss-002-sft dataset to derive the mixed dataset I, which consists of samples T_1, T_2, ..., T_x. We fine-tune the model L obtained from the first stage on I. Each sample T_j, consisting of t_1, t_2, ..., t_{y_j}, is formed by concatenating a prefix and a text. The training objective is to minimize the negative log-likelihood, and the loss calculation only considers the text part, ignoring the prefix, which can be formulated as L(L|I) = − Σ_{j=1}^{x} Σ_{i=p_j+1}^{y_j} log p(t_{i,j} | t_{<i,j}; L), where x is the number of samples in corpus I, y_j is the total number of tokens in sample T_j, p_j is the number of tokens in the prefix part of T_j, and t_{i,j} represents the i-th word in T_j.
Stage 3: Chain-of-Modality Instruction Fine-Tuning After obtaining the model in stage 2, we utilize parameter-efficient Low-Rank Adaptation (LoRA) (Hu et al., 2021) to fine-tune it on Chain-of-Modality Instruction in SpeechInstruct. We add LoRA weights (adapters) to the attention mechanisms and train only the newly added LoRA parameters. We adopt the same loss function as stage 2.
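A sketch of the stage-2 objective for a single sample is shown below, assuming standard next-token prediction where positions whose target token still lies in the prefix are excluded from the loss; tensor shapes and the ignore_index convention follow common PyTorch practice rather than the paper's code.

```python
import torch
import torch.nn.functional as F

def prefix_masked_nll(logits, tokens, prefix_len):
    """Negative log-likelihood over the text part only.
    logits: (T, vocab) model outputs; tokens: (T,) token ids t_1..t_T;
    prefix_len: p_j, the number of prefix tokens to ignore."""
    targets = tokens[1:].clone()              # logits[i] predicts tokens[i + 1]
    targets[: prefix_len - 1] = -100          # targets still inside the prefix are ignored
    return F.cross_entropy(logits[:-1], targets, ignore_index=-100)
```

With prefix_len set to 1 the same function reduces to the plain next-token objective of stage 1, and stage 3 keeps this loss while training only the LoRA adapters.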

Experimental Setups
Datasets For modality-adaptation pre-training, we use LibriLight (Kahn et al., 2020), which contains 60K hours of unlabelled English audiobook speech. For the cross-modal instruction fine-tuning stage, we use the Gigaspeech (Chen et al., 2021), Common Voice (Ardila et al., 2020), and LibriSpeech (Panayotov et al., 2015) datasets together with the moss-002-sft-data dataset, which are described in detail in Section 3.1. For the chain-of-modality instruction fine-tuning stage, we use the moss-002-sft-data dataset, which is described in detail in Section 3.2.
Configuration We employ LLaMA-13B (Touvron et al., 2023) as our backbone model for a trade-off between performance and the computational resources available. For stage 1, we use 96 A100 GPUs and train for 900 steps with batch size 768. For stage 2, we use 96 A100 GPUs and train for 2100 steps with batch size 1536. For stage 3, we use 8 A100 GPUs and train for 4200 steps with batch size 128.
Details about training hyperparameters are shown in Appendix D. For decoding, we set the maximum sequence length to 2048 and the temperature to 0.8, and we use top-k sampling with k = 60 together with top-p sampling with p = 0.8.
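For reference, these decoding settings map onto a Hugging Face-style generation configuration roughly as follows; the parameter names are those of the transformers library and are an assumption about how one would reproduce the setup.

```python
from transformers import GenerationConfig

# Decoding settings reported above: max length 2048, temperature 0.8,
# top-k sampling with k=60 combined with top-p (nucleus) sampling with p=0.8.
generation_config = GenerationConfig(
    max_length=2048,
    do_sample=True,
    temperature=0.8,
    top_k=60,
    top_p=0.8,
)
# Usage (assuming a loaded causal LM and a tokenized prompt):
# output_ids = model.generate(input_ids, generation_config=generation_config)
```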

Baselines
We establish two cascaded cross-modal conversational systems as our baselines. The first model, referred to as Speech-Alpaca-13B, consists of an off-the-shelf ASR system, Alpaca 13B (Taori et al., 2023), and a pre-trained TTS system. The second model, named Speech-LLaMA-MOSS-002, incorporates the same ASR and TTS systems, along with a large language model obtained by performing supervised fine-tuning on LLaMA-13B using MOSS-sft-002 as the training dataset.

Evaluation
We evaluate the cross-modal instruction-following capabilities of SpeechGPT across four tasks: speech-to-speech instruction-following (S2SIF), speech-to-text instruction-following (S2TIF), text-to-speech instruction-following (T2SIF), and text-to-text instruction-following (T2TIF).
Data We randomly select 40 samples from the AlpacaEval dataset and use the pre-trained TTS model in Section 3.3 to convert the text into corresponding speech. We then employ mHuBERT to discretize the speech data into discrete units as described in Section 3.1. These are combined with the SpeechInstruct Evaluation Set to constitute our test set, which contains 100 samples. Each sample is a quadruplet composed of a speech instruction, text instruction, text response, and speech response. We denote them as ground truth.
ChatGPT Score We utilize ChatGPT (GPT-3.5-turbo) to assess the cross-modal instruction-following performance. For tasks that include speech, we leverage the pre-trained ASR model in Section 5.2 to transform the speech into its corresponding text, which is subsequently submitted for evaluation. Inspired by (Zhou et al., 2023), we feed the prompt in Appendix F to ChatGPT to score the model's outputs based on response quality, with scores ranging from 1 to 5.
Human Opinion Score Following (Nguyen et al., 2022), we calculate the human opinion score of the generated examples through crowdsourcing. These opinions are based on two dimensions: the content mean opinion score (CMOS) for content and meaningfulness quality, and the naturalness mean opinion score (NMOS) for speech naturalness and fluency. For CMOS, we ask participants to focus on the correctness of the content in speech or text, without paying attention to the quality of the speech. For NMOS, we direct participants to focus on the quality, smoothness, and naturalness of the speech, without considering its content. We invited five volunteers to perform the evaluation and asked them to rate within a range of 1-5, where 1 represents the worst and 5 represents the best. For the speech-to-speech instruction-following and text-to-speech instruction-following tasks, we calculate both CMOS and NMOS. For the speech-to-text instruction-following and text-to-text instruction-following tasks, we calculate CMOS.

Main Results
Content As shown in Table 1, taking into account the comprehensive evaluation of ChatGPT Score and CMOS, SpeechGPT demonstrates superior performance on speech instructions (S2SIF and S2TIF) compared to the two baseline systems. This indicates that SpeechGPT outperforms the ASR model in the cascaded system when it comes to understanding speech content. From the perspective of CMOS, SpeechGPT achieves performance similar to the baseline systems on the T2SIF and T2TIF tasks, indicating that SpeechGPT still possesses commendable text and speech generation capabilities. In the S2SIF and T2SIF tasks, the ChatGPT Score and CMOS values appear inconsistent for the ground truth and baseline systems. This can be attributed to the speech responses being synthesized by the TTS system, which can have errors in pauses between sentences. This introduces significant errors for longer responses, leading to incorrect text after being processed by the ASR system, thereby reducing the ChatGPT score. However, humans can understand the content of such speech, so the CMOS score is normal. Cases of cross-modal instruction-following can be found in Appendix G.
Speech Quality As shown in Table 1, SpeechGPT exhibits significantly higher NMOS values compared to the baseline systems. This indicates that the speech responses generated by SpeechGPT outperform the TTS system in the cascaded system in terms of audio quality and prosody. A more detailed speech prosody analysis is provided in Section ??.

Chain-of-modality prompting matters
Table 2 shows ChatGPT Scores on the speech-to-speech instruction-following task for models utilizing standard prompting and chain-of-modality prompting during the training and inference stages, respectively. Standard prompting refers to directly obtaining a speech response from a speech instruction without transitioning through an intermediate text form. The template can be found in Appendix E. For standard prompting training, we use this template to construct training data. We discovered that if standard prompting is used during training, the performance is rather poor whether standard prompting or chain-of-modality prompting is used for inference. If chain-of-modality prompting is employed during training, the ChatGPT Score sees an enhancement, and when inference also applies chain-of-modality prompting, there is a huge improvement in performance. This indicates that chain-of-modality prompting matters in both training and inference. We think chain-of-modality prompting decomposes the complex task into easier tasks, allowing the model to complete them step by step, which reduces the difficulty.

Can text knowledge benefit speech modality?
SpeechGPT originates from a text pre-trained model, LLaMA. Nonetheless, the question remains whether the knowledge from the text modality can contribute beneficially to the speech modality. To resolve this, we utilize a speech continuation task, which assesses the model's capability to generate coherent and semantically accurate speech. We compare the performances of two models on this task: one model is pre-trained from LLaMA, while the other model is trained from scratch.
We utilize the LibriSpeech test-clean set for evaluation, where we randomly select 100 utterances and use the first 3 seconds of each utterance as a prompt. The 3-second speech prompt is converted into discrete units by mHuBERT. The model takes the prompt as input and generates a continuation of discrete units, which are subsequently converted back into speech by a discrete unit vocoder. To assess the semantic quality of the speech continuation, we employ the ASR-PPL metric. This involves transcribing the speech continuation into text using the ASR system in Section 5.2 and calculating the perplexity of the transcripts using the GPT-3.5 text-davinci-003 model. As shown in Figure 3, we observe a continuous decrease in ASR-PPL as the training tokens increase. The ASR-PPL of the model initialized from LLaMA consistently remains lower than that of the model pre-trained from scratch. This indicates that the text pre-trained model provides a warm initialization and that the speech modality can benefit from text knowledge. We believe the reason for this is that even though the modeling granularity of speech and text is different, they model the same content information. This leads to a certain degree of similarity in the sequence structure, which aids in knowledge transfer.
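As an illustration of the ASR-PPL computation, the sketch below scores transcripts with an open causal language model; the original evaluation uses the GPT-3.5 (text-davinci-003) API, so the local model here is only a hedged stand-in.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def asr_ppl(transcripts, model_name="gpt2"):
    """Average perplexity of ASR transcripts of the generated speech
    continuations, computed with a causal LM (stand-in for text-davinci-003)."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()
    ppls = []
    for text in transcripts:
        ids = tokenizer(text, return_tensors="pt").input_ids
        with torch.no_grad():
            loss = model(ids, labels=ids).loss      # mean per-token negative log-likelihood
        ppls.append(torch.exp(loss).item())
    return sum(ppls) / len(ppls)
```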

Does SpeechGPT Sacrifice Text Capability as a Trade-off?
Initialized from LLaMA, SpeechGPT is capable of perceiving and generating speech after training on large-scale speech data. However, does SpeechGPT sacrifice text capability as a trade-off?
To draw conclusions, we compare the text-to-text instruction-following ability of SpeechGPT with LLaMA-MOSS-002. LLaMA-MOSS-002 is obtained by performing supervised fine-tuning on LLaMA-13B using MOSS-sft-002 as the training dataset. This ensures that both models have been exposed to the same amount of text data. We evaluate both models using the test set from Section 5.3. As depicted in Figure 4, with an increase in training samples, both LLaMA-MOSS-002's and SpeechGPT's ChatGPT Scores gradually improve. Although SpeechGPT consistently remains lower than LLaMA-MOSS-002, the performance gap between them gradually decreases. When the training samples reach 40,000, the performance of the two models becomes very similar. This suggests that SpeechGPT still retains text capability. We attribute this to the large parameter size of the 13B model, enabling it to learn the new speech modality while preserving text capability without catastrophic forgetting.

Conclusion
This work presents SpeechGPT, a large language model with intrinsic cross-modal conversational abilities, capable of perceiving and generating multi-modal content. To alleviate the scarcity of instruction datasets in the current speech domain, we propose SpeechInstruct, the first speech-text cross-modal instruction-following dataset. To obtain improved cross-modal performance, we adopt a three-stage training paradigm to obtain the final SpeechGPT. Experimental results indicate that SpeechGPT achieves promising results on various unimodal and cross-modal instruction-following tasks and demonstrate that incorporating discrete speech tokens into the language model is a promising direction.

Limitation
Despite SpeechGPT exhibiting impressive cross-modal instruction-following and spoken dialogue abilities, it still presents certain limitations: 1) Due to the constraints of the audio discretization technique, SpeechGPT does not explicitly model the paralinguistic information included in the speech signal.
2) Since SpeechGPT generates speech responses via the Chain-of-Modality, it needs to generate text tokens before the speech units, which increases decoding time. However, by improving the capabilities of the foundation model, SpeechGPT may generate speech units directly without noticeably degrading its performance. 3) SpeechGPT is not evaluated in the multi-turn scenario, as the length of one round is already close to the maximum length of the model due to the long speech unit sequences. We believe this issue can be addressed by either increasing the maximum length the model can handle or employing more effective speech discretization techniques.

Figure 2: Left: An overview of the SpeechInstruct construction process. The SpeechInstruct dataset consists of two parts: Cross-modal Instruction data and Chain-of-Modality Instruction data. Template 1 is shown in Section 3.1. Template 2 is shown in Appendix C. Right: An illustration of the SpeechGPT model structure.

Figure 3: ASR-PPL of the speech continuation task on 100 utterances from the LibriSpeech test-clean set. From scratch refers to the model pre-trained from randomly initialized parameters; From LLaMA denotes the model pre-trained from LLaMA.
Speech Instruction-Speech Response: [Human]: This is a speech instruction: {SpeechI}. And your response should be speech. You can do it step by step. You can first transcribe the instruction and get the text instruction. Then you can think about the instruction and get the text response. Last, you should speak the response aloud <eoh>. [SpeechGPT]: [tq] {TextI}; [ta] {TextR}; [ua] {SpeechR}<eoa>.
Speech Instruction-Text Response: [Human]: This is a speech instruction: {SpeechI}. And your response should be text. You can do it step by step. You can first transcribe the instruction and get the text instruction. Then you can think about the instruction and get the text response. <eoh>. [SpeechGPT]: [tq] {TextI}; [ta] {TextR}<eoa>.
Text Instruction-Speech Response: [Human]: This is a text instruction: {TextI}. And your response should be speech. You can do it step by step. You can think about the instruction and get the text response. Then you should speak the response aloud <eoh>. [SpeechGPT]: [ta] {TextR}; [ua] {SpeechR}<eoa>.
Text Instruction-Text Response: [Human]: This is a text instruction: {TextI}. And your response should be text. You can think about the instruction and get the text response. [SpeechGPT]: [ta] {TextR}<eoa>.

Table 1: Main Results of SpeechGPT. S2SIF refers to speech-to-speech instruction-following, S2TIF to speech-to-text instruction-following, T2SIF to text-to-speech instruction-following, and T2TIF to text-to-text instruction-following. The ChatGPT score is obtained through ChatGPT evaluation. CMOS refers to the content mean opinion score; NMOS denotes the naturalness mean opinion score. *: The low ChatGPT Score for speech responses in Ground Truth is due to them being synthesized by the TTS system, which can have errors in pauses between sentences. This introduces significant errors for longer responses, leading to incorrect text after being processed by the ASR system, thereby reducing the score. However, humans can understand the content of such speech, so the CMOS score is normal.

Table 2: ChatGPT Score on the speech-to-speech instruction-following task. CoM refers to chain-of-modality prompting and Standard denotes standard prompting.

Table 3: SpeechGPT training hyperparameters.

F ChatGPT Score Evaluation Prompt

You are evaluating a response that has been submitted for an instruction, using a specific set of standards. Below is the data:
"1": "Not helpful - The generated text is completely irrelevant, unclear, or incomplete. It does not provide any useful information to the user."
"2": "Somewhat helpful - The generated text has some relevance to the user's question, but it may be unclear or incomplete. It provides only partial information, or the information provided may not be useful for the user's needs."
"3": "Moderately helpful - The generated text is relevant to the user's question, and it provides a clear and complete answer. However, it may lack detail or explanation that would be helpful for the user."
"4": "Helpful - The generated text is quite relevant to the user's question, and it provides a clear, complete, and detailed answer. It offers additional information or explanations that are useful for the user. However, some of the points of the response are somewhat repetitive or could be combined for greater clarity and concision."
"5": "Very helpful - The generated text is highly relevant to the user's question, and it provides a clear, complete, and detailed answer. It offers additional information, explanations, or analogies that are not only useful but also insightful and valuable to the user. However, the structure of the response is not well-organized and there is no clear progression or logical sequence of different points in the response."
*** [END DATA] Does the response meet the criterion? You should only write out your score in this format: "My score is: "