Diffusion models have achieved state-of-the-art synthesis quality on both visual and audio tasks, and recent works further adapt them to textual data by diffusing in the embedding space. In this paper, we conduct a systematic study of the optimization challenges posed by both the embedding space and the denoising model, which have not been carefully explored. First, because the embeddings themselves are learnable, the data distribution can collapse, leading to unstable training. To alleviate this problem, we propose a new objective, the anchor loss, which is more efficient than previous remedies. Second, we find that the noise levels of conventional schedules are insufficient for training a capable denoising model and, as a consequence, introduce varying degrees of degeneration. To address this challenge, we propose a noise rescaling framework. Building on this analysis, we propose Difformer, a Transformer-based embedding diffusion model. Experiments on a variety of seminal text generation tasks show the effectiveness of the proposed methods and the superiority of Difformer over previous state-of-the-art embedding diffusion baselines.
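To make the anchoring idea concrete, the sketch below is a minimal, hedged illustration (our own assumption, not the Difformer code): it pairs the usual embedding regression objective with a term that projects the denoised prediction back onto the vocabulary through the shared embedding matrix, so the learnable embeddings cannot drift toward a degenerate region without being penalized. The names `embed`, `denoiser`, and `alphas_cumprod` are illustrative placeholders.

```python
# Minimal sketch, assuming a standard embedding diffusion setup; not the
# authors' implementation. `embed` is a shared nn.Embedding, `denoiser` a
# Transformer that predicts clean embeddings from noisy ones, `alphas_cumprod`
# the cumulative signal levels of the noise schedule.
import torch
import torch.nn.functional as F

def training_loss(denoiser, embed, tokens, t, alphas_cumprod, anchor_weight=1.0):
    z0 = embed(tokens)                                   # clean embeddings (B, L, D)
    a_bar = alphas_cumprod[t].view(-1, 1, 1)             # (B, 1, 1)
    zt = a_bar.sqrt() * z0 + (1 - a_bar).sqrt() * torch.randn_like(z0)

    z0_hat = denoiser(zt, t)                             # predicted clean embeddings

    # Regression term on the (learnable) embedding targets.
    mse = F.mse_loss(z0_hat, z0)

    # Anchor-style term: score the prediction against the whole vocabulary via
    # the shared embedding matrix, so collapsing the embedding space would
    # directly hurt this cross-entropy.
    logits = z0_hat @ embed.weight.t()                   # (B, L, V)
    anchor = F.cross_entropy(logits.transpose(1, 2), tokens)

    return mse + anchor_weight * anchor
```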
Diffusion models have achieved significant success in computer vision and shown immense potential in natural language processing, particularly for text generation. However, generating high-quality text with these models often requires thousands of iterations, leading to slow sampling. Existing acceleration methods either neglect the distribution of sampling steps, compromising performance when fewer iterations are used, or require additional training, introducing considerable computational overhead. In this paper, we present Few-shot Temporal Pruning, a novel technique that accelerates diffusion models for text generation without supplementary training while effectively leveraging limited data. Using a Bayesian optimization approach, our method eliminates redundant steps from the sampling process, thereby increasing generation speed. A comprehensive evaluation of discrete and continuous diffusion models across machine translation, question generation, and paraphrasing shows that our approach achieves competitive performance with very few sampling steps after less than one minute of optimization, yielding accelerations of up to 400x on text generation tasks.
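A minimal sketch of the step-pruning idea, under our own assumptions rather than the paper's code: it uses Optuna's default TPE sampler as a stand-in Bayesian optimizer and user-supplied `generate_fn`/`score_fn` hooks (hypothetical names) to search for a small set of sampling timesteps that scores best on a handful of validation examples.

```python
# Hedged sketch: Bayesian search over a pruned timestep schedule.
# generate_fn(steps, batch) -> model outputs for the given descending schedule;
# score_fn(outputs, batch) -> float, higher is better (e.g. BLEU on references).
import optuna

def prune_timesteps(generate_fn, score_fn, few_shot_batch,
                    total_steps=2000, keep=10, trials=50):
    """Search for `keep` sampling timesteps that maximize quality on a few examples."""
    def objective(trial):
        # Propose `keep` timesteps in [1, total_steps); duplicates are dropped
        # by the set, and the schedule is sorted into descending order.
        steps = sorted({trial.suggest_int(f"t{i}", 1, total_steps - 1)
                        for i in range(keep)}, reverse=True)
        return score_fn(generate_fn(steps, few_shot_batch), few_shot_batch)

    study = optuna.create_study(direction="maximize")   # TPE sampler by default
    study.optimize(objective, n_trials=trials)
    return sorted(study.best_params.values(), reverse=True)
```

In practice one would plug the model's sampler in as `generate_fn` and a task metric as `score_fn`; with a few dozen trials on a small batch, such a search can stay within the sub-minute budget described above.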
While diffusion generative models have achieved great success on image generation tasks, how to efficiently and effectively incorporate them into speech generation, especially translation tasks, remains a non-trivial problem. Specifically, due to the low information density of speech data, the transformed discrete speech unit sequence is much longer than the corresponding text transcription, posing significant challenges to existing auto-regressive models. Furthermore, naively applying discrete diffusion to the speech unit sequence while disregarding the structure of the continuous space significantly degrades generation performance. In this paper, we propose a novel diffusion model that applies the diffusion forward process in the continuous speech representation space and the diffusion backward process in the discrete speech unit space. In this way, we preserve the semantic structure of the continuous speech representation space during diffusion and integrate the continuous and discrete diffusion formulations. Extensive experiments on textless direct speech-to-speech translation show that the proposed method achieves results comparable to computationally intensive auto-regressive baselines (500 decoding steps on average) with significantly fewer decoding steps (50).
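To illustrate the continuous-forward/discrete-backward pairing, here is a rough sketch under our own assumptions: a `unit_embed` table maps discrete speech units into the continuous representation space, a `denoiser` predicts unit logits, and the reverse update is a DDIM-style deterministic step. It is illustrative only, not the paper's implementation.

```python
# Illustrative sketch only. Forward: corrupt continuous unit embeddings with
# Gaussian noise. Backward: predict discrete units, re-embed them, and take a
# DDIM-style deterministic step toward the previous noise level.
import torch

def forward_diffuse(unit_embed, units, t, alphas_cumprod):
    """Noise injection in the continuous representation space of discrete units."""
    x0 = unit_embed(units)                                # (B, L, D)
    a_bar = alphas_cumprod[t].view(-1, 1, 1)
    return a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * torch.randn_like(x0)

@torch.no_grad()
def backward_step(denoiser, unit_embed, xt, t, alphas_cumprod):
    """Denoise in the discrete unit space, then map back to the continuous space."""
    logits = denoiser(xt, t)                              # (B, L, V) over speech units
    units_hat = logits.argmax(dim=-1)                     # discrete prediction
    x0_hat = unit_embed(units_hat)                        # back to continuous space

    a_t = alphas_cumprod[t].view(-1, 1, 1)
    a_prev = alphas_cumprod[t - 1].view(-1, 1, 1)
    eps_hat = (xt - a_t.sqrt() * x0_hat) / (1 - a_t).sqrt()
    return a_prev.sqrt() * x0_hat + (1 - a_prev).sqrt() * eps_hat
```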