Wei-Ning Hsu


2022

Unified Speech-Text Pre-training for Speech Translation and Recognition
Yun Tang | Hongyu Gong | Ning Dong | Changhan Wang | Wei-Ning Hsu | Jiatao Gu | Alexei Baevski | Xian Li | Abdelrahman Mohamed | Michael Auli | Juan Pino
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In this work, we describe a method to jointly pre-train speech and text in an encoder-decoder modeling framework for speech translation and recognition. The proposed method utilizes multi-task learning to integrate four self-supervised and supervised subtasks for cross modality learning. A self-supervised speech subtask, which leverages unlabelled speech data, and a (self-)supervised text to text subtask, which makes use of abundant text training data, take up the majority of the pre-training time. Two auxiliary supervised speech tasks are included to unify speech and text modeling space. Detailed analysis reveals learning interference among subtasks. In order to alleviate the subtask interference, two pre-training configurations are proposed for speech translation and speech recognition respectively. Our experiments show the proposed method can effectively fuse speech and text information into one model. It achieves between 1.7 and 2.3 BLEU improvement above the state of the art on the MuST-C speech translation dataset and comparable WERs to wav2vec 2.0 on the Librispeech speech recognition task.

Direct Speech-to-Speech Translation With Discrete Units
Ann Lee | Peng-Jen Chen | Changhan Wang | Jiatao Gu | Sravya Popuri | Xutai Ma | Adam Polyak | Yossi Adi | Qing He | Yun Tang | Juan Pino | Wei-Ning Hsu
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation. We tackle the problem by first applying a self-supervised discrete speech encoder on the target speech and then training a sequence-to-sequence speech-to-unit translation (S2UT) model to predict the discrete representations of the target speech. When target text transcripts are available, we design a joint speech and text training framework that enables the model to generate dual modality output (speech and text) simultaneously in the same inference pass. Experiments on the Fisher Spanish-English dataset show that the proposed framework yields improvement of 6.7 BLEU compared with a baseline direct S2ST model that predicts spectrogram features. When trained without any text transcripts, our model performance is comparable to models that predict spectrograms and are trained with text supervision, showing the potential of our system for translation between unwritten languages.

Text-Free Prosody-Aware Generative Spoken Language Modeling
Eugene Kharitonov | Ann Lee | Adam Polyak | Yossi Adi | Jade Copet | Kushal Lakhotia | Tu Anh Nguyen | Morgane Riviere | Abdelrahman Mohamed | Emmanuel Dupoux | Wei-Ning Hsu
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Speech pre-training has primarily demonstrated efficacy on classification tasks, while its capability of generating novel speech, similar to how GPT-2 can generate coherent paragraphs, has barely been explored. Generative Spoken Language Modeling (GSLM) (CITATION) is the only prior work addressing the generative aspect of speech pre-training, which builds a text-free language model using discovered units. Unfortunately, because the units used in GSLM discard most prosodic information, GSLM fails to leverage prosody for better comprehension and does not generate expressive speech. In this work, we present a prosody-aware generative spoken language model (pGSLM). It is composed of a multi-stream transformer language model (MS-TLM) of speech, represented as discovered unit and prosodic feature streams, and an adapted HiFi-GAN model converting MS-TLM outputs to waveforms. Experimental results show that the pGSLM can utilize prosody to improve both prosody and content modeling, and also generate natural, meaningful, and coherent speech given a spoken prompt. Audio samples can be found at https://speechbot.github.io/pgslm. Codes and models are available at https://github.com/pytorch/fairseq/tree/main/examples/textless_nlp/pgslm.

2021

fairseq S^2: A Scalable and Integrable Speech Synthesis Toolkit
Changhan Wang | Wei-Ning Hsu | Yossi Adi | Adam Polyak | Ann Lee | Peng-Jen Chen | Jiatao Gu | Juan Pino
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

This paper presents fairseq S^2, a fairseq extension for speech synthesis. We implement a number of autoregressive (AR) and non-AR text-to-speech models, and their multi-speaker variants. To enable training speech synthesis models with less curated data, a number of preprocessing tools are built and their importance is shown empirically. To facilitate faster iteration of development and analysis, a suite of automatic metrics is included. Apart from the features added specifically for this extension, fairseq S^2 also benefits from the scalability offered by fairseq and can be easily integrated with other state-of-the-art systems provided in this framework. The code, documentation, and pre-trained models will be made available at https://github.com/pytorch/fairseq/tree/master/examples/speech_synthesis.

Text-Free Image-to-Speech Synthesis Using Learned Segmental Units
Wei-Ning Hsu | David Harwath | Tyler Miller | Christopher Song | James Glass
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

In this paper we present the first model for directly synthesizing fluent, natural-sounding spoken audio captions for images that does not require natural language text as an intermediate representation or source of supervision. Instead, we connect the image captioning module and the speech synthesis module with a set of discrete, sub-word speech units that are discovered with a self-supervised visual grounding task. We conduct experiments on the Flickr8k spoken caption dataset in addition to a novel corpus of spoken audio captions collected for the popular MSCOCO dataset, demonstrating that our generated captions also capture diverse visual semantics of the images they describe. We investigate several different intermediate speech representations, and empirically find that the representation must satisfy several important properties to serve as drop-in replacements for text.

On Generative Spoken Language Modeling from Raw Audio
Kushal Lakhotia | Eugene Kharitonov | Wei-Ning Hsu | Yossi Adi | Adam Polyak | Benjamin Bolte | Tu-Anh Nguyen | Jade Copet | Alexei Baevski | Abdelrahman Mohamed | Emmanuel Dupoux
Transactions of the Association for Computational Linguistics, Volume 9

We introduce Generative Spoken Language Modeling, the task of learning the acoustic and linguistic characteristics of a language from raw audio (no text, no labels), and a set of metrics to automatically evaluate the learned representations at acoustic and linguistic levels for both encoding and generation. We set up baseline systems consisting of a discrete speech encoder (returning pseudo-text units), a generative language model (trained on pseudo-text), and a speech decoder (generating a waveform from pseudo-text) all trained without supervision and validate the proposed metrics with human evaluation. Across 3 speech encoders (CPC, wav2vec 2.0, HuBERT), we find that the number of discrete units (50, 100, or 200) matters in a task-dependent and encoder-dependent way, and that some combinations approach text-based systems.

2016

Neural Attention for Learning to Rank Questions in Community Question Answering
Salvatore Romeo | Giovanni Da San Martino | Alberto Barrón-Cedeño | Alessandro Moschitti | Yonatan Belinkov | Wei-Ning Hsu | Yu Zhang | Mitra Mohtarami | James Glass
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

In real-world data, e.g., from Web forums, text is often contaminated with redundant or irrelevant content, which leads to introducing noise in machine learning algorithms. In this paper, we apply Long Short-Term Memory networks with an attention mechanism, which can select important parts of text for the task of similar question retrieval from community Question Answering (cQA) forums. In particular, we use the attention weights for both selecting entire sentences and their subparts, i.e., word/chunk, from shallow syntactic trees. More interestingly, we apply tree kernels to the filtered text representations, thus exploiting the implicit features of the subtree space for learning question reranking. Our results show that the attention-based pruning allows for achieving the top position in the cQA challenge of SemEval 2016, with a relatively large gap from the other participants while greatly decreasing running time.

SLS at SemEval-2016 Task 3: Neural-based Approaches for Ranking in Community Question Answering
Mitra Mohtarami | Yonatan Belinkov | Wei-Ning Hsu | Yu Zhang | Tao Lei | Kfir Bar | Scott Cyphers | Jim Glass
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)