Peng-Jen Chen


2023

pdf bib
Simple and Effective Unsupervised Speech Translation
Changhan Wang | Hirofumi Inaguma | Peng-Jen Chen | Ilia Kulikov | Yun Tang | Wei-Ning Hsu | Michael Auli | Juan Pino
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The amount of labeled data to train models for speech tasks is limited for most languages, however, the data scarcity is exacerbated for speech translation which requires labeled data covering two different languages. To address this issue, we study a simple and effective approach to build speech translation systems without labeled data by leveraging recent advances in unsupervised speech recognition, machine translation and speech synthesis, either in a pipeline approach, or to generate pseudo-labels for training end-to-end speech translation models. Furthermore, we present an unsupervised domain adaptation technique for pre-trained speech models which improves the performance of downstream unsupervised speech recognition, especially for low-resource settings. Experiments show that unsupervised speech-to-text translation outperforms the previous unsupervised state of the art by 3.2 BLEU on the Libri-Trans benchmark, on CoVoST 2, our best systems outperform the best supervised end-to-end models (without pre-training) from only two years ago by an average of 5.0 BLEU over five X-En directions. We also report competitive results on MuST-C and CVSS benchmarks.

pdf bib
UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units
Hirofumi Inaguma | Sravya Popuri | Ilia Kulikov | Peng-Jen Chen | Changhan Wang | Yu-An Chung | Yun Tang | Ann Lee | Shinji Watanabe | Juan Pino
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Direct speech-to-speech translation (S2ST), in which all components can be optimized jointly, is advantageous over cascaded approaches to achieve fast inference with a simplified pipeline. We present a novel two-pass direct S2ST architecture, UnitY, which first generates textual representations and predicts discrete acoustic units subsequently. We enhance the model performance by subword prediction in the first-pass decoder, advanced two-pass decoder architecture design and search strategy, and better training regularization. To leverage large amounts of unlabeled text data, we pre-train the first-pass text decoder based on the self-supervised denoising auto-encoding task. Experimental evaluations on benchmark datasets at various data scales demonstrate that UnitY outperforms a single-pass speech-to-unit translation model by 2.5-4.2 ASR-BLEU with 2.83x decoding speed-up. We show that the proposed methods boost the performance even when predicting spectrogram in the second pass. However, predicting discrete units achieves 2.51x decoding speed-up compared to that case.

pdf bib
Speech-to-Speech Translation for a Real-world Unwritten Language
Peng-Jen Chen | Kevin Tran | Yilin Yang | Jingfei Du | Justine Kao | Yu-An Chung | Paden Tomasello | Paul-Ambroise Duquenne | Holger Schwenk | Hongyu Gong | Hirofumi Inaguma | Sravya Popuri | Changhan Wang | Juan Pino | Wei-Ning Hsu | Ann Lee
Findings of the Association for Computational Linguistics: ACL 2023

We study speech-to-speech translation (S2ST) that translates speech from one language into another language and focuses on building systems to support languages without standard text writing systems. We use English-Taiwanese Hokkien as a case study, and present an end-to-end solution from training data collection, modeling choices to benchmark dataset release. First, we present efforts on creating human annotated data, automatically mining data from large unlabeled speech datasets, and adopting pseudo-labeling to produce weakly supervised data. On the modeling, we take advantage of recent advances in applying self-supervised discrete representations as target for prediction in S2ST and show the effectiveness of leveraging additional text supervision from Mandarin, a language similar to Hokkien, in model training. Finally, we release an S2ST benchmark set to facilitate future research in this field.

2022

pdf bib
Direct Speech-to-Speech Translation With Discrete Units
Ann Lee | Peng-Jen Chen | Changhan Wang | Jiatao Gu | Sravya Popuri | Xutai Ma | Adam Polyak | Yossi Adi | Qing He | Yun Tang | Juan Pino | Wei-Ning Hsu
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation. We tackle the problem by first applying a self-supervised discrete speech encoder on the target speech and then training a sequence-to-sequence speech-to-unit translation (S2UT) model to predict the discrete representations of the target speech. When target text transcripts are available, we design a joint speech and text training framework that enables the model to generate dual modality output (speech and text) simultaneously in the same inference pass. Experiments on the Fisher Spanish-English dataset show that the proposed framework yields improvement of 6.7 BLEU compared with a baseline direct S2ST model that predicts spectrogram features. When trained without any text transcripts, our model performance is comparable to models that predict spectrograms and are trained with text supervision, showing the potential of our system for translation between unwritten languages.

pdf bib
Textless Speech-to-Speech Translation on Real Data
Ann Lee | Hongyu Gong | Paul-Ambroise Duquenne | Holger Schwenk | Peng-Jen Chen | Changhan Wang | Sravya Popuri | Yossi Adi | Juan Pino | Jiatao Gu | Wei-Ning Hsu
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another language and can be built without the need of any text data. Different from existing work in the literature, we tackle the challenge in modeling multi-speaker target speech and train the systems with real-world S2ST data. The key to our approach is a self-supervised unit-based speech normalization technique, which finetunes a pre-trained speech encoder with paired audios from multiple speakers and a single reference speaker to reduce the variations due to accents, while preserving the lexical content. With only 10 minutes of paired data for speech normalization, we obtain on average 3.2 BLEU gain when training the S2ST model on the VoxPopuli S2ST dataset, compared to a baseline trained on un-normalized speech target. We also incorporate automatically mined S2ST data and show an additional 2.0 BLEU gain. To our knowledge, we are the first to establish a textless S2ST technique that can be trained with real-world data and works for multiple language pairs.

pdf bib
The Flores-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation
Naman Goyal | Cynthia Gao | Vishrav Chaudhary | Peng-Jen Chen | Guillaume Wenzek | Da Ju | Sanjana Krishnan | Marc’Aurelio Ranzato | Francisco Guzmán | Angela Fan
Transactions of the Association for Computational Linguistics, Volume 10

One of the biggest challenges hindering progress in low-resource and multilingual machine translation is the lack of good evaluation benchmarks. Current evaluation benchmarks either lack good coverage of low-resource languages, consider only restricted domains, or are low quality because they are constructed using semi-automatic procedures. In this work, we introduce the Flores-101 evaluation benchmark, consisting of 3001 sentences extracted from English Wikipedia and covering a variety of different topics and domains. These sentences have been translated in 101 languages by professional translators through a carefully controlled process. The resulting dataset enables better assessment of model quality on the long tail of low-resource languages, including the evaluation of many-to-many multilingual translation systems, as all translations are fully aligned. By publicly releasing such a high-quality and high-coverage dataset, we hope to foster progress in the machine translation community and beyond.

2021

pdf bib
Multilingual Translation from Denoising Pre-Training
Yuqing Tang | Chau Tran | Xian Li | Peng-Jen Chen | Naman Goyal | Vishrav Chaudhary | Jiatao Gu | Angela Fan
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

pdf bib
The Source-Target Domain Mismatch Problem in Machine Translation
Jiajun Shen | Peng-Jen Chen | Matthew Le | Junxian He | Jiatao Gu | Myle Ott | Michael Auli | Marc’Aurelio Ranzato
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

While we live in an increasingly interconnected world, different places still exhibit strikingly different cultures and many events we experience in our every day life pertain only to the specific place we live in. As a result, people often talk about different things in different parts of the world. In this work we study the effect of local context in machine translation and postulate that this causes the domains of the source and target language to greatly mismatch. We first formalize the concept of source-target domain mismatch, propose a metric to quantify it, and provide empirical evidence for its existence. We conclude with an empirical study of how source-target domain mismatch affects training of machine translation systems on low resource languages. While this may severely affect back-translation, the degradation can be alleviated by combining back-translation with self-training and by increasing the amount of target side monolingual data.

pdf bib
fairseq Sˆ2: A Scalable and Integrable Speech Synthesis Toolkit
Changhan Wang | Wei-Ning Hsu | Yossi Adi | Adam Polyak | Ann Lee | Peng-Jen Chen | Jiatao Gu | Juan Pino
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

This paper presents fairseq Sˆ2, a fairseq extension for speech synthesis. We implement a number of autoregressive (AR) and non-AR text-to-speech models, and their multi-speaker variants. To enable training speech synthesis models with less curated data, a number of preprocessing tools are built and their importance is shown empirically. To facilitate faster iteration of development and analysis, a suite of automatic metrics is included. Apart from the features added specifically for this extension, fairseq Sˆ2 also benefits from the scalability offered by fairseq and can be easily integrated with other state-of-the-art systems provided in this framework. The code, documentation, and pre-trained models will be made available at https://github.com/pytorch/fairseq/tree/master/examples/speech_synthesis.

2020

pdf bib
Facebook AI’s WMT20 News Translation Task Submission
Peng-Jen Chen | Ann Lee | Changhan Wang | Naman Goyal | Angela Fan | Mary Williamson | Jiatao Gu
Proceedings of the Fifth Conference on Machine Translation

This paper describes Facebook AI’s submission to WMT20 shared news translation task. We focus on the low resource setting and participate in two language pairs, Tamil <-> English and Inuktitut <-> English, where there are limited out-of-domain bitext and monolingual data. We approach the low resource problem using two main strategies, leveraging all available data and adapting the system to the target news domain. We explore techniques that leverage bitext and monolingual data from all languages, such as self-supervised model pretraining, multilingual models, data augmentation, and reranking. To better adapt the translation system to the test domain, we explore dataset tagging and fine-tuning on in-domain data. We observe that different techniques provide varied improvements based on the available data of the language pair. Based on the finding, we integrate these techniques into one training pipeline. For En->Ta, we explore an unconstrained setup with additional Tamil bitext and monolingual data and show that further improvement can be obtained. On the test set, our best submitted systems achieve 21.5 and 13.7 BLEU for Ta->En and En->Ta respectively, and 27.9 and 13.0 for Iu->En and En->Iu respectively.

pdf bib
Findings of the WMT 2020 Shared Task on Parallel Corpus Filtering and Alignment
Philipp Koehn | Vishrav Chaudhary | Ahmed El-Kishky | Naman Goyal | Peng-Jen Chen | Francisco Guzmán
Proceedings of the Fifth Conference on Machine Translation

Following two preceding WMT Shared Task on Parallel Corpus Filtering (Koehn et al., 2018, 2019), we posed again the challenge of assigning sentence-level quality scores for very noisy corpora of sentence pairs crawled from the web, with the goal of sub-selecting the highest-quality data to be used to train ma-chine translation systems. This year, the task tackled the low resource condition of Pashto–English and Khmer–English and also included the challenge of sentence alignment from document pairs.

2019

pdf bib
The FLORES Evaluation Datasets for Low-Resource Machine Translation: Nepali–English and Sinhala–English
Francisco Guzmán | Peng-Jen Chen | Myle Ott | Juan Pino | Guillaume Lample | Philipp Koehn | Vishrav Chaudhary | Marc’Aurelio Ranzato
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

For machine translation, a vast majority of language pairs in the world are considered low-resource because they have little parallel data available. Besides the technical challenges of learning with limited supervision, it is difficult to evaluate methods trained on low-resource language pairs because of the lack of freely and publicly available benchmarks. In this work, we introduce the FLORES evaluation datasets for Nepali–English and Sinhala– English, based on sentences translated from Wikipedia. Compared to English, these are languages with very different morphology and syntax, for which little out-of-domain parallel data is available and for which relatively large amounts of monolingual data are freely available. We describe our process to collect and cross-check the quality of translations, and we report baseline performance using several learning settings: fully supervised, weakly supervised, semi-supervised, and fully unsupervised. Our experiments demonstrate that current state-of-the-art methods perform rather poorly on this benchmark, posing a challenge to the research community working on low-resource MT. Data and code to reproduce our experiments are available at https://github.com/facebookresearch/flores.

pdf bib
Facebook AI’s WAT19 Myanmar-English Translation Task Submission
Peng-Jen Chen | Jiajun Shen | Matthew Le | Vishrav Chaudhary | Ahmed El-Kishky | Guillaume Wenzek | Myle Ott | Marc’Aurelio Ranzato
Proceedings of the 6th Workshop on Asian Translation

This paper describes Facebook AI’s submission to the WAT 2019 Myanmar-English translation task. Our baseline systems are BPE-based transformer models. We explore methods to leverage monolingual data to improve generalization, including self-training, back-translation and their combination. We further improve results by using noisy channel re-ranking and ensembling. We demonstrate that these techniques can significantly improve not only a system trained with additional monolingual data, but even the baseline system trained exclusively on the provided small parallel dataset. Our system ranks first in both directions according to human evaluation and BLEU, with a gain of over 8 BLEU points above the second best system.