2024
An Empirical Study of Consistency Regularization for End-to-End Speech-to-Text Translation
Pengzhi Gao | Ruiqing Zhang | Zhongjun He | Hua Wu | Haifeng Wang
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Consistency regularization methods, such as R-Drop (Liang et al., 2021) and CrossConST (Gao et al., 2023), have achieved impressive supervised and zero-shot performance in the neural machine translation (NMT) field. Can we also boost end-to-end (E2E) speech-to-text translation (ST) by leveraging consistency regularization? In this paper, we conduct empirical studies on intra-modal and cross-modal consistency and propose two training strategies, SimRegCR and SimZeroCR, for E2E ST in regular and zero-shot scenarios. Experiments on the MuST-C benchmark show that our approaches achieve state-of-the-art (SOTA) performance in most translation directions. The analyses show that the regularization brought by intra-modal consistency, rather than the modality gap, is crucial for regular E2E ST, and that cross-modal consistency can close the modality gap and boost zero-shot E2E ST performance.
2022
Non-Autoregressive Chinese ASR Error Correction with Phonological Training
Zheng Fang | Ruiqing Zhang | Zhongjun He | Hua Wu | Yanan Cao
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Automatic Speech Recognition (ASR) is an efficient and widely used input method that transcribes speech signals into text. As the errors introduced by ASR systems impair the performance of downstream tasks, we introduce a post-processing error correction method, PhVEC, to correct errors in text space. For errors in ASR results, existing work mainly focuses on fixed-length corrections, modifying each wrong token to a correct one (one-to-one correction), but rarely considers variable-length correction (one-to-many or many-to-one correction). In this paper, we propose an efficient non-autoregressive (NAR) method for Chinese ASR error correction that handles both cases. Instead of predicting the sentence length, as NAR methods conventionally do, we propose a novel approach that uses phonological tokens to extend the source sentence for variable-length correction, enabling our model to generate phonetically similar corrections. Experimental results on datasets of different domains show that our method achieves significant improvement in word error rate reduction and speeds up inference by 6.2 times compared with the autoregressive model.
Proceedings of the Third Workshop on Automatic Simultaneous Translation
Julia Ive | Ruiqing Zhang
Proceedings of the Third Workshop on Automatic Simultaneous Translation
Findings of the Third Workshop on Automatic Simultaneous Translation
Ruiqing Zhang | Chuanqiang Zhang | Zhongjun He | Hua Wu | Haifeng Wang | Liang Huang | Qun Liu | Julia Ive | Wolfgang Macherey
Proceedings of the Third Workshop on Automatic Simultaneous Translation
This paper reports the results of the shared task we hosted at the Third Workshop on Automatic Simultaneous Translation (AutoSimTrans). The shared task aims to promote the development of text-to-text and speech-to-text simultaneous translation, and includes Chinese-English and English-Spanish tracks. The number of systems submitted this year has increased fourfold compared with last year. Additionally, the top-ranked system in the speech-to-text track is the first end-to-end submission we have received in the past three years, and it has shown great potential. This paper reports the results and descriptions of the 14 participating teams, compares different evaluation metrics, and revisits the ranking method.
Learning Adaptive Segmentation Policy for End-to-End Simultaneous Translation
Ruiqing Zhang | Zhongjun He | Hua Wu | Haifeng Wang
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
End-to-end simultaneous speech-to-text translation aims to directly perform translation from streaming source speech to target text with high translation quality and low latency. A typical simultaneous translation (ST) system consists of a speech translation model and a policy module, which determines when to wait and when to translate. Thus, the policy is crucial for balancing translation quality and latency. Conventional methods usually adopt fixed policies, e.g., segmenting the source speech into fixed-length chunks and translating each chunk. However, such methods ignore contextual information and suffer from low translation quality. This paper proposes an adaptive segmentation policy for end-to-end ST. Inspired by human interpreters, the policy learns to segment the source streaming speech into meaningful units by considering both acoustic features and translation history, maintaining consistency between the segmentation and translation. Experimental results on English-German and Chinese-English show that our method achieves a better accuracy-latency trade-off than recently proposed state-of-the-art methods.
2021
Correcting Chinese Spelling Errors with Phonetic Pre-training
Ruiqing Zhang | Chao Pang | Chuanqiang Zhang | Shuohuan Wang | Zhongjun He | Yu Sun | Hua Wu | Haifeng Wang
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021
Proceedings of the Second Workshop on Automatic Simultaneous Translation
Hua Wu | Colin Cherry | Liang Huang | Zhongjun He | Qun Liu | Maha Elbayad | Mark Liberman | Haifeng Wang | Mingbo Ma | Ruiqing Zhang
Proceedings of the Second Workshop on Automatic Simultaneous Translation
BSTC: A Large-Scale Chinese-English Speech Translation Dataset
Ruiqing Zhang | Xiyang Wang | Chuanqiang Zhang | Zhongjun He | Hua Wu | Zhi Li | Haifeng Wang | Ying Chen | Qinfei Li
Proceedings of the Second Workshop on Automatic Simultaneous Translation
This paper presents BSTC (Baidu Speech Translation Corpus), a large-scale Chinese-English speech translation dataset. This dataset is constructed based on a collection of licensed videos of talks or lectures, including about 68 hours of Mandarin data, their manual transcripts and translations into English, as well as automated transcripts by an automatic speech recognition (ASR) model. We have further asked three experienced interpreters to simultaneously interpret the testing talks in a mock conference setting. This corpus is expected to promote the research of automatic simultaneous translation as well as the development of practical systems. We have organized simultaneous translation tasks and used this corpus to evaluate automatic simultaneous translation systems.
Findings of the Second Workshop on Automatic Simultaneous Translation
Ruiqing Zhang | Chuanqiang Zhang | Zhongjun He | Hua Wu | Haifeng Wang
Proceedings of the Second Workshop on Automatic Simultaneous Translation
This paper presents the results of the shared task of the 2nd Workshop on Automatic Simultaneous Translation (AutoSimTrans). The task includes two tracks, one for text-to-text translation and one for speech-to-text translation, requiring participants to build systems that translate from either the source text or the source speech into the target text. Unlike traditional machine translation, the AutoSimTrans shared task evaluates not only translation quality but also latency. We propose a metric, “Monotonic Optimal Sequence” (MOS), that considers both quality and latency to rank the submissions. We also discuss some important open issues in simultaneous translation.
2020
Dynamic Sentence Boundary Detection for Simultaneous Translation
Ruiqing Zhang | Chuanqiang Zhang
Proceedings of the First Workshop on Automatic Simultaneous Translation
Simultaneous translation is a great challenge in which translation starts before the source sentence is finished. Most studies take transcription as input and focus on balancing translation quality and latency for each sentence. However, most ASR systems cannot provide accurate sentence boundaries in real time. Thus, segmenting the streaming words into sentences before translation is a key problem. In this paper, we propose a novel method for sentence boundary detection that treats it as a multi-class classification task under the end-to-end pre-training framework. Experiments show significant improvements in terms of both translation quality and latency.
Learning Adaptive Segmentation Policy for Simultaneous Translation
Ruiqing Zhang | Chuanqiang Zhang | Zhongjun He | Hua Wu | Haifeng Wang
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Balancing accuracy and latency is a great challenge for simultaneous translation. To achieve high accuracy, the model usually needs to wait for more streaming text before translating, which increases latency. Conversely, keeping latency low tends to hurt accuracy. Therefore, it is essential to segment the ASR output into appropriate units for translation. Inspired by human interpreters, we propose a novel adaptive segmentation policy for simultaneous translation. The policy learns to segment the source text by considering possible translations produced by the translation model, maintaining consistency between the segmentation and translation. Experimental results on Chinese-English and German-English translation show that our method achieves a better accuracy-latency trade-off than recently proposed state-of-the-art methods.