End-to-End Single-Channel Speaker-Turn Aware Conversational Speech Translation

Juan Pablo Zuluaga-Gomez, Zhaocheng Huang, Xing Niu, Rohit Paturi, Sundararajan Srinivasan, Prashant Mathur, Brian Thompson, Marcello Federico


Abstract
Conventional speech-to-text translation (ST) systems are trained on single-speaker utterances, and they may not generalize to real-life scenarios where the audio contains conversations between multiple speakers. In this paper, we tackle single-channel multi-speaker conversational ST with an end-to-end, multi-task model named Speaker-Turn Aware Conversational Speech Translation, which combines automatic speech recognition, speech translation, and speaker turn detection using special tokens in a serialized labeling format. We run experiments on the Fisher-CALLHOME corpus, which we adapted by merging the two single-speaker channels into one multi-speaker channel, thus representing the more realistic and challenging scenario with multi-speaker turns and cross-talk. Experimental results across single- and multi-speaker conditions and against conventional ST systems show that our model outperforms the reference systems on the multi-speaker condition, while attaining comparable performance on the single-speaker condition. We release scripts for data processing and model training.
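Two ideas from the abstract, merging two single-speaker channels into one and serializing a multi-speaker transcript with special tokens, can be illustrated with a minimal sketch. This is not the authors' released code: the function names, the averaging downmix, the overlap rule, and the token strings `<turn>` and `<xt>` are all assumptions chosen for illustration.

```python
# Hypothetical special tokens; the strings used in the paper may differ.
TURN = "<turn>"  # a different speaker starts after the previous speech ended
XT = "<xt>"      # a different speaker starts while earlier speech is ongoing


def downmix(left, right):
    """Average two mono channels (same sample rate) into a single channel.

    The shorter channel is padded with silence (zeros) at the end.
    """
    n = max(len(left), len(right))
    left = list(left) + [0.0] * (n - len(left))
    right = list(right) + [0.0] * (n - len(right))
    return [(a + b) / 2.0 for a, b in zip(left, right)]


def serialize(segments):
    """Join (speaker, start, end, text) segments into one labeled string.

    Segments are ordered by start time. A token is emitted only when the
    speaker changes: XT if the new segment overlaps earlier speech,
    TURN otherwise.
    """
    out = []
    prev_speaker = None
    latest_end = float("-inf")
    for speaker, start, end, text in sorted(segments, key=lambda s: s[1]):
        if prev_speaker is not None and speaker != prev_speaker:
            out.append(XT if start < latest_end else TURN)
        out.append(text)
        prev_speaker = speaker
        latest_end = max(latest_end, end)
    return " ".join(out)


mixed = downmix([1.0, 1.0], [0.0])
target = serialize([
    ("A", 0.0, 2.0, "hola"),
    ("B", 2.5, 3.0, "sí"),
    ("A", 2.8, 4.0, "bueno"),
])
```

Here the third segment starts before the second one ends, so it is marked with the assumed cross-talk token rather than the plain turn token. A model trained on such targets would emit the tokens inline with the transcript or translation.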
Anthology ID:
2023.emnlp-main.449
Volume:
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
7255–7274
URL:
https://aclanthology.org/2023.emnlp-main.449
DOI:
10.18653/v1/2023.emnlp-main.449
Cite (ACL):
Juan Pablo Zuluaga-Gomez, Zhaocheng Huang, Xing Niu, Rohit Paturi, Sundararajan Srinivasan, Prashant Mathur, Brian Thompson, and Marcello Federico. 2023. End-to-End Single-Channel Speaker-Turn Aware Conversational Speech Translation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7255–7274, Singapore. Association for Computational Linguistics.
Cite (Informal):
End-to-End Single-Channel Speaker-Turn Aware Conversational Speech Translation (Zuluaga-Gomez et al., EMNLP 2023)
PDF:
https://aclanthology.org/2023.emnlp-main.449.pdf
Video:
 https://aclanthology.org/2023.emnlp-main.449.mp4