JoeyS2T: Minimalistic Speech-to-Text Modeling with JoeyNMT

Mayumi Ohta, Julia Kreutzer, Stefan Riezler


Abstract
JoeyS2T is a JoeyNMT extension for speech-to-text tasks such as automatic speech recognition and end-to-end speech translation. It inherits the core philosophy of JoeyNMT, a minimalist NMT toolkit built on PyTorch, seeking simplicity and accessibility. JoeyS2T’s workflow is self-contained, running from data pre-processing through model training and prediction to evaluation, and is seamlessly integrated into JoeyNMT’s compact and simple code base. On top of JoeyNMT’s state-of-the-art Transformer-based Encoder-Decoder architecture, JoeyS2T provides speech-oriented components such as convolutional layers, SpecAugment, CTC-loss, and WER evaluation. Despite its simplicity compared to prior implementations, JoeyS2T performs competitively on English speech recognition and English-to-German speech translation benchmarks. The implementation is accompanied by a walk-through tutorial and is available at https://github.com/may-/joeys2t.
Anthology ID:
2022.emnlp-demos.6
Volume:
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Month:
December
Year:
2022
Address:
Abu Dhabi, UAE
Editors:
Wanxiang Che, Ekaterina Shutova
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
50–59
URL:
https://aclanthology.org/2022.emnlp-demos.6
DOI:
10.18653/v1/2022.emnlp-demos.6
Cite (ACL):
Mayumi Ohta, Julia Kreutzer, and Stefan Riezler. 2022. JoeyS2T: Minimalistic Speech-to-Text Modeling with JoeyNMT. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 50–59, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
JoeyS2T: Minimalistic Speech-to-Text Modeling with JoeyNMT (Ohta et al., EMNLP 2022)
PDF:
https://aclanthology.org/2022.emnlp-demos.6.pdf