Align-Refine: Non-Autoregressive Speech Recognition via Iterative Realignment

Ethan A. Chi, Julian Salazar, Katrin Kirchhoff


Abstract
Non-autoregressive encoder-decoder models greatly improve decoding speed over autoregressive models, at the expense of generation quality. To mitigate this, iterative decoding models repeatedly infill or refine the proposal of a non-autoregressive model. However, editing at the level of output sequences limits model flexibility. We instead propose *iterative realignment*, which by refining latent alignments allows more flexible edits in fewer steps. Our model, Align-Refine, is an end-to-end Transformer which iteratively realigns connectionist temporal classification (CTC) alignments. On the WSJ dataset, Align-Refine matches an autoregressive baseline with a 14x decoding speedup; on LibriSpeech, we reach an LM-free test-other WER of 9.0% (19% relative improvement on comparable work) in three iterations. We release our code at https://github.com/amazon-research/align-refine.
Anthology ID:
2021.naacl-main.154
Volume:
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Month:
June
Year:
2021
Address:
Online
Venue:
NAACL
Pages:
1920–1927
Language:
URL:
https://aclanthology.org/2021.naacl-main.154
DOI:
10.18653/v1/2021.naacl-main.154
PDF:
https://aclanthology.org/2021.naacl-main.154.pdf
Data:
LibriSpeech