Speech-enriched Memory for Inference-time Adaptation of ASR Models to Word Dictionaries

Ashish Mittal, Sunita Sarawagi, Preethi Jyothi, George Saon, Gakuto Kurata


Abstract
Despite the impressive performance of ASR models on mainstream benchmarks, their performance on rare words is unsatisfactory. In enterprise settings, a focused list of entities (such as locations, names, etc.) is often available, which can be used to adapt the model to the terminology of specific domains. In this paper, we present a novel inference algorithm that improves the prediction of state-of-the-art ASR models using nearest-neighbor-based matching on an inference-time word list. We consider both the Transducer architecture, which is useful in the streaming setting, and state-of-the-art encoder-decoder models such as Whisper. In our approach, a list of rare entities is indexed in a memory by synthesizing speech for each entry and then storing the internal acoustic and language model states obtained from the best possible alignment on the ASR model. The memory is organized as a trie, which we harness to perform a stateful lookup during inference. A key property of our extension is that we prevent spurious matches by restricting lookups to word-level matches. In our experiments on publicly available datasets and private benchmarks, we show that our method significantly improves rare word recognition.
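The trie-organized memory and word-level matching described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the names `TrieNode`, `insert`, `step`, `lookup`, and `max_dist` are hypothetical, the two-dimensional vectors are toy stand-ins for the stored acoustic and language model states, and Euclidean distance with a fixed threshold is an assumed matching rule. The sketch only shows the general idea of walking a trie one step at a time and reporting a match only when a complete entry has been traversed.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TrieNode:
    # Toy stand-in for the model state stored for the token leading to this node.
    key: Optional[List[float]] = None
    children: Dict[str, "TrieNode"] = field(default_factory=dict)
    # Set only on the node that completes a full dictionary entry.
    word: Optional[str] = None


def insert(root: TrieNode, word: str, tokens: List[str],
           states: List[List[float]]) -> None:
    """Index one entry: one trie edge per token, one stored vector per node."""
    node = root
    for tok, state in zip(tokens, states):
        if tok not in node.children:
            node.children[tok] = TrieNode(key=state)
        node = node.children[tok]
    node.word = word


def step(node: TrieNode, query: List[float],
         max_dist: float) -> Optional[TrieNode]:
    """Move to the nearest child within max_dist, or return None if none is close."""
    best, best_d = None, max_dist
    for child in node.children.values():
        d = math.dist(child.key, query)
        if d <= best_d:
            best, best_d = child, d
    return best


def lookup(root: TrieNode, queries: List[List[float]],
           max_dist: float = 0.5) -> Optional[str]:
    """Stateful walk: the current node is the state carried between steps.

    Returns a word only if the walk ends on a node that completes an entry,
    so a prefix of an entry never counts as a match.
    """
    node = root
    for q in queries:
        node = step(node, q, max_dist)
        if node is None:
            return None
    return node.word


root = TrieNode()
insert(root, "Kurata", ["ku", "ra", "ta"],
       [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

print(lookup(root, [[0.9, 0.1], [0.1, 0.9], [1.0, 0.9]]))  # Kurata
print(lookup(root, [[0.9, 0.1], [0.1, 0.9]]))              # None (prefix only)
```

In this toy version, the second call returns `None` because the walk stops at an interior node. That is the sketch's analogue of restricting matches to whole words.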
Anthology ID:
2023.emnlp-main.916
Volume:
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
14820–14835
URL:
https://aclanthology.org/2023.emnlp-main.916
DOI:
10.18653/v1/2023.emnlp-main.916
Cite (ACL):
Ashish Mittal, Sunita Sarawagi, Preethi Jyothi, George Saon, and Gakuto Kurata. 2023. Speech-enriched Memory for Inference-time Adaptation of ASR Models to Word Dictionaries. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 14820–14835, Singapore. Association for Computational Linguistics.
Cite (Informal):
Speech-enriched Memory for Inference-time Adaptation of ASR Models to Word Dictionaries (Mittal et al., EMNLP 2023)
PDF:
https://aclanthology.org/2023.emnlp-main.916.pdf
Video:
https://aclanthology.org/2023.emnlp-main.916.mp4