Enhancing Dialogue Speech Recognition with Robust Contextual Awareness via Noise Representation Learning

Wonjun Lee, San Kim, Gary Geunbae Lee


Abstract
Recent dialogue systems typically operate through turn-based spoken interactions between users and agents. These systems heavily depend on accurate Automatic Speech Recognition (ASR), as transcription errors can significantly degrade performance in downstream dialogue tasks. To address this challenge, robust ASR is required, and one effective method is to utilize the dialogue context from user and agent interactions when transcribing the subsequent user utterance. This method incorporates the transcription of the user's speech and the agent's response as model input, using the context accumulated over each turn. However, this context is susceptible to ASR errors because the ASR model generates it auto-regressively. Such noisy context can undermine the benefits of context input, resulting in suboptimal ASR performance. In this paper, we introduce context noise representation learning to enhance robustness against noisy context, ultimately improving dialogue speech recognition accuracy. To maximize the advantage of context awareness, our approach involves decoder pre-training with text-based dialogue data and noise representation learning for a context encoder. Evaluated on DSTC11 (MultiWoZ 2.1 audio dialogues), it achieves a 24% relative reduction in Word Error Rate (WER) compared to wav2vec2.0 baselines and a 13% reduction compared to Whisper-large-v2. Notably, in noisy environments where user speech is barely audible, our method proves its effectiveness by utilizing contextual information for accurate transcription. Tested on audio data with a strong noise level (Signal-to-Noise Ratio of 0 dB), our approach shows up to a 31% relative WER reduction compared to the wav2vec2.0 baseline, offering a practical solution for real-world noisy scenarios.
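The abstract describes feeding the accumulated dialogue context (prior user transcriptions and agent responses) to a context encoder, and training that encoder to be robust to ASR errors in the context. The sketch below is an illustrative assumption of how such ASR-noisy context could be simulated during training; the function names (inject_asr_noise, build_context), error rates, and formatting are hypothetical and are not taken from the paper's implementation.

```python
# Minimal sketch (not the authors' code): simulating ASR-error noise in the
# accumulated dialogue context that serves as auxiliary input to the ASR model.
import random

def inject_asr_noise(text, sub_rate=0.1, del_rate=0.05, ins_rate=0.05, vocab=None):
    """Corrupt a clean transcript with word-level substitutions, deletions,
    and insertions to mimic errors an auto-regressive ASR model might make."""
    vocab = vocab or ["the", "a", "to", "uh", "um"]
    noisy = []
    for word in text.split():
        r = random.random()
        if r < del_rate:
            continue                                  # deletion error: drop the word
        elif r < del_rate + sub_rate:
            noisy.append(random.choice(vocab))        # substitution error: replace it
        else:
            noisy.append(word)
        if random.random() < ins_rate:
            noisy.append(random.choice(vocab))        # insertion error: add a spurious word
    return " ".join(noisy)

def build_context(turns, noise=True):
    """Concatenate prior user/agent turns into one context string, optionally
    corrupting user turns (which would come from ASR output at inference time)."""
    parts = []
    for speaker, utterance in turns:
        if noise and speaker == "user":
            utterance = inject_asr_noise(utterance)
        parts.append(f"{speaker}: {utterance}")
    return " | ".join(parts)

if __name__ == "__main__":
    history = [
        ("user", "i need a cheap hotel in the centre"),
        ("agent", "there are two options, do you need parking?"),
    ]
    print(build_context(history))
```

In this hedged reading, training the context encoder on such corrupted histories would expose it to the kind of noise it sees at inference time, which is the robustness property the paper's noise representation learning targets.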
Anthology ID:
2024.sigdial-1.30
Volume:
Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue
Month:
September
Year:
2024
Address:
Kyoto, Japan
Editors:
Tatsuya Kawahara, Vera Demberg, Stefan Ultes, Koji Inoue, Shikib Mehri, David Howcroft, Kazunori Komatani
Venue:
SIGDIAL
SIG:
SIGDIAL
Publisher:
Association for Computational Linguistics
Pages:
333–343
URL:
https://aclanthology.org/2024.sigdial-1.30
Cite (ACL):
Wonjun Lee, San Kim, and Gary Geunbae Lee. 2024. Enhancing Dialogue Speech Recognition with Robust Contextual Awareness via Noise Representation Learning. In Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 333–343, Kyoto, Japan. Association for Computational Linguistics.
Cite (Informal):
Enhancing Dialogue Speech Recognition with Robust Contextual Awareness via Noise Representation Learning (Lee et al., SIGDIAL 2024)
PDF:
https://aclanthology.org/2024.sigdial-1.30.pdf