Session-level Language Modeling for Conversational Speech

Wayne Xiong, Lingfeng Wu, Jun Zhang, Andreas Stolcke

Abstract
We propose to generalize language models for conversational speech recognition to allow them to operate across utterance boundaries and speaker changes, thereby capturing conversation-level phenomena such as adjacency pairs, lexical entrainment, and topical coherence. The model consists of a long short-term memory (LSTM) recurrent network that reads the entire word-level history of a conversation, as well as information about turn taking and speaker overlap, in order to predict each next word. The model is applied in a rescoring framework, where the word history prior to the current utterance is approximated with preliminary recognition results. In experiments in the conversational telephone speech domain (Switchboard), we find that such a model gives substantial perplexity reductions over a standard LSTM-LM with utterance scope, as well as improvements in word error rate.
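To make the session-level idea concrete, below is a minimal sketch in PyTorch: a single LSTM reads the concatenated word stream of a whole conversation, with turn taking and overlap exposed as special input tokens, so that the hidden state carries context across utterance boundaries. Everything here is illustrative rather than the authors' implementation: the class name SessionLM, the toy vocabulary, and the token names <ts> (speaker change) and <ov> (overlapped speech) are assumptions of this sketch.

import torch
import torch.nn as nn

class SessionLM(nn.Module):
    """Word-level LSTM LM whose recurrent state spans a whole session.

    Turn taking and speaker overlap are spliced into the word stream
    as special tokens; this is one simple way to expose the signals
    the paper describes, not necessarily the authors' exact encoding.
    """
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, state=None):
        # tokens: (batch, seq_len) ids for the session text seen so far.
        out, state = self.lstm(self.embed(tokens), state)
        return self.proj(out), state  # next-word logits, carried-over state

# Hypothetical toy vocabulary: <ts> marks a speaker change, <ov>
# marks overlapped speech.
vocab = {"<ts>": 0, "<ov>": 1, "hello": 2, "there": 3, "hi": 4, "yeah": 5}
model = SessionLM(len(vocab))

# Two consecutive utterances by different speakers, concatenated into
# one stream instead of being modeled with per-utterance scope.
session = torch.tensor([[vocab["hello"], vocab["there"],
                         vocab["<ts>"], vocab["hi"], vocab["yeah"]]])
logits, state = model(session)
log_probs = torch.log_softmax(logits, dim=-1)

The rescoring use described in the abstract follows the same pattern: the LSTM state left after consuming preliminary (first-pass) recognition output for the preceding utterances serves as the initial state when scoring each hypothesis for the current utterance, approximating the true word history with recognized words.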
Anthology ID:
D18-1296
Volume:
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
Month:
October-November
Year:
2018
Address:
Brussels, Belgium
Editors:
Ellen Riloff, David Chiang, Julia Hockenmaier, Jun’ichi Tsujii
Venue:
EMNLP
SIG:
SIGDAT
Publisher:
Association for Computational Linguistics
Pages:
2764–2768
URL:
https://aclanthology.org/D18-1296
DOI:
10.18653/v1/D18-1296
Cite (ACL):
Wayne Xiong, Lingfeng Wu, Jun Zhang, and Andreas Stolcke. 2018. Session-level Language Modeling for Conversational Speech. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2764–2768, Brussels, Belgium. Association for Computational Linguistics.
Cite (Informal):
Session-level Language Modeling for Conversational Speech (Xiong et al., EMNLP 2018)
PDF:
https://aclanthology.org/D18-1296.pdf
Video:
https://aclanthology.org/D18-1296.mp4