Context-sensitive evaluation of automatic speech recognition: considering user experience & language variation
Nina Markl | Catherine Lai
Proceedings of the First Workshop on Bridging Human–Computer Interaction and Natural Language Processing
Commercial Automatic Speech Recognition (ASR) systems tend to show systemic predictive bias against marginalised speaker/user groups. Through a case study, we highlight the need for an interdisciplinary and context-sensitive approach to documenting this bias, one that incorporates perspectives and methods from sociolinguistics, speech & language technology and human-computer interaction. We argue that evaluation of ASR systems should be disaggregated by speaker group, include qualitative error analysis, and consider user experience in a broader sociolinguistic and social context.
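As a rough illustration of what evaluation disaggregated by speaker group could look like in practice, the sketch below computes word error rate (WER) separately for each group. It is not the authors' procedure: the function names `word_edit_distance` and `wer_by_group`, the group labels, the example transcripts, and the simple lowercase-and-split tokenisation are all assumptions made for this example.

```python
from collections import defaultdict


def word_edit_distance(ref, hyp):
    """Levenshtein distance between two lists of word tokens."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer_by_group(utterances):
    """Return {group: WER} for an iterable of (group, reference, hypothesis).

    Errors and reference lengths are pooled within each group, so one
    aggregate figure cannot hide a gap between groups.
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for group, reference, hypothesis in utterances:
        # Assumed tokenisation: lowercase, split on whitespace.
        ref_tokens = reference.lower().split()
        hyp_tokens = hypothesis.lower().split()
        errors[group] += word_edit_distance(ref_tokens, hyp_tokens)
        words[group] += len(ref_tokens)
    return {g: errors[g] / words[g] for g in words if words[g] > 0}


# Hypothetical data: group labels and transcripts are invented.
sample = [
    ("A", "the cat sat", "the cat sat"),
    ("B", "the cat sat down", "the hat sat"),
]
print(wer_by_group(sample))  # {'A': 0.0, 'B': 0.5}
```

A per-group WER table of this kind only covers the quantitative part of the argument; the qualitative error analysis and user-experience perspective would still require inspecting which words are misrecognised and how that affects users.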
Current multimodal sentiment analysis frames sentiment score prediction as a general machine learning task. However, what the sentiment score actually represents has often been overlooked. As a measurement of opinions and affective states, a sentiment score generally consists of two aspects: polarity and intensity. We decompose sentiment scores into these two aspects and study how they are conveyed through individual modalities and combined multimodal models in a naturalistic monologue setting. In particular, we build unimodal and multimodal multi-task learning models with sentiment score prediction as the main task and polarity and/or intensity classification as the auxiliary tasks. Our experiments show that sentiment analysis benefits from multi-task learning, and that individual modalities differ in how they convey the polarity and intensity aspects of sentiment.
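The decomposition and the multi-task setup described above can be sketched in a few lines. This is only an illustration under stated assumptions: the abstract does not give the score range, the class boundaries, or the loss weighting, so `decompose`, `multitask_loss`, the `neutral_band` threshold and the `aux_weight` value are all hypothetical.

```python
def decompose(score, neutral_band=0.0):
    """Split a signed sentiment score into (polarity, intensity).

    Polarity is the direction of the score and intensity its magnitude.
    The neutral_band threshold is an assumption, not taken from the source.
    """
    if score > neutral_band:
        polarity = "positive"
    elif score < -neutral_band:
        polarity = "negative"
    else:
        polarity = "neutral"
    return polarity, abs(score)


def multitask_loss(main_loss, aux_losses, aux_weight=0.5):
    """Combine the main-task loss with weighted auxiliary-task losses.

    main_loss: loss for sentiment score prediction (the main task).
    aux_losses: losses for polarity and/or intensity classification.
    aux_weight: assumed weighting; the source does not specify one.
    """
    return main_loss + aux_weight * sum(aux_losses)


print(decompose(-2.4))                      # ('negative', 2.4)
print(multitask_loss(1.0, [0.5, 0.5]))      # 1.5
```

For intensity classification, the magnitude returned by `decompose` would still need to be binned into classes; the bin edges are a design choice that this sketch leaves open.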