Empathic Machines: Using Intermediate Features as Levers to Emulate Emotions in Text-To-Speech Systems

Saiteja Kosgi, Sarath Sivaprasad, Niranjan Pedanekar, Anil Nelakanti, Vineet Gandhi


Abstract
We present a method to control the emotional prosody of Text-to-Speech (TTS) systems by using phoneme-level intermediate features (pitch, energy, and duration) as levers. As a key idea, we propose Differential Scaling (DS) to disentangle features relating to affective prosody from those arising due to acoustic conditions and speaker identity. With thorough experimental studies, we show that the proposed method improves over the prior art in accurately emulating the desired emotions while retaining the naturalness of speech. We extend the traditional evaluation on individual sentences to a more complete evaluation of HCI systems. We present a novel experimental setup in which a TTS system replaces an actor in offline and live conversations. The emotion to be rendered is either predicted or manually assigned. The results show that the proposed method is strongly preferred over the state-of-the-art TTS system and adds the much-coveted “human touch” to machine dialogue. Audio samples from our experiments and the code are available at: https://emtts.github.io/tts-demo/
Anthology ID:
2022.naacl-main.26
Volume:
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Month:
July
Year:
2022
Address:
Seattle, United States
Editors:
Marine Carpuat, Marie-Catherine de Marneffe, Ivan Vladimir Meza Ruiz
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
336–347
URL:
https://aclanthology.org/2022.naacl-main.26
DOI:
10.18653/v1/2022.naacl-main.26
Cite (ACL):
Saiteja Kosgi, Sarath Sivaprasad, Niranjan Pedanekar, Anil Nelakanti, and Vineet Gandhi. 2022. Empathic Machines: Using Intermediate Features as Levers to Emulate Emotions in Text-To-Speech Systems. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 336–347, Seattle, United States. Association for Computational Linguistics.
Cite (Informal):
Empathic Machines: Using Intermediate Features as Levers to Emulate Emotions in Text-To-Speech Systems (Kosgi et al., NAACL 2022)
PDF:
https://aclanthology.org/2022.naacl-main.26.pdf
Video:
https://aclanthology.org/2022.naacl-main.26.mp4