Susan W. McRoy

Also published as: Susan McRoy

2019

Using Gestures to Resolve Lexical Ambiguity in Storytelling with Humanoid Robots
Catelyn Scholl | Susan McRoy
Dialogue & Discourse, Volume 10

Gestures that co-occur with speech are a fundamental component of communication. Prior research with children suggests that gestures may help them to resolve certain forms of lexical ambiguity, including homophones. To test this idea in the context of human-robot interaction, the effects of iconic and deictic gestures on the understanding of homophones were assessed in an experiment in which a humanoid robot told a short story containing pairs of homophones to small groups of young participants, accompanied by either expressive gestures or no gestures. Both groups of subjects completed a pretest and a post-test to measure their ability to discriminate between pairs of homophones, and we calculated aggregated precision. The results show that the use of iconic and deictic gestures aids in the general understanding of homophones, providing additional evidence for the importance of gesture to the development of children’s language and communication skills.

When processing human-to-human communication, people use every available cue to understand it, including speech and information such as the time and location of an interlocutor's gesture and gaze. Speech and gesture are known to exhibit a synchronous relationship in human communication; however, the precise nature of that relationship requires further investigation. The construction of computer models of multimodal human communication would be enabled by the availability of multimodal communication corpora annotated with synchronized gesture and speech features. To investigate the temporal relationships of these knowledge sources, we have collected and are annotating several multimodal corpora with time-aligned features. Forced alignment between a speech file and its transcription is a crucial part of multimodal corpus production. This paper investigates a number of factors that may contribute to highly accurate forced alignments and so support the rapid production of these multimodal corpora: the acoustic model, the match between the speech used for training the system and the speech to be force-aligned, the amount of data used to train the ASR system, the availability of speaker adaptation, and the duration of alignment segments.