FurChat: An Embodied Conversational Agent using LLMs, Combining Open and Closed-Domain Dialogue with Facial Expressions

We demonstrate an embodied conversational agent that can function as a receptionist and generate a mixture of open and closed-domain dialogue along with facial expressions, by using a large language model (LLM) to develop an engaging conversation. We deployed the system onto a Furhat robot, which is highly expressive and capable of using both verbal and nonverbal cues during interaction. The system was designed specifically for the National Robotarium to interact with visitors through natural conversations, providing them with information about the facilities, research, news, upcoming events, etc. The system utilises the state-of-the-art GPT-3.5 model to generate such information along with domain-general conversations and facial expressions based on prompt engineering.


Introduction
The progress in robotics and artificial intelligence in recent decades has led to the emergence of robots being utilized beyond their conventional industrial applications. Robot receptionists are designed to interact with and assist visitors in places such as offices and hotels by providing information about the location, services, and facilities. The appropriate use of verbal and non-verbal cues is very important for the robot's interaction with humans (Mavridis, 2015). Most research in the field has focused on developing domain-specific conversation systems, with little exploration of open-domain dialogue for social robots.
Conventional agents are often rule-based: they rely on pre-programmed commands and keywords. This limits the interaction, leaving humans little or no freedom of choice in their answers (Tudor Car et al., 2020). The advancement of large language models (LLMs) in the past year has brought an exciting revolution in the field of natural language processing. With the development of models like GPT-3.5, we have seen unprecedented progress in tasks such as question answering and text summarization (Brown et al., 2020). However, a question remains about how to successfully leverage the capabilities of LLMs to create systems that can go from closed domain to open, while also considering the embodiment of the system.
In this work, we present FurChat, an embodied conversational agent that utilises the latest advances in LLMs to create a more natural conversational experience. The system seamlessly combines open and closed-domain dialogues with emotive facial expressions, resulting in an engaging and personalised interaction for users. The system was initially designed and developed to serve as a receptionist for the National Robotarium, in continuation of the multi-party interactive model developed by Moujahid et al. (2022b), and its deployment shows promise in other areas due to the LLM's versatile capabilities. As a result, the system is not limited to the designated receptionist role, but can also engage in open-domain conversations, thereby enhancing its potential as a multifunctional conversational agent. We demonstrate the proposed conversational system on a Furhat robot (Al Moubayed et al., 2013), which is developed by the Swedish firm Furhat Robotics. With FurChat, we demonstrate the potential of LLMs for creating more natural and intuitive conversations with robots.

Furhat Robot
Furhat is a social robot created by Furhat Robotics. To interact with humans naturally and intuitively, the robot employs advanced conversational AI and expressive facial expressions. An animated face is projected onto a three-dimensional mask that mimics a human face using a microprojector (Al Moubayed et al., 2013). A motorised platform supports the robot's neck and head, allowing the head to turn and nod. To identify and react to human speech, it has a microphone array and speakers. Due to the human-like appearance of Furhat, it is prone to the uncanny valley effect (Ågren and Silvervarg, 2022).

System Architecture
As shown in Figure 2, the system architecture represents a conversational system that enables users to interact with a robot through spoken language. The system involves multiple components, including automatic speech recognition (ASR) for converting user speech to text, natural language understanding (NLU) for processing and interpreting the text, a dialogue manager (DM) for managing the interaction flow, and natural language generation (NLG) powered by GPT-3.5 for generating natural-sounding responses (Ross et al., 2023). The generated text is then converted back to speech using text-to-speech (TTS) technology and played through the robot's speaker to complete the interaction loop. The system relies on a database to retrieve relevant data based on the user's intent.

Speech Recognition
The current system uses the Google Cloud Speech-to-Text module for ASR. This module, which transcribes spoken words into text using machine learning algorithms, is integrated into the system by default through the Furhat SDK.

Dialogue Management
Dialogue Management consists of three submodules: NLU, DM and a database storage. The NLU component analyses the incoming text from the ASR module and, through machine learning techniques, breaks it down into a structured set of definitions (Otter et al., 2021). FurhatOS provides an NLU model that classifies the text into intents based on a confidence score. We provide multiple custom intents for identifying closed-domain intents using Furhat's NLU capabilities.
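To illustrate confidence-based intent selection, the following is a minimal sketch in Python rather than the Furhat SDK itself. It assumes the NLU step yields a score per custom intent; the function name `select_intent`, the fallback label `open_domain`, and the threshold of 0.6 are hypothetical and do not come from the system described here.

```python
from typing import Dict

# Hypothetical label for utterances that match no closed-domain intent.
OPEN_DOMAIN = "open_domain"


def select_intent(scores: Dict[str, float], threshold: float = 0.6) -> str:
    """Pick the highest-scoring intent, or fall back to open-domain dialogue.

    `scores` maps intent names to confidence values in [0, 1].
    The threshold value is an illustrative assumption.
    """
    if not scores:
        return OPEN_DOMAIN
    best_intent = max(scores, key=scores.get)
    if scores[best_intent] >= threshold:
        return best_intent
    return OPEN_DOMAIN
```

With this design, any utterance whose best closed-domain score falls below the threshold is treated as open-domain and can be passed to the LLM without database content.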
The in-built dialogue manager in the Furhat SDK is responsible for maintaining the flow of conversation and managing the dialogue state based on the intents identified by the NLU component. This module is responsible for sending the appropriate prompt to the LLM, receiving a candidate response from the model, and subsequently processing the response to add the desired facial gestures (see §3.4).
An open challenge faced by present-day LLMs is the hallucination of nonfactual content, which potentially undermines user trust and raises safety concerns. While we cannot fully eliminate hallucinated content in the generated responses, we reduce this effect by creating a custom database following suggestions from Kumar (2023). We do so by manually web-scraping the website of the National Robotarium. The database consists of a dictionary of items with the intents as keys and scraped data as values. When an appropriate intent is triggered, the dialogue manager accesses the database to retrieve the scraped data, which is then sent with the prompt (further details in §3.3) to elicit a response from the LLM.
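The intent-keyed dictionary described above could look roughly like the following sketch. The intent names and the placeholder values are hypothetical; in the real system the values would be text scraped from the National Robotarium website.

```python
from typing import Dict, Optional

# Hypothetical entries: keys are intent names, values stand in for scraped text.
DATABASE: Dict[str, str] = {
    "facilities": "<scraped text about the facilities>",
    "events": "<scraped text about upcoming events>",
}


def retrieve_context(intent: str, database: Dict[str, str] = DATABASE) -> Optional[str]:
    """Return the scraped data for a triggered intent, or None if there is none."""
    return database.get(intent)
```

Returning `None` for an unknown intent lets the caller decide to send the prompt without grounding data, which corresponds to the open-domain case.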

Prompt engineering for NLG
The NLG module is responsible for generating a response based on the request from the dialogue manager. Prompt engineering is used to elicit an appropriate-sounding response from the LLM, which generates natural dialogue that results in engaging conversations with humans. The current system uses text-davinci-003, one of the most powerful models in the GPT-3.5 series, which is priced at $0.0200 per 1000 tokens.
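As a rough worked example of the quoted price, the sketch below estimates the cost of a single request. It assumes the rate of $0.0200 per 1000 tokens applies to prompt and completion tokens alike; the function name and the token counts in the usage note are illustrative only.

```python
# Price quoted in the text for text-davinci-003.
PRICE_PER_1K_TOKENS_USD = 0.02


def estimate_cost_usd(prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the cost of one request in US dollars.

    Assumes prompt and completion tokens are billed at the same rate.
    """
    total_tokens = prompt_tokens + completion_tokens
    return total_tokens / 1000 * PRICE_PER_1K_TOKENS_USD
```

For instance, a 400-token prompt with a 100-token reply totals 500 tokens, which comes to about $0.01 under these assumptions.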
Relevant responses were obtained by combining few-shot learning with prompt engineering, which allowed us to try different prompt variations and produce a variety of outputs from the LLM.
During prompt engineering, the personality of the robot and the context of the application are described, along with the recent dialogue history and scraped data from the database, in a particular response format. Moreover, the prompt engineering methodology involves using the LLM to generate an appropriate emoticon based on the conversation. In the context of emotional expression during an interaction, selecting an appropriate emoticon depends on understanding the underlying emotions being conveyed by the visitors and adhering to the display rules of the specific social situation. If the dialogue reflects joy or humor, a happy facial gesture might be fitting. On the other hand, if the conversation conveys empathy or sadness, a sad face could be more suitable. These emoticons are then integrated with the robot's facial gestures to generate facial expressions (see §3.4), thereby enabling a text-based LLM to be integrated into the embodied Furhat robot. The explicit specification of the personality and context in the prompt aids in creating a natural conversation between the robot and the human that is coherent and relevant to the topic. The sample format of the prompt used is as follows: This is a conversation with a robot receptionist, <Robot Personality>, <Data from the Database>, <Dialogue history>, <Response Format along with sample emoticons>.
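The prompt format above can be assembled along the following lines. This is only a sketch: the function name `build_prompt`, the newline-separated layout, the speaker labels, and the default of keeping the last three turns are assumptions, since the exact wording and history length used in the system are not given.

```python
from typing import List, Tuple


def build_prompt(
    personality: str,
    data: str,
    history: List[Tuple[str, str]],
    response_format: str,
    max_turns: int = 3,
) -> str:
    """Assemble a prompt from the slots named in the sample format.

    `history` is a list of (speaker, utterance) pairs; only the most
    recent `max_turns` pairs are kept (an assumed cut-off).
    """
    recent = history[-max_turns:] if max_turns > 0 else []
    history_text = "\n".join(f"{speaker}: {utterance}" for speaker, utterance in recent)
    parts = [
        "This is a conversation with a robot receptionist.",
        personality,
        data,  # may be empty when no closed-domain intent was triggered
        history_text,
        response_format,
    ]
    # Skip empty slots so the prompt has no blank lines.
    return "\n".join(part for part in parts if part)
```

Truncating the history keeps the prompt within the model's context window and bounds the per-request token count.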

Gesture Parsing
The Furhat SDK offers a range of built-in facial gestures that can be extended with custom facial gestures to meet specific needs. The latest GPT models have the ability to recognise emotions and sentiments from text, which is used in the system (Leung et al., 2023). Rather than simply recognising sentiments in the text, the model is tasked with generating appropriate emotions for the conversation from the text. After receiving the response from the model, the matching conditional clause in the dialogue manager triggers an expression from the pre-developed set of gestures, which is performed along with the generated speech.
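The mapping from a generated emoticon to a gesture can be sketched as below. The emoticon set, the gesture labels, and the function name `parse_response` are hypothetical; the actual system triggers gestures through conditional clauses in the Furhat dialogue manager, which this Python sketch only approximates.

```python
from typing import Optional, Tuple

# Hypothetical mapping from emoticons in the LLM output to gesture labels.
EMOTICON_TO_GESTURE = {
    ":)": "smile",
    ":(": "sad",
    ":O": "surprise",
}


def parse_response(response: str) -> Tuple[str, Optional[str]]:
    """Split an LLM response into speakable text and a gesture label.

    All known emoticons are stripped from the text. If several are
    present, the first match in table order determines the gesture.
    """
    gesture = None
    text = response
    for emoticon, name in EMOTICON_TO_GESTURE.items():
        if emoticon in text:
            if gesture is None:
                gesture = name
            text = text.replace(emoticon, "")
    # Collapse whitespace left behind by the removed emoticons.
    return " ".join(text.split()), gesture
```

Stripping the emoticon before speech synthesis prevents the TTS engine from reading it aloud, while the returned label can drive the facial expression in parallel with the speech.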

Text-To-Speech Generation
For converting the text to speech, the Amazon Polly service is used. This service is available within FurhatOS by default.

Table 1: Sample conversation between the user and the robot. For a full system description, please refer to §3.

Conclusions and Future Work
We demonstrate FurChat, an embodied conversational agent with open and closed-domain dialogue generation and facial expressions generated through LLMs, on a social robot in a receptionist environment. The system is developed by integrating the state-of-the-art GPT-3.5 model on top of the Furhat SDK. The proposed system currently supports one-to-one interaction with visitors. We plan on extending the system to handle multi-party interaction (Moujahid et al., 2022a; Addlesee et al., 2023; Lemon, 2022; Gunson et al., 2022), which is an active research topic in developing receptionist robots. It is also crucial to address the issue of hallucination from the large language model. This problem can be mitigated by fine-tuning the language model and generating conversations directly from it without relying on any NLU components, which we plan to implement in the future.
We plan to showcase the system on the Furhat robot during the SIGDIAL conference to all the attendees and show them the capabilities of using LLMs for dialogue and facial expression generation as described in this paper.

Figure 1: A user interacting with the FurChat System.

Figure 2: System Architecture of the current FurChat system.
At the outset, the robot remains in an idle state. Once the user enters the vicinity of the robot, the conversation begins.
R: [robot] Hello, I am the Receptionist here at the National Robotarium. Would you like to know about this facility? <Robot smiles>
U: [user] Yes, tell me about this facility.