History-Aware Hierarchical Transformer for Multi-session Open-domain Dialogue System

With the evolution of pre-trained language models, current open-domain dialogue systems have achieved great progress in conducting one-session conversations. In contrast, Multi-Session Conversation (MSC), which consists of multiple sessions over a long term with the same user, is under-investigated. In this paper, we propose the History-Aware Hierarchical Transformer (HAHT) for multi-session open-domain dialogue. HAHT maintains a long-term memory of history conversations and utilizes history information to understand the current conversation context and generate well-informed and context-relevant responses. Specifically, HAHT first encodes history conversation sessions hierarchically into a history memory. Then, HAHT leverages the historical information to facilitate understanding of the current conversation context by encoding the history memory together with the current context using attention-based mechanisms. Finally, to explicitly utilize historical information, HAHT uses a history-aware response generator that switches between a generic vocabulary and a history-aware vocabulary. Experimental results on a large-scale MSC dataset suggest that the proposed HAHT model consistently outperforms baseline models. Human evaluation results further show that HAHT generates more human-like, context-relevant, and history-relevant responses than baseline models.


Introduction
Open-domain dialogue systems, also known as chatbots, are designed to converse with and engage users on any topic, with the aim of establishing, maintaining, and strengthening long-term relationships (Clark et al., 2019; Roller et al., 2020). Recently, open-domain dialogue systems built on large-scale generative pre-trained models (Adiwardana et al., 2020; Roller et al., 2021; Zhang et al., 2020) have substantially improved the performance of chatbots. However, most existing chatbots are designed to interact with users in a single conversation session. When the current session ends, the chatbot forgets its contents and will commence a new, independent session with the same user next time. When previously discussed topics reemerge, such chatbots often appear ignorant and fail to reengage users appropriately. This apparent forgetfulness limits the chatbots' ability to establish and maintain long-term relationships with users.
We argue that, to better engage users in multi-session conversations (MSCs), a chatbot should maintain a long-term memory of historical contexts, which allows it to reengage the user appropriately when similar contexts reemerge. By learning from historical conversations, the chatbot should gradually refine its understanding of, and deepen its relationship with, the user. Figure 1 shows an example of a two-session conversation between a user and a chatbot. In the second session, the chatbot infers that Sonny is a cat and generates its response based on the history information that Sonny likes watching TV with the user.
History-aware chatbots will be able to generate more well-informed and context-relevant responses, which can help elicit long-term commitments and develop emotional attachments from users to sustain close relationships over time. To this end, we propose the History-Aware Hierarchical Transformer (HAHT) for multi-session open-domain dialogue systems, which can effectively leverage history conversations to conduct more engaging MSCs. HAHT maintains a long-term memory to store historical conversational contexts, which is updated whenever a new session is conducted. Based on the long-term memory and the context of the current session, relevant tokens in historical contexts are selected to adapt the current response.
Specifically, as the number of tokens in a conversation utterance and the number of turns in a conversation are usually not very large, we first encode the history conversations hierarchically into the history memory using the Transformer (Vaswani et al., 2017). The history memory serves as a high-level representation of the history conversations. Second, as history conversations can usually facilitate the understanding of the current conversation context, we design a history-aware context encoder. The context encoder encodes the conversation context, considering both history conversations and the current conversation, by adopting Transformer attention over the history memory and the current conversation context. The context encoder also updates the history memory based on the current conversation context. Finally, we design a history-aware decoder to fuse the learned history information into the response generation process. The history-aware decoder can switch between two strategies, i.e., generating a word from the generic vocabulary or directly copying a word from the history conversations.
Experimental results on the large-scale Facebook MSC dataset show that the proposed HAHT model outperforms previous multi-session open-domain dialogue systems on various evaluation metrics. Human evaluation results support that HAHT generates more readable, context-relevant, and history-relevant responses than baseline models. In addition, the ablation study confirms that both the hierarchical encoding of history conversations and the history-aware decoder contribute greatly to HAHT's performance on MSCs and help it leverage historical information more effectively.
Related Work
Open-domain dialogue systems aim to perform chit-chat without task and domain restrictions (Ritter et al., 2011) and to establish long-term relationships with users (Clark et al., 2019; Roller et al., 2020). They are generally divided into two groups: generation-based systems and retrieval-based systems. Retrieval-based systems seek to find a suitable response from a large set of response candidates (Zhou et al., 2016; Yuan et al., 2019; Zhong et al., 2020; Zhu et al., 2021; Qian et al., 2021), whereas generation-based systems focus on generating responses from scratch based on the dialogue history (Serban et al., 2016; Shum et al., 2018; Adiwardana et al., 2020; Roller et al., 2020; Xu et al., 2022). In this paper, we focus on generation-based systems.
Despite the advancements in the field, current state-of-the-art generative pre-trained models are designed for and trained on large datasets of single-session conversations with a small number of turns. As a result, most existing models employ short token truncation lengths, such as 128 tokens for Meena (Adiwardana et al., 2020), and are unable to effectively encode and utilize historical contexts in MSCs. In addition, there is a lack of public MSC datasets. Xu et al. (2022) released the first multi-session conversation dataset, i.e., Facebook MULTI-SESSION CHAT (Facebook MSC), and explored different retrieval-augmented generative models (Lewis et al., 2020; Shuster et al., 2021) on the dataset, which achieved better results than the standard Transformer (Vaswani et al., 2017). However, their experimental results demonstrate that these methods need to retrieve a very large portion of the history conversations to outperform the standard Transformer. In addition, these models still need to concatenate the retrieved raw history conversation text with the current conversation context, yielding concatenations that are much longer than the 128-token truncation length. Therefore, the incorporation of historical contexts in these methods is still limited by the short token truncation lengths of pre-trained models.

The Proposed Method
In general, a Multi-Session Conversation (MSC) consists of a current conversation session and several history conversation sessions that happen before the current one, all between the same two interlocutors. A multi-session open-domain dialogue system aims to generate natural, well-informed, and context-relevant responses to the user's utterances based on all history conversation sessions and the current conversation context.
Formally, we denote the MSC dataset D as a list of N conversations in the format (H, X, y). Here, H = {H^1, ..., H^M} denotes the M history conversation sessions, where each H^i consists of the chronologically ordered utterances of the i-th history conversation session, and X = (x_1, ..., x_{n_x}) denotes the utterances of the current conversation context. y is the ground-truth response to X under the background of H. The MSC task can be formulated as learning a function f(H, X) to predict the next utterance x_{n_x+1} based on H and X. In this work, we propose a novel model, namely HAHT, for the MSC task. Figure 2 shows the overall structure of HAHT, which consists of three main components: 1) a hierarchical history conversation encoder, 2) a history-aware context encoder, and 3) a history-aware response generator. We present the details of each component as follows.
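For concreteness, one training example under this formulation can be sketched as a simple structure (the field names below are illustrative and are not the dataset's actual schema):

```python
# A minimal sketch of one MSC training example (H, X, y).
# Field names are illustrative; the real Facebook MSC format differs.
example = {
    # H: list of history sessions, each a list of (speaker, utterance) turns
    "history_sessions": [
        [("User", "I love reading outdoors."),
         ("Assistant", "What are you reading now?")],
    ],
    # X: chronologically ordered utterances of the current session
    "context": [("User", "I finished the book you recommended!")],
    # y: the ground-truth next response the model should generate
    "response": "That's great! Did you enjoy the ending?",
}

def num_history_sessions(ex):
    """M, the number of history conversation sessions in the example."""
    return len(ex["history_sessions"])
```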

Hierarchical History Conversation Encoder
The main challenge in encoding history conversation sessions is the limited maximum input length imposed by pre-trained dialogue systems. If all history conversations are simply concatenated and fed into the pre-trained dialogue system, the concatenation will exceed the maximum input length, and most of the input will be truncated. To preserve more information in the history conversations, we encode each history conversation session separately in a hierarchical fashion. Specifically, for a history conversation session H^i, we first prepend a special token "User:" or "Assistant:" to each utterance h^i_j in H^i, depending on the role of the speaker, and then pad all utterances to the same length l_utter. For each utterance h^i_j, we apply an embedding layer E_m, n_enc Transformer encoder layers, and a max-pooling layer to obtain its dense representation u^i_j ∈ R^d. We denote all the utterance representations in the history conversation session H^i as U^i ∈ R^{n_i×d}, where n_i is the turn number of H^i. Next, we apply a conversation aggregator F_c to aggregate all utterance representations U^i into the condensed history memory c_i. The conversation aggregator is developed based on the self-attentive mechanism (Lin et al., 2017): it computes an importance vector α ∈ R^{n_i} over the utterances of H^i with learnable parameters W_q and W_k, and takes the α-weighted sum of U^i as c_i. After applying the previous steps to all history conversations H, we finally obtain a history memory matrix C ∈ R^{M×d} containing one history memory for each history conversation session, where M is the number of history conversation sessions.
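As a rough illustration, the self-attentive aggregation step can be sketched in NumPy as follows; the parameter shapes and the inner dimension d_a are assumptions for illustration, not the paper's exact configuration:

```python
import numpy as np

def self_attentive_aggregate(U, W_k, w_q):
    """Condense utterance representations U (n_i x d) into one history
    memory vector c_i of size d, following the self-attentive mechanism
    of Lin et al. (2017). Assumed shapes: W_k is (d_a x d), w_q is (d_a,)."""
    scores = w_q @ np.tanh(W_k @ U.T)      # (n_i,) unnormalized importance
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()            # softmax -> importance vector alpha
    return alpha @ U                       # alpha-weighted sum over utterances

rng = np.random.default_rng(0)
n_i, d, d_a = 4, 8, 6                      # toy sizes
U = rng.normal(size=(n_i, d))
c_i = self_attentive_aggregate(U, rng.normal(size=(d_a, d)),
                               rng.normal(size=(d_a,)))
```

Each session yields one such vector c_i; stacking them gives the history memory matrix C of shape (M, d).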

History-aware Context Encoder
History conversation sessions usually contain the background stories (e.g., the interlocutors' profiles or previous discussions between them) that give rise to the current conversation session. Leveraging the history conversations helps the model better understand the current conversation context and respond properly. On the other hand, the current conversation context can help the model update the history memories. Thus, we encode the history memory C together with the current conversation context by adopting Transformer attention between them.
For the current conversation context X, we also prepend a special token "User:" or "Assistant:" to each utterance depending on the role of the speaker and concatenate all utterances into a single sequence. Then, we adopt the embedding layer E_m to embed the concatenated context and feed the context embeddings, together with the history memory C, into the Transformer encoder layers. By employing attention in the Transformer encoder layers, our model can understand the conversation context by attending to all context token embeddings and history conversation memories. We denote this history-aware context encoding by S_c ∈ R^{n_x×d}. After context encoding, the history conversation memories are updated based on the latest information from the current conversation context. We denote this context-updated history memory as C_s ∈ R^{M×d}. The concatenation of C_s and S_c over the first dimension becomes the input of the response generator.
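The joint encoding of history memory and context can be illustrated with a single parameter-free self-attention pass; this is a simplified sketch of the idea, not the model's actual multi-layer, multi-head encoder:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def history_aware_encode(C, X_emb):
    """One simplified self-attention pass over the concatenation of the
    history memory C (M x d) and context token embeddings X_emb (n_x x d).
    Every position attends to both history memories and context tokens,
    so the updated first M rows act as the context-updated memory C_s and
    the remaining rows as the history-aware context encoding S_c."""
    Z = np.concatenate([C, X_emb], axis=0)   # (M + n_x, d)
    d = Z.shape[1]
    A = softmax(Z @ Z.T / np.sqrt(d))        # attention over all positions
    Z_out = A @ Z
    M = C.shape[0]
    return Z_out[:M], Z_out[M:]              # C_s, S_c

rng = np.random.default_rng(0)
C_s, S_c = history_aware_encode(rng.normal(size=(3, 8)),   # M = 3
                                rng.normal(size=(5, 8)))   # n_x = 5
```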

History-aware Response Generator
Inspired by CopyNet (Gu et al., 2016), we construct two vocabularies, i.e., a generic vocabulary V_g and a history-aware vocabulary V_h, to better generate history-aware responses. The generic vocabulary V_g contains the words that appear in the training dataset, and the history-aware vocabulary V_h only contains the words that appear in the history conversations H. To generate a word of the response, the response generator chooses to either generate a generic word from V_g or directly copy a word from V_h, based on the switching mechanism (Gulcehre et al., 2016). Specifically, at each decoding time step t, we feed C_s, S_c, and the ground-truth word sequence before t into n_dec Transformer decoder layers and obtain a hidden representation vector o_t ∈ R^d. The probability distribution over the generic vocabulary V_g at decoding time step t is then computed as P_{V_g} = softmax(FC_1(o_t)), where FC_1 is a fully connected layer.
To calculate the probability distribution over the history-aware vocabulary V_h, we adopt a max-pooling layer over the context-updated history memory C_s, a fully connected layer, and a softmax function, i.e., P_{V_h} = softmax(FC_2(MaxPool(C_s))), where FC_2 is a fully connected layer.
The final word probability distribution at time step t is computed using a switching mechanism between P_{V_g} and P_{V_h}, i.e., P = α_{V_g} P_{V_g} + α_{V_h} P_{V_h}, where α_{V_g} and α_{V_h} are the switching probabilities of generating from the generic vocabulary and copying from the history conversations, respectively. They are calculated by applying a fully connected layer FC_3 and a softmax function, where [;] denotes a concatenation operation over the last dimension used to form the input of FC_3.
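A minimal sketch of the switching mechanism is shown below. The exact inputs to FC_3 (here, the decoder state concatenated with the max-pooled history memory) are our assumption, and P_{V_h} is assumed to already be mapped into the same word-id space as P_{V_g}:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def final_distribution(P_vg, P_vh, o_t, c_pool, W3):
    """Mix the generic-vocabulary distribution P_vg and the history-aware
    distribution P_vh with switching probabilities. Assumption: FC_3 is a
    linear layer W3 over [o_t; c_pool] that outputs two logits, one per
    strategy (generate vs. copy)."""
    logits = W3 @ np.concatenate([o_t, c_pool])   # (2,)
    alpha_vg, alpha_vh = softmax(logits)          # switching probabilities
    return alpha_vg * P_vg + alpha_vh * P_vh      # still sums to 1

rng = np.random.default_rng(0)
P_vg = softmax(rng.normal(size=10))   # toy generic-vocab distribution
P_vh = softmax(rng.normal(size=10))   # toy history-vocab distribution (aligned ids)
p = final_distribution(P_vg, P_vh, rng.normal(size=8), rng.normal(size=8),
                       rng.normal(size=(2, 16)))
```

Because the switching weights sum to one and each component is itself a distribution, the mixture p is guaranteed to be a valid probability distribution.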

Model Training
We train the model to maximize the generation probability of the target response given the current conversation context and history conversations, in an end-to-end manner. The loss function of HAHT is defined as the negative log-likelihood of the ground-truth response, L = −∑_{t=1}^{n_y} log P(y_t | y_{<t}, X, H), where X denotes the current conversation context, H denotes all history conversations, y_{<t} denotes the tokens before time step t, and n_y denotes the length of the ground-truth response.
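The loss reduces to standard token-level negative log-likelihood, which can be sketched as:

```python
import numpy as np

def nll_loss(step_probs, target_ids):
    """Negative log-likelihood of the ground-truth response: the sum over
    decoding steps t of -log P(y_t | y_<t, X, H). step_probs[t] is the
    model's full-vocabulary distribution at step t (toy stand-ins here)."""
    return -sum(np.log(step_probs[t][y]) for t, y in enumerate(target_ids))

# Toy example: a 2-step response with made-up per-step distributions.
loss = nll_loss([np.array([0.5, 0.5]), np.array([0.9, 0.1])], [0, 0])
```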

Experimental Settings
In this section, we introduce the experimental dataset, evaluation metrics, baseline methods, and model settings.

Experimental Dataset
The experiments are performed on a large dataset, i.e., Facebook MULTI-SESSION CHAT (Facebook MSC) (Xu et al., 2022). It is a crowdsourced dataset consisting of multi-session conversations, where the interlocutors learn about each other's interests and discuss the things they have learned from past sessions. The number of history conversations in Facebook MSC varies from 1 to 4. Session number i indicates there are i-1 history conversations happening before the last conversation session. The statistics of the Facebook MSC dataset are summarized in Table 1. As session 1 does not have history conversations, we evaluate our model on sessions 2-5.

Evaluation Metrics
We conduct both automatic and human evaluations to demonstrate the effectiveness of the proposed model. For automatic evaluations, we leverage BLEU-2, BLEU-3 (Papineni et al., 2002), and ROUGE-L (Lin and Och, 2004) to measure the word overlap between the generated responses and the ground-truth text. Moreover, we randomly sample 50 MSCs from the test set to conduct human evaluations. We present all the history conversation sessions, the current conversation context, and the generated responses to three well-educated annotators. The annotators evaluate the quality of the generated responses from the following three aspects:
• Readability: measures whether the generated responses are natural and fluent.
• Context Relevancy: measures whether the generated responses are correlated with the current conversation context.
• History Relevancy: measures whether the generated responses are correlated with history conversations. Only responses that are consistent with history conversations are considered relevant to history.
Each aspect is rated on four levels (0/1/2/3), and the final score of each aspect is the average of the scores given by all annotators. For all evaluation metrics, a higher value indicates better performance. We measure inter-annotator reliability with Fleiss' Kappa (Fleiss and Cohen, 1973). Our annotations obtain "good agreement" for Readability (0.614) and "moderate agreement" for Context Relevancy (0.526) and History Relevancy (0.573).
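For reference, Fleiss' Kappa over an (items × categories) count matrix can be computed as follows (a generic sketch; our annotation data itself is not reproduced here):

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' Kappa for inter-annotator agreement. `counts` is an
    (items x categories) matrix where counts[i, j] is how many of the n
    annotators assigned item i to rating category j (e.g., levels 0-3)."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum(axis=1)[0]                    # raters per item (constant)
    P_i = ((counts ** 2).sum(axis=1) - n) / (n * (n - 1))
    P_bar = P_i.mean()                           # observed agreement
    p_j = counts.sum(axis=0) / counts.sum()      # category proportions
    P_e = (p_j ** 2).sum()                       # chance agreement
    return (P_bar - P_e) / (1 - P_e)

# Toy example: 3 items, 4 rating levels, 3 annotators each.
kappa = fleiss_kappa([[3, 0, 0, 0],
                      [0, 3, 0, 0],
                      [2, 1, 0, 0]])
```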

Baseline Methods
We compare the proposed HAHT model with the following baseline methods.
• BlenderBot (Roller et al., 2021): This is a large-scale open-domain dialogue model pre-trained on dialogue data scraped from social discussions on the web.
• BlenderBot msc : This is the BlenderBot model finetuned on the MSC dataset.
• FID-RAG (Shuster et al., 2021): In this method, a RAG-trained retriever (Lewis et al., 2020) is used to retrieve the top-N history conversations, and Fusion-in-Decoder (FiD) (Izacard and Grave, 2021) is adopted to generate the final response considering the retrieved history conversations and the current conversation. Following Xu et al. (2022), N is empirically set to 5.

Model Settings
In this work, all the evaluated methods are trained following the same settings. Due to the limitation of computation resources, we use the BlenderBot model with 90M parameters as the initial pre-trained model and finetune it on the Facebook MSC dataset. The input truncation length is set to 256 tokens. The numbers of Transformer encoder layers n_enc and decoder layers n_dec are both set to 12. For model training, we use the Adamax optimizer (Kingma and Ba, 2014) with a learning rate of 1 × 10^-6, a batch size of 16, a dropout ratio of 0.1, and an early-stopping patience of 10. All the finetuned models are trained on at most two 32GB NVIDIA V100 GPUs.

Experimental Results
This section presents the experimental results of the automatic evaluation, human evaluation, evaluation on session openings, ablation study, and case study.

Automatic Evaluation
The automatic evaluation results of different models are shown in Table 2. It can be observed that BlenderBot_msc performs much better after being finetuned on the MSC dataset. FID-RAG performs better than BlenderBot_msc. The potential reason is that RAG can retrieve important history conversations, and FiD can combine the retrieved conversations with the current conversation to generate better responses. Moreover, the proposed HAHT model consistently outperforms the baseline methods in terms of all the evaluation metrics. This indicates that HAHT can better encode the history conversations, leverage them to understand the current conversation context, and generate more human-like responses.

Table 4: Automatic evaluation results of different models on session-opening data. Session i indicates there are i-1 history conversation sessions. B-2, B-3, and R-L denote BLEU-2, BLEU-3, and ROUGE-L, respectively. The best results are in boldface.

Human Evaluation
Table 3 summarizes the human evaluation results on the Facebook MSC dataset. Generally, HAHT outperforms all the baseline methods from all perspectives. This observation is consistent with the automatic evaluation results shown in Table 2. In particular, we find that HAHT performs much better than the baselines in terms of history relevancy. This demonstrates that HAHT can better leverage the history conversation sessions and, with the history memory, engage the user more in the ongoing session. HAHT also performs better than the baselines in terms of readability and context relevancy. This indicates that HAHT can better understand the current conversation context with the help of the history memory.

Evaluation on Session Openings
In the MSC task, the session opening is the first conversation turn of the current conversation. According to our observation, and a similar observation in Xu et al. (2022), the opening conversation turn is categorically different from other conversation turns. It typically involves a statement or question that aims to reengage the other speaker based on the known information exchanged in history conversations. Therefore, performance on the session-opening data can further demonstrate a model's capability in understanding and leveraging history conversations.
We compare all models on these opening responses and show the results in Table 4. We observe that the proposed HAHT model achieves the best performance in terms of most metrics. In particular, when there are 4 history conversations, HAHT outperforms FID-RAG and BlenderBot_msc by 10.6% and 9.1%, respectively, in terms of BLEU-3. This indicates that the proposed HAHT can better leverage conversation history to reengage the user in a new conversation session.

Ablation Study
To better understand the effectiveness of each main component of HAHT, we conduct an ablation study. Specifically, we consider the following variants of HAHT.
• HAHT w/o HIER: In this variant, we do not encode the history conversations hierarchically. Instead, we concatenate all the utterances of the history conversations into one long sequence and directly encode it using the Transformer encoder.
• HAHT w/o HIST: In this variant, we remove the history encoder from HAHT.
• HAHT w/o SW: In this variant, we remove the switching mechanism from the response generator of HAHT.
Table 5 summarizes the results achieved by the different HAHT variants in terms of BLEU-2, BLEU-3, and ROUGE-L. We note that HAHT outperforms HAHT w/o HIER, which indicates that hierarchically encoding the history conversations helps the model preserve more history information and generate more human-like responses. Moreover, HAHT achieves better performance than HAHT w/o HIST; in fact, removing the history encoder causes the largest decline in all metrics. This result confirms the necessity of leveraging history conversations to understand the current conversation and generate the response. In addition, the performance degradation caused by removing the switching mechanism shows that directly copying words from the history conversations helps the model generate more history-aware responses.

Case Study
Table 6 shows a case study of the multi-session conversations generated by different models. Compared to the baseline models, the proposed HAHT model can better leverage history conversations to understand the current conversation context and generate more history-aware responses. When the user discusses preparing sandwiches and lemonade with the agent ("I can make sandwiches for us! I also have a very good recipe for homemade lemonade! Do you like lemonade?"), HAHT can remember information mentioned in the history conversations, such as that the user likes reading and outdoor activities and that the agent has adopted a book-lover persona before. HAHT can leverage these historical contexts and generate a more human-like, context-relevant, and history-aware response: "I love lemonade! I'm sure we can find a lot of good recipes for sandwiches too. Sandwiches and lemonade are perfect for going outdoors and reading books."

Conclusion
In this work, we propose the History-Aware Hierarchical Transformer (HAHT) model for multi-session open-domain dialogue systems. The proposed HAHT model maintains a history memory by hierarchically encoding the history conversation sessions. After that, HAHT uses attention-based encoding to encode the current conversation context together with the history memory and updates the history memory with the current context. To explicitly leverage historical information in the responses, HAHT is designed with a history-aware response generator that can switch between a generic vocabulary and a history-aware vocabulary. HAHT performs better at conducting MSCs and generates more human-like, context-relevant, and history-aware responses than state-of-the-art models.

Limitations
One limitation of this work is that HAHT has only been evaluated on one dataset. However, to the best of our knowledge, Facebook MSC is, by far, the only large-scale multi-session conversation dataset available. Nevertheless, our proposed model consistently outperforms baseline models on conversations with different numbers of history sessions in Facebook MSC.
A potential solution to this limitation is to construct more MSC datasets, in the open domain or in specific domains that may benefit from awareness of history conversations, e.g., conversational recommendation or automatic medical assistance.

A.1 Complete Case Study
Table 7 shows the complete content of the case study in Section 5.5. The conversations were conducted by two crowd workers. For training purposes, the utterances from the first crowd worker are denoted as user utterances, whereas those from the second crowd worker are denoted as agent utterances. An MSC model aims to generate responses that are as close as possible to the second crowd worker's utterances.
In history session 1, the user and the agent briefly talk about themselves, such as their hobbies and their parents' jobs. In history session 2, they talk about the books that they are reading and some memories about reading books. In the current conversation, their discussion ranges from the book 1984 to plans for hiking and a picnic. When the user discusses preparing sandwiches and lemonade with the agent ("I can make sandwiches for us! I also have a very good recipe for homemade lemonade! Do you like lemonade?"), HAHT can remember information mentioned in the history conversations, such as that the user likes reading and outdoor activities and that the agent has adopted a book-lover persona before. HAHT can leverage these historical contexts and generate a more human-like, context-relevant, and history-aware response: "I love lemonade! I'm sure we can find a lot of good recipes for sandwiches too. Sandwiches and lemonade are perfect for going outdoors and reading books."

Figure 1: An illustrated example of a two-session conversation between a user and an agent.

Figure 2: The overall structure of the proposed HAHT model, which contains 1) the hierarchical history conversation encoder, 2) the history-aware context encoder, and 3) the history-aware response generator. The details of each component are shown in Figures 3, 4, and 5, respectively.

Figure 3: The structure of the hierarchical history conversation encoder in HAHT.

Table 5: The performance achieved by HAHT and the different HAHT variants. Session i indicates there are i-1 history conversation sessions. B-2, B-3, and R-L denote BLEU-2, BLEU-3, and ROUGE-L, respectively. The best results are in boldface.

Table 6: A case study of an MSC with two history conversations. Only important utterances in the history and current conversations are presented. Complete conversation sessions are provided in Appendix A.1.