Generating Extractive Answers: Gated Recurrent Memory Reader for Conversational Question Answering



Introduction
Recently, large language models (LLMs) like ChatGPT (OpenAI, 2022) and GPT-4 (OpenAI, 2023) have revolutionized the question-answering and conversation domains, pushing them to new heights. Different from ChatGPT-style conversations, the task of conversational question answering (CQA) requires models to answer follow-up questions with answers extracted from given passages, conditioned on the conversation history. It can be regarded as an extension of traditional single-turn machine reading comprehension (MRC) to multi-turn conversations. However, follow-up questions usually exhibit more complicated phenomena, such as co-reference and ellipsis, so it is necessary to consider the historical memory of a conversation. To enable machines to answer such questions, many CQA datasets have been proposed, such as CoQA (Reddy et al., 2018), QuAC (Choi et al., 2018) and QBLink (Elgohary et al., 2018). Generally, a dialogue contains a long passage and some short questions about this passage, and the current question may rely on previous questions or answers. Figure 1 shows an example from the CoQA dataset: only if we know the conversation history can we understand that the third question "What?" means "What is she carrying?". However, most previous approaches treat this task as a traditional single-turn MRC task by superficially concatenating previous questions or answers as conversation history, such as BiDAF++ (Yatskar, 2018), DrQA+PGNet (Reddy et al., 2018) and SDNet (Zhu et al., 2018). They cannot grasp and understand the representation of history profoundly, and multiple historical questions and answers packed into one sequence may confuse the model.
Besides, these methods occupy a lot of storage space or precious, limited graphics memory because of duplicated questions and passages. Although FlowQA (Huang et al., 2018) proposes a flow mechanism that avoids concatenating previous questions, the hidden states of the passage are still duplicated many times to obtain a question-aware passage for each question. Meanwhile, it cannot utilize the dialogue history selectively.

Related Work
In the field of generative conversations, large language models have become the prevailing approach. These models (OpenAI, 2022, 2023; Zhang and Yang, 2023b) are typically pretrained on vast amounts of unsupervised text data and subsequently fine-tuned on supervised instruction data. This supervised instruction data is often obtained through human annotation or distilled from existing large-scale models (Zhang and Yang, 2023a).
In the field of extractive conversations, the answers are typically extracted directly from the original passages by the model. This task is often seen as an extension of single-turn machine reading comprehension to multi-turn conversations. Prior to the era of pretraining, attention mechanisms were commonly employed across the subdomains of question answering: BiDAF (Seo et al., 2017) and Rception (Zhang and Wang, 2020) in classic single-turn machine reading comprehension, BiDAF++ (Yatskar, 2018) in conversational machine reading comprehension, and other methods based on multi-modal or structured-knowledge question answering (Zhang, 2020; Zhang and Yang, 2021b).
Subsequently, pre-trained models such as ELMo (Peters et al., 2018) and the Transformer-based (Vaswani et al., 2017) BERT (Devlin et al., 2018) were employed in a wide range of natural language processing tasks (Zhang and Yang, 2021a; Zhang et al., 2023). In question answering, these pre-trained models are either combined with designed attention structures in an embedding style (Zhu et al., 2018; Zhang, 2019) or directly fine-tuned on concatenated historical dialogues.
However, regardless of the method used, the previous dialogue history has to be re-encoded and interacted with at every turn of the conversation; the previous dialogue states cannot be saved and reused directly.

Task Formulation
In this section, we formulate the CQA task before illustrating our model from encoder to decoder. The task of CQA can be formulated as follows. Suppose we are given a conversation that contains a passage with n tokens, P = {w^P_t}_{t=1}^n, and s question-answering turns Q = {Q_r}_{r=1}^s with answers A = {A_r}_{r=1}^s. Given the passage, the current question Q_r and the conversation history, the model is required to give the corresponding extractive answer A_r. A conversation in the dataset can thus be considered a tuple (P, Q, A).
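Concretely, a single conversation under this formulation could be represented as in the following sketch; the tokens, field names and answers are invented for illustration and are not taken from any dataset schema.

```python
# A toy (P, Q, A) tuple for one conversation (hypothetical fields and values).
conversation = {
    # passage P: n tokens w^P_1 ... w^P_n
    "P": "Jessica went to sit in her favorite rocking chair .".split(),
    # s question-answering turns; turn r may depend on turns before it
    "Q": ["Who went to sit in the chair?", "Where?", "What?"],
    # extractive answers, one per turn
    "A": ["Jessica", "in her favorite rocking chair", "a rocking chair"],
}

s = len(conversation["Q"])  # number of question-answering turns
n = len(conversation["P"])  # number of passage tokens
```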

Gated Recurrent Memory Reader
As shown in Figure 2, we use a generalized seq2seq framework to solve the task of CQA. The traditional seq2seq structure is used for many tasks, such as machine translation and semantic parsing. Given a sentence, the encoder transforms the input into an intermediate representation, which the decoder uses to generate a new sentence word by word.
Similar to the seq2seq framework, our model also consists of two modules: a passage encoder and a question decoder. The passage encoder module generates the hidden representation of the passage, and the question decoder module generates extractive answers turn by turn according to the result of the encoder.
There are also many differences between our model and seq2seq. First, the basic unit of our model is the sentence rather than the word. The intermediate representation generated by the passage encoder contains the hidden states of all tokens in the passage, and the input and output of the decoder are also sentences. Moreover, the parallelism of our model is across conversations rather than across sentences in a batch. Second, the output length of the question decoder depends on the number of question turns in a conversation, so there is no start flag "⟨GO⟩" or end flag "⟨EOS⟩" in the decoder. Third, the output of the decoder, i.e., the answer, is not fed to the input of the next turn, because the hidden states of the answer are kept in the memory of our model.
Our model can be formulated as in Eq. 1. The decoder takes the result of the encoder, f_enc(P), and generates extractive answers according to the questions:

A = f_dec(f_enc(P), Q),   (1)

where P, Q and A here denote the embeddings of the passage, questions and answers, respectively. For one conversation with s question-answering turns, P ∈ R^{1×n×h}, Q ∈ R^{s×m×h} and A ∈ R^{s×m×h}, where h denotes the dimension of the embedding and m the (padded) length of a question or answer.
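The encode-once, decode-turn-by-turn control flow of Eq. 1 can be sketched as follows. This is a deliberately minimal numpy sketch, not the actual model: the encoder is the identity, the question summary is a mean, the answer is a single-token argmax span, and the memory update is a fixed interpolation standing in for the gated memory described later.

```python
import numpy as np

def f_enc(P_emb):
    # stand-in encoder: identity over passage token embeddings
    return P_emb

def f_dec(P_hidden, Q_turns):
    # decode answers turn by turn; a passage memory persists across turns
    memory = P_hidden.copy()
    answers = []
    for Q_r in Q_turns:                         # one iteration per question turn
        q_vec = Q_r.mean(axis=0)                # toy question summary
        scores = memory @ q_vec                 # relevance of each passage token
        start = int(scores.argmax())
        answers.append((start, start))          # toy single-token span
        memory = 0.5 * memory + 0.5 * P_hidden  # toy stand-in for the gated update
    return answers

rng = np.random.default_rng(0)
P = rng.normal(size=(12, 8))                      # n=12 passage tokens, h=8
Q = [rng.normal(size=(5, 8)) for _ in range(3)]   # s=3 turns, m=5 tokens each
spans = f_dec(f_enc(P), Q)                        # Eq. 1: A = f_dec(f_enc(P), Q)
```

Note that, unlike a word-level seq2seq decoder, the loop runs exactly s times and nothing like an ⟨EOS⟩ token is needed.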

Passage Encoder Module
This module aims to encode the words of the passage into a latent semantic representation, which will be used by the question decoder module. First, we obtain the embedding of each passage word, e^GLV_t, from pre-trained GloVe (Pennington et al., 2014). The part-of-speech (POS) and named entity recognition (NER) tags of each word are also transformed into embedding vectors e^POS_t and e^NER_t, respectively, which are learned during training. We then concatenate them into one vector and feed it into a bi-directional recurrent neural network (RNN) to generate the intermediate representation of the passage: e^P_t = BiRNN(e^P_{t-1}, [e^GLV_t; e^POS_t; e^NER_t]).
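The concatenate-then-BiRNN step can be sketched as below. This is a toy numpy version with a plain tanh RNN cell and made-up dimensions; the paper does not specify the RNN variant or sizes, so treat every number and weight here as an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_glv, d_pos, d_ner, h = 6, 8, 4, 4, 5   # toy sizes (assumed)
e_glv = rng.normal(size=(n, d_glv))          # pre-trained word embeddings
e_pos = rng.normal(size=(n, d_pos))          # learned POS-tag embeddings
e_ner = rng.normal(size=(n, d_ner))          # learned NER-tag embeddings
x = np.concatenate([e_glv, e_pos, e_ner], axis=-1)  # [e^GLV; e^POS; e^NER]

def rnn_pass(x, W, U):
    # simple tanh RNN over the token sequence, one direction
    states, hprev = [], np.zeros(U.shape[0])
    for t in range(x.shape[0]):
        hprev = np.tanh(x[t] @ W + hprev @ U)
        states.append(hprev)
    return np.stack(states)

W_f, U_f = rng.normal(size=(x.shape[1], h)) * 0.1, rng.normal(size=(h, h)) * 0.1
W_b, U_b = rng.normal(size=(x.shape[1], h)) * 0.1, rng.normal(size=(h, h)) * 0.1
fwd = rnn_pass(x, W_f, U_f)           # forward direction
bwd = rnn_pass(x[::-1], W_b, U_b)[::-1]  # backward direction, re-reversed
e_P = np.concatenate([fwd, bwd], axis=-1)  # e^P_t for every token, shape (n, 2h)
```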

Question Decoder Module
This module is the core of our model. We take one turn of a conversation as an example to illustrate it; a "turn" in our model is analogous to a "step" in an RNN.
Question Input Layer Suppose the current question is the r-th question. We encode the question embeddings {e^Q_t}_{t=1}^m with a bi-directional RNN. We then obtain a single vector for the question by a weighted sum of the token states, as in Eq. 2. Throughout this section, w_i denotes different trainable weights.
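The weighted-sum pooling of Eq. 2 is a standard attention pooling; a minimal sketch, assuming a single trainable scoring vector w (the paper's exact parametrization is not fully recoverable here):

```python
import numpy as np

def attn_pool(H, w):
    # alpha_t ∝ exp(w · h_t); returns the weighted sum sum_t alpha_t h_t
    scores = H @ w
    scores = scores - scores.max()               # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()
    return alpha @ H, alpha

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 6))   # m=5 question-token hidden states (toy)
w = rng.normal(size=6)        # trainable scoring vector (assumed form)
q_vec, alpha = attn_pool(H, w)  # single question vector
```

The same pooling is reused later to summarize the passage into c^psum_r.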
Gated Memory Layer (Passage) In this layer, we use a gating mechanism to fuse the information of the original passage e^P and the passage memory c^P_{r-1}, inspired by memory networks (Kumar et al., 2016). The memory c^P_{r-1} is carried over from the previous question turn. The r-th history-aware passage c^P_r is obtained as in Eq. 3:

g_r = σ(w_2 [c^P_{r-1}; c^psum_{r-1}]),  c^P_r = g_r ⊙ e^P + (1 - g_r) ⊙ c^P_{r-1},   (3)

where c^psum_{r-1} is also obtained from the previous question turn; both will be explained in the next layer.
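A sketch of this gated blend, in numpy. The gate parametrization below (a sigmoid over the concatenated previous memory and its broadcast summary) is our assumption; only the gated interpolation between e^P and c^P_{r-1} is taken from the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_memory(e_P, c_prev, c_psum_prev, W):
    # gate computed from the previous memory and its summary (assumed form)
    summary = np.broadcast_to(c_psum_prev, c_prev.shape)
    g = sigmoid(np.concatenate([c_prev, summary], axis=-1) @ W)  # per-token gate
    # blend the original passage with the history-bearing memory
    return g * e_P + (1.0 - g) * c_prev

rng = np.random.default_rng(0)
n, h = 6, 4
e_P = rng.normal(size=(n, h))        # intermediate passage representation
c_prev = rng.normal(size=(n, h))     # passage memory from turn r-1
c_psum_prev = rng.normal(size=h)     # passage summary from turn r-1
W = rng.normal(size=(2 * h, h)) * 0.1
c_r = gated_memory(e_P, c_prev, c_psum_prev, W)  # history-aware passage c^P_r
```

Because g lies in (0, 1) elementwise, each dimension of c^P_r is a convex combination of the fresh passage encoding and the accumulated memory.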
Specially, for the first question turn of the conversation, we directly use the intermediate representation generated by the passage encoder, i.e., c^P_1 = e^P.

Interaction Layer The passage and the r-th question interact in this layer. First, we obtain the exact-match feature ê^match_{r,t} and append it to the passage representation, [c^P_{r,t}; ê^match_{r,t}]. Then an attention over the question tokens fuses the question into the passage, yielding h^att_{r,t} (Eqs. 4-5). After that, we refine the representation of the passage with a bi-directional RNN, i.e., c̃^P_{r,t} = BiRNN(c̃^P_{r,t-1}, [c^P_{r,t}; h^att_{r,t}]). Meanwhile, self-attention is also used to enhance the representation (Eq. 6). Then another bi-directional RNN integrates the representations above and generates the new c^P_{r,t} (Eq. 7), which will be used by the next turn in Eq. 3. Lastly, we obtain the summary representation of the passage c^psum_r by a weighted sum of tokens, as in Eq. 2. It is used by the gated memory layer in the next turn (Eq. 3), by the gated memory layer for the current question (Eq. 8), and by the answer layer.

Gated Memory Layer (Question) As shown in Eq. 8, another gated memory leverages the information of the current question ĉ^qsum_r and the previous question memory h^qsum_{r-1}, where ĉ^qsum_r is processed by an RNN cell, i.e., ĉ^qsum_r = RNNcell(h^qsum_{r-1}, c^qsum_r). Specially, for the first question turn we use the representation of the current question as the question memory, i.e., h^qsum_1 = c^qsum_1. Then h^qsum_r is used in the answer layer and in the next turn.

Answer Layer This layer is the top layer of the question decoder module. Following pointer networks (Vinyals et al., 2015) and DrQA (Chen et al., 2017), we use the bilinear function f(x, y) = xWy to compute the probability of each token being the start or the end of the answer.
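The question-to-passage attention at the heart of the interaction layer can be sketched as follows. The scaled dot-product score used here is one common choice and an assumption on our part; the paper's exact scoring function (Eqs. 4-5) may differ.

```python
import numpy as np

def qp_attention(C_p, E_q):
    # each passage token attends over all question tokens;
    # scaled dot-product scoring is assumed, not taken from the paper
    S = C_p @ E_q.T / np.sqrt(C_p.shape[1])   # (n, m) score matrix
    S = S - S.max(axis=1, keepdims=True)       # numerical stability
    A = np.exp(S)
    A = A / A.sum(axis=1, keepdims=True)       # each row sums to 1
    return A @ E_q                              # h^att: question-aware passage

rng = np.random.default_rng(0)
C_p = rng.normal(size=(8, 6))   # n=8 history-aware passage states (toy)
E_q = rng.normal(size=(5, 6))   # m=5 question states (toy)
h_att = qp_attention(C_p, E_q)  # one question-aware vector per passage token
```

Self-attention (Eq. 6) is the same computation with the passage attending over itself.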
As shown in Eq. 9, we obtain the start probability p^s_t of each token in the passage; the end probability is computed analogously with separate parameters.
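A sketch of the bilinear pointer in numpy, following the DrQA-style form the paper cites; the softmax normalization over passage tokens is assumed.

```python
import numpy as np

def bilinear_start_probs(C_p, h_qsum, W):
    # p^s_t ∝ exp(f(c_t, h)) with f(x, y) = x W y
    logits = C_p @ W @ h_qsum
    logits = logits - logits.max()   # numerical stability
    p = np.exp(logits)
    return p / p.sum()               # probability over passage tokens

rng = np.random.default_rng(0)
C_p = rng.normal(size=(8, 6))   # final passage states for this turn (toy)
h_q = rng.normal(size=6)        # question memory h^qsum_r (toy)
W_s = rng.normal(size=(6, 6))   # start-pointer bilinear weights
p_start = bilinear_start_probs(C_p, h_q, W_s)
```

The end distribution would use a second weight matrix W_e in the same function.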

Dataset and Metric
We use CoQA (Reddy et al., 2018) as our evaluation dataset. It is a large-scale conversational question answering dataset annotated by humans. It contains 127k questions with answers, obtained from 8k conversations about text passages collected from seven diverse domains.
We use F1 as the metric, following the official evaluation.
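For reference, the core of this metric is a word-overlap F1 between the predicted span and a gold answer. The sketch below is a simplified single-reference version; the official CoQA script additionally normalizes text (articles, punctuation, casing) and averages over multiple human references.

```python
from collections import Counter

def token_f1(prediction, gold):
    # word-overlap F1 between a predicted span and one gold answer
    pred_toks = prediction.lower().split()
    gold_toks = gold.lower().split()
    common = Counter(pred_toks) & Counter(gold_toks)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)
```

For example, token_f1("in her rocking chair", "her rocking chair") gives precision 3/4 and recall 1, hence F1 = 6/7.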

Implementation Details
We use Adamax (Kingma and Ba, 2014) as our optimizer. The initial learning rate is 0.004 and is halved after 10 epochs.
We also conduct ablation studies for our model in Table 2. Both the gating mechanism inside the memory and the gated memory itself are crucial to our architecture; the score drops substantially when all gated memories are removed. Unlike for the passage, the ablation of the gating mechanism in the question memory is not included, because without the gate our model can only use the current question.
Lastly, we compare the storage-space occupancy of our model and others on the CoQA dataset in Table 3. Our model takes up the least space, with 0-ctx. Other models usually append two or three historical conversation turns to the current question. Counting the words in the training dataset, the space they use is about four times larger than ours, and the difference grows further once words are converted to vectors.

Conclusion
We propose a novel structure, GRMR, for conversational question answering. It integrates traditional extractive MRC models into a generalized sequence-to-sequence framework. The gated mechanism and recurrent memory enable the model to consider the latent semantics of the conversation history selectively and deeply with less space. The experiments show that this is a successful attempt to integrate extraction and generation in conversational question answering.


Figure 2: Overview of our model. (Best viewed in color.) The red solid line refers to the memory flow of the passage; the red dashed line refers to the memory flow of the questions.


Table 1: The performance on the CoQA dev set.

Table 2: Ablation studies on the CoQA dev set.

Table 3: Space occupancy on CoQA. (N-ctx refers to using the previous N QA pairs.)