Self- and Pseudo-self-supervised Prediction of Speaker and Key-utterance for Multi-party Dialogue Reading Comprehension

Multi-party dialogue machine reading comprehension (MRC) poses tremendous challenges since it involves multiple speakers in one dialogue, resulting in intricate speaker information flows and noisy dialogue contexts. To alleviate such difficulties, previous models focus on how to incorporate this information using complex graph-based modules and additional manually labeled data, which is usually rare in real scenarios. In this paper, we design two labour-free self- and pseudo-self-supervised prediction tasks on speaker and key-utterance to implicitly model the speaker information flows and capture salient clues in a long dialogue. Experimental results on two benchmark datasets justify the effectiveness of our method over competitive baselines and current state-of-the-art models.


Introduction
Dialogue machine reading comprehension (MRC, Hermann et al., 2015) aims to teach machines to understand dialogue contexts so as to solve multiple downstream tasks (Yang and Choi, 2019; Lowe et al., 2015; Wu et al., 2017; Zhang et al., 2018). In this paper, we focus on question answering (QA) over dialogue, which tests the capability of a model to understand a dialogue by asking it questions with respect to the dialogue context. QA over dialogue is more challenging than QA over plain text (Rajpurkar et al., 2016; Reddy et al., 2019; Yang and Choi, 2019) because conversations are full of informal, colloquial expressions and discontinuous semantics. Among dialogue settings, multi-party dialogue brings even greater challenges than two-party dialogue (Sun et al., 2019; Cui et al., 2020) since it involves multiple speakers in one dialogue, resulting in complicated discourse structure and intricate speaker information flows. Besides, previous work has also pointed out that for long dialogue contexts, not all utterances contribute to the final answer prediction, since many of them are noisy and carry no useful information.

*Corresponding author. This paper was partially supported by Key Projects of National Natural Science Foundation of China (U1836222 and 61733011).
To illustrate the challenge of multi-party dialogue MRC, we extract a dialogue example from the FriendsQA dataset (Yang and Choi, 2019), shown in Figure 1. This single dialogue involves four different speakers with intricate speaker information flows. The arrows represent the direction of information flows, from senders to receivers. Consider the reasoning process of Q_1: a model should first notice that it is Rachel who had a dream and locate U_9, then solve the coreference resolution problem that "I" refers to Rachel and "you" refers to Chandler. This coreference knowledge must be obtained by considering the information flow from U_9 to U_8, which means Rachel speaks to Chandler. Q_2 follows a similar process: a model should be aware that U_10 is a continuation of U_9 and solve the above coreference resolution problem as well.
To tackle the aforementioned obstacles, we design a self-supervised speaker prediction task to implicitly model the speaker information flows, and a pseudo-self-supervised key-utterance prediction task to capture salient utterances in a long and noisy dialogue. In detail, the self-supervised speaker prediction task guides a carefully designed Speaker Information Decoupling Block (SIDB, introduced in Section 3.4) to decouple speaker-aware information, and the key-utterance prediction task guides a Key-utterance Information Decoupling Block (KIDB, introduced in Section 3.3) to decouple key-utterance-aware information. We finally fuse these two kinds of information and make final span prediction to get the answer of a question.
To sum up, the main contributions of our method are three folds: • We design a novel self-supervised speaker prediction task to better capture the indispensable speaker information flows in multi-party dialogue. Compared to previous models, our method requires no additional manually labeled data which is usually rare in real scenarios. • We design a novel key-utterance prediction task to capture key-utterance information in a long dialogue context and filter noisy utterances. • Experimental results on two benchmark datasets show that our model outperforms strong baselines by a large margin, and reaches comparable results to the current state-of-the-art models even under the condition that they utilized additional labeled data.
Related Work

Pre-trained Language Models
Recently, pre-trained language models (PrLMs), like BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), ALBERT (Lan et al., 2019), XLNet (Yang et al., 2019) and ELECTRA (Clark et al., 2020), have reached remarkable achievements in learning universal natural language representations by pre-training large language models on massive general corpora and fine-tuning them on downstream tasks (Socher et al., 2013; Wang et al., 2018; Wang et al., 2019; Lai et al., 2017). We argue that the self-attention mechanism (Vaswani et al., 2017) in PrLMs is in essence a variant of the Graph Attention Network (GAT, Veličković et al., 2017), which has an intrinsic capability of exchanging information. Compared to a vanilla GAT, a Transformer block, consisting of residual connections (He et al., 2016) and layer normalization (Ba et al., 2016), is more stable in training. Hence, it is chosen as the basic architecture of our SIDB (Section 3.4) and KIDB (Section 3.3) instead of a vanilla GAT.

Multi-party Dialogue Modeling
Several previous works study multi-party dialogue modeling on different downstream tasks such as response selection and dialogue emotion recognition. Hu et al. (2019) utilize the response-to (@) labels and a Graph Neural Network (GNN) to explicitly model the speaker information flows. A pre-training task named Topic Prediction has also been designed to equip PrLMs with the ability of tracking parallel topics in a multi-party dialogue. Jia et al. (2020) make use of an additionally labeled dataset to train a dependency parser, then utilize the parser to disentangle parallel threads in multi-party dialogues. Ghosal et al. (2019) propose a window-based heterogeneous Graph Convolutional Network (GCN) to model the emotion flow in multi-party dialogues.

Speaker Information Incorporation
In dialogue MRC, speaker information plays a significant role in comprehending the dialogue context. Among the latest studies, a Mask-based Decoupling-Fusing Network (MDFN) has been proposed to decouple speaker information from dialogue contexts by adding inter-speaker and intra-speaker masks to the self-attention blocks of Transformer layers. However, this approach is restricted to two-party dialogue since it has to specify the sender and receiver roles of each utterance. Gu et al. (2020) propose Speaker-Aware BERT (SA-BERT) to capture speaker information by adding a speaker embedding at the token representation stage of the Transformer architecture, then pre-train the model using next sentence prediction (NSP) and masked language model (MLM) losses. Nonetheless, their speaker embedding lacks a well-designed pre-training task to refine it, resulting in inadequate speaker-specific information. Different from previous models, our model is suitable for the more challenging multi-party dialogue and is equipped with carefully designed tasks to better capture the speaker information.
In this part, we formulate our task and present our proposed model, as shown in Figure 2. There are four main parts in our model: a shared Transformer encoder, a key-utterance information decoupling block, a speaker information decoupling block and a final fusion-prediction layer. In the following sections, we introduce these modules in detail.

Task Formulation
A dialogue context can be denoted as C = {U_1, U_2, ..., U_N}, where N is the number of utterances and each utterance U_i = {S_i, W_i} consists of a speaker S_i specified by a name and a sequence of words W_i that speaker S_i utters. W_i can be denoted as an l_i-length sequence {w_i1, w_i2, ..., w_il_i}. Let the question corresponding to the dialogue context be Q = {q_1, q_2, ..., q_L}, where L is the length of the question and each q_i is a token of the question. Given C and Q, a dialogue MRC model is required to find an answer a to the question, which is restricted to be a continuous span of the dialogue context. In some datasets, a can be an empty string, indicating that there is no answer to the question according to the dialogue context.

Shared Transformer Encoder
To fully utilize the powerful representational ability of PrLMs, we employ a pack-and-separate method, which is supposed to take advantage of the deep Transformer blocks to make the context and question better interact with each other. We first pack the context and question as a joint input to feed into the Transformer blocks, then separate them according to position for further interaction.
Given the dialogue context C and a corresponding question Q, we pack them to form a sequence X = {[CLS] Q [SEP] S_1: U_1 [SEP] ... S_N: U_N [SEP]}, where [CLS] and [SEP] are two special tokens and each S_i: U_i pair is the name and utterance of a speaker separated by a colon. This sequence X is then fed into L_all − L layers of Transformer blocks to gain its contextualized representation E ∈ R^{J×d}, where J is the length of the sequence after being tokenized by the Byte-Pair Encoding (BPE) tokenizer (Sennrich et al., 2016) and d is the hidden dimension of the Transformer block. Here L_all is the total number of Transformer layers specified by the type of the PrLM, and L is a hyper-parameter denoting the number of decoupling layers.
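The packing step above can be sketched as follows. This is an illustrative helper, not the authors' code: the special-token conventions follow BERT, and the function name is our own.

```python
# Hypothetical sketch of the "pack" step: the question and the
# speaker-prefixed utterances are joined into one input sequence
# X = [CLS] Q [SEP] S_1: U_1 [SEP] ... S_N: U_N [SEP].

def pack_input(question, dialogue):
    """dialogue: list of (speaker_name, utterance_text) pairs."""
    parts = ["[CLS]", question, "[SEP]"]
    for speaker, utterance in dialogue:
        parts.append(f"{speaker}: {utterance}")
        parts.append("[SEP]")
    return " ".join(parts)

packed = pack_input(
    "Who had a dream?",
    [("Rachel Green", "I had a dream."), ("Chandler Bing", "Really?")],
)
```

The packed string is then tokenized as a whole, so the question and every utterance can attend to each other inside the shared encoder.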

Key-utterance Information Decoupling Block
Given the contextualized representation E from Section 3.2, we gather the representations of the [SEP] tokens from E as the representations of the utterances in the dialogue context. These representations are used to initialize N utterance nodes E_U = {E_u_i ∈ R^d}_{i=1}^N and a question node E_q ∈ R^d, as illustrated in the middle-upper part of Figure 2. The representations of normal tokens are gathered as token nodes E_T ∈ R^{d×n}, where n is the number of normal tokens in the dialogue context. Then, another L layers of multi-head self-attention Transformer blocks are used to exchange information inter- and intra- the three types of nodes:

Attention(Q, K, V) = softmax(QK^T / √d_k)V,
head_j = Attention(EW_j^Q, EW_j^K, EW_j^V),
MultiHead(E) = [head_1; ...; head_h]W^O,  (1)

where W_j^Q, W_j^K, W_j^V and W^O are matrices with trainable weights, h is the number of attention heads and [;] denotes the concatenation operation.
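The multi-head self-attention of Eq. (1) can be sketched in NumPy as below. This is a minimal stand-in for intuition only: the random matrices replace the trained W^Q, W^K, W^V, W^O, and no layer normalization or residual connection is shown.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(E, h=4, seed=0):
    """E: (n_nodes, d) node representations; returns (n_nodes, d)."""
    rng = np.random.default_rng(seed)
    n, d = E.shape
    dk = d // h
    heads = []
    for _ in range(h):
        Wq, Wk, Wv = (rng.standard_normal((d, dk)) / np.sqrt(d) for _ in range(3))
        Q, K, V = E @ Wq, E @ Wk, E @ Wv
        A = softmax(Q @ K.T / np.sqrt(dk))       # attention weights
        heads.append(A @ V)
    Wo = rng.standard_normal((h * dk, d)) / np.sqrt(h * dk)
    return np.concatenate(heads, axis=-1) @ Wo   # [head_1; ...; head_h] W^O

out = multi_head_self_attention(np.ones((6, 8)))
```

Every node (utterance, question, or token) attends to every other node, which is how the three node types exchange information.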
After stacking L layers of multi-head self-attention, MultiHead([E_U; E_q; E_T]), to fully exchange information between these nodes, we get a question representation H_q ∈ R^d and the utterance representations H_U = {H_u_i ∈ R^d}_{i=1}^N. H_q is then paired with each H_u_i to conduct the key-utterance prediction task. In detail, we use the heuristic matching mechanism proposed by Mou et al. (2016) to calculate the matching score of the question representation and each utterance representation. Here we define a matching function Match(X, Y, activ), where X, Y ∈ R^{d×N}, as follows:

Match(X, Y, activ) = activ(a^T [X; Y; X − Y; X ⊙ Y]),  (2)

where ⊙ denotes element-wise multiplication and a ∈ R^{4d} is a vector with trainable weights. The activ is an activation function to get a probability distribution according to the downstream loss function, which can be chosen from softmax and sigmoid. In span-based dialogue MRC datasets, we set the pseudo-self-supervised key-utterance target based on the position of the answer span.

Figure 2: The overview of our model, which contains a shared Transformer encoder, a key-utterance information decoupling block, a speaker information decoupling block and a fusion-prediction layer. In the speaker information decoupling block, a bi-directional arrow means that information flows from and to both sides, while a uni-directional arrow means that information only flows from start nodes to end nodes.

We name the target pseudo-self-supervised since it is generated from the original span labels but requires no additional labeled data. Specifically, we set p_target = i, where i is the index of the utterance that contains the answer span. Then we calculate the key-utterance distribution by:

P_U^pred = Match(H_q, H_U, softmax) ∈ R^N.

P_U^pred is later expanded to the length of the token nodes to get P_U^expand ∈ R^n, which will be put forward to filter noisy utterances in the fusion-prediction layer (introduced in Section 3.5).
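The matching step of Eq. (2) can be sketched as follows; the random vectors stand in for trained representations and for the weight vector a, so only the shapes and the [X; Y; X − Y; X ⊙ Y] feature construction should be read as the actual mechanism.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def match(X, Y, activ, a):
    """Heuristic matching (Mou et al., 2016): X, Y are (d, N), a is (4d,)."""
    feats = np.concatenate([X, Y, X - Y, X * Y], axis=0)  # (4d, N)
    return activ(a @ feats)                               # (N,) scores

d, N = 8, 5
rng = np.random.default_rng(0)
H_q = np.tile(rng.standard_normal((d, 1)), (1, N))  # question vector, repeated per utterance
H_U = rng.standard_normal((d, N))                   # utterance representations
a = rng.standard_normal(4 * d)
P_U = match(H_q, H_U, softmax, a)                   # key-utterance distribution over N utterances
```

With softmax as the activation, the scores form a distribution over the N utterances, which is what the cross-entropy target p_target supervises.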
We adopt cross-entropy loss to compute the loss of this task:

L_U = −log P_U^pred[p_target].

The gradient of L_U flows backwards to refine the representations of the utterance nodes so that they can decouple key-utterance-aware information from the original representations. After the interaction between token nodes and utterance nodes, the token nodes gather key-utterance-aware information from the utterance nodes. Therefore, we denote the token representations as key-utterance-aware: H_T^k = H_T ∈ R^{d×n}, which will be forwarded to the fusion-prediction layer described in Section 3.5.

Speaker Information Decoupling Block
This part is the core of our model, which contributes to modeling the complex speaker information flows. In this section, we first introduce the proposed self-supervised speaker prediction task, then depict the decoupling process of speaker information.

Self-supervised Speaker Prediction
As defined in Section 3.1, we have a dialogue context C = {U_1, U_2, ..., U_N}, where each utterance U_i = {S_i, W_i} consists of a speaker S_i specified by a name. We randomly choose the m-th utterance and mask its speaker name. Then for every (U_i, U_m) pair where i ≠ m, the model should determine whether the two utterances are uttered by the same speaker, that is to say, whether S_i = S_m.
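The construction of these binary targets can be sketched as below; the helper name and data are illustrative, and in training m would be sampled at random rather than fixed.

```python
# Build the self-supervised speaker targets: mask the speaker of the
# m-th utterance, then label every other utterance i with whether its
# speaker S_i equals the masked speaker S_m.

def build_speaker_targets(speakers, m):
    """speakers: list of speaker names; m: index of the masked utterance."""
    masked = speakers[m]
    return [int(s == masked) for i, s in enumerate(speakers) if i != m]

speakers = ["Rachel", "Chandler", "Rachel", "Monica"]
targets = build_speaker_targets(speakers, m=2)  # mask the third utterance
```

Here the third utterance's speaker (Rachel) is hidden, so only the first utterance forms a positive pair with it.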
We consider this task a relatively difficult one since it requires the model to have a thorough understanding of the speaker information flows and to solve problems such as coreference resolution. Figure 3 is an example of the self-supervised speaker prediction task, where the speaker of the utterance in gray is masked. We humans can determine that the masked speaker should be Emily Waltham by considering that Ross and Monica are persuading Emily to attend the wedding by showing her the wedding place, and when Monica and Emily reach there, it should be Emily who is surprised to say "Oh My God". However, it is not that easy for machines to capture these information flows.

Speaker Information Decoupling
To fully utilize the interactive feature of the self-attention mechanism (Vaswani et al., 2017) and the powerful representational ability of PrLMs, we also use Transformer blocks to capture the interactive speaker information flows and fulfill this difficult task.
We first detach E from the computational graph to get E_de. Then, as in Section 3.3, the representations of the [SEP] tokens are gathered from E_de to initialize N − 1 unmasked speaker nodes E_S = {E_s_i ∈ R^d}_{i=1}^{N−1} and a masked speaker node E_s_m ∈ R^d. The representations of normal tokens are gathered as token nodes E_T ∈ R^{d×n}. Then, we add an attention mask to the token nodes corresponding to the selected speaker name before they are forwarded into the speaker information decoupling block, as illustrated in the middle-lower part of Figure 2. The reasons why we use this detach-mask strategy are as follows. First, we mask the selected speaker before the speaker information decoupling block instead of at the very beginning before the encoder, since it is better to let the utterance decoupling block see all the speaker names. Based on this point, we detach E from the computational graph and add an attention mask to avoid target leakage. If we used a normal forward pass instead, the encoder would simply attend to the speaker names, which would hurt performance (discussed in detail in Section 5.3). Besides, this strategy also helps the model better decouple the key-utterance-aware and speaker-aware information from the original representations.
In detail, the mask strategy is similar to previous mask-based self-attention approaches. We modify Eq. (1) to:

Attention(Q, K, V, M) = softmax(QK^T / √d_k + M)V.

Let the start index and end index of the masked speaker tokens be m_s and m_e. To make the selected speaker name unseen to other nodes, the attention mask is obtained as follows:

M_S(i, j) = −∞ if m_s ≤ j ≤ m_e and i ∉ [m_s, m_e]; otherwise M_S(i, j) = 0.

By adding this mask, other nodes will not attend to the masked token nodes, thus preventing target leakage. In the meantime, the speaker nodes have to collect clues from other nodes through deep interaction to make the prediction, which implicitly models the complex speaker information flows. After stacking L layers of masked multi-head self-attention, MultiHead([E_S; E_s_m; E_T], M_S), we get a masked speaker representation H_s_m ∈ R^d, the normal speaker representations H_S = {H_s_i ∈ R^d}_{i=1}^{N−1}, and the token representations H_T ∈ R^{d×n}. H_s_m is then paired with each H_s_i to conduct the self-supervised speaker prediction task. We also adopt the matching function defined in Eq. (2):

P_S^pred = Match(H_s_m, H_S, sigmoid).

For convenience and without loss of generality, in the following description we make m = N, which means we mask the speaker of the N-th utterance. We construct the self-supervised target by:

P_S^target(i) = 1 if S_i = S_N, otherwise 0, for i = 1, ..., N − 1.

Then binary cross-entropy loss is applied to compute the loss of this task:

L_S = −Σ_{i=1}^{N−1} [P_S^target(i) log P_S^pred(i) + (1 − P_S^target(i)) log(1 − P_S^pred(i))].

The gradient of L_S flows backwards to refine the representations of the speaker nodes so that they can decouple speaker-aware information from the original representations. After the interaction between token nodes and speaker nodes, the token nodes gather speaker-aware information from the speaker nodes. Therefore, we denote the token representations as speaker-aware: H_T^s = H_T ∈ R^{d×n}, which will be forwarded to the fusion-prediction layer described in the next section.
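The additive attention mask M_S can be sketched as below. This is an illustrative construction under the reading that only the masked speaker's own positions may still see themselves; the exact masking of the masked span's self-attention is an assumption on our part.

```python
import numpy as np

# Build the speaker attention mask: the positions of the masked speaker
# name (indices m_s..m_e) are hidden from all OTHER nodes by a -inf
# additive mask, so they cannot leak the prediction target.

def speaker_mask(n_nodes, m_s, m_e):
    M = np.zeros((n_nodes, n_nodes))
    for i in range(n_nodes):
        if not (m_s <= i <= m_e):        # every node outside the masked span...
            M[i, m_s:m_e + 1] = -np.inf  # ...cannot attend to the masked tokens
    return M

M_S = speaker_mask(n_nodes=6, m_s=2, m_e=3)
# M_S is added to the attention logits: softmax(QK^T / sqrt(d_k) + M_S)
```

After the softmax, the −∞ entries become zero attention weights, which is exactly the "cannot attend" behavior described above.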

Fusion-Prediction Layer
Given the key-utterance-aware token representations H_T^k and the speaker-aware token representations H_T^s, we first fuse these two kinds of decoupled representations using the following transformation:

H_T^f = Tanh(W_f [H_T^k; H_T^s]),

where W_f ∈ R^{d×2d} is a linear transformation matrix with trainable weights and Tanh is a non-linear activation function. Then we compute the start and end distributions over the tokens by:

P_start = softmax((w_start^T H_T^f) ⊙ P_U^expand),
P_end = softmax((w_end^T H_T^f) ⊙ P_U^expand),

where w_start and w_end are vectors of size R^d with trainable weights, P_U^expand is defined in Section 3.3 and ⊙ is element-wise multiplication. Given the ground-truth answer span [a_s, a_e], cross-entropy loss is adopted to train our model:

L_span = −(log P_start(a_s) + log P_end(a_e)).  (12)

If the dataset contains unanswerable questions, the representation x of H_T^f at the [CLS] position is used to predict whether a question is answerable or not:

P_a = sigmoid(w^T x + b),

where w ∈ R^d is a vector with trainable weights and b is a bias term. Given the ground truth of answerability t_a ∈ {0, 1}, binary cross-entropy is applied to compute the answerable loss:

L_a = −[t_a log P_a + (1 − t_a) log(1 − P_a)].  (14)

The final loss is the summation of the above losses:

L = L_span + L_U + L_S + L_a.
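The fusion and span-scoring steps can be sketched as below. All weights are random stand-ins for trained parameters, and the exact point where P_U^expand multiplies the logits (before the softmax here) is our assumption.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, n = 8, 10
H_k = rng.standard_normal((d, n))      # key-utterance-aware token representations
H_s = rng.standard_normal((d, n))      # speaker-aware token representations

# Fuse the two decoupled views: H_f = Tanh(W_f [H_k; H_s])
W_f = rng.standard_normal((d, 2 * d))
H_f = np.tanh(W_f @ np.concatenate([H_k, H_s], axis=0))  # (d, n)

# Score start/end positions, damped by the expanded key-utterance distribution
w_start, w_end = rng.standard_normal(d), rng.standard_normal(d)
P_expand = softmax(rng.standard_normal(n))               # per-token utterance weight
p_start = softmax((w_start @ H_f) * P_expand)
p_end = softmax((w_end @ H_f) * P_expand)
```

Multiplying by P_expand down-weights tokens that sit in utterances the key-utterance predictor considers irrelevant, which is the noise-filtering role described above.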

Benchmark Datasets
We adopt FriendsQA (Yang and Choi, 2019) and Molweni, two span-based extractive dialogue MRC datasets, as the benchmarks. Molweni is derived from the large-scale multi-party dialogue dataset, the Ubuntu Chat Corpus (Lowe et al., 2015), whose main theme is technical discussion about problems of the Ubuntu system. This dataset features an informal speaking style and domain-specific technical terms. In total, it contains 10,000 dialogues, whose average and maximum numbers of speakers are 3.51 and 9 respectively. Each dialogue is short in length, with average and maximum numbers of tokens of 104.4 and 208 respectively. Unanswerable questions are asked in this dataset, hence the answerable loss in Eq. (14) is applied. Additionally, this dataset is equipped with discourse parsing annotations, which are however not used by our model. To evaluate our model more comprehensively, another open-domain dialogue MRC dataset, FriendsQA, is also used in our experiments. FriendsQA excerpts 1,222 scenes and 10,610 open-domain questions from the first four seasons of the well-known American TV show Friends to tackle dialogue MRC on everyday conversations. Each dialogue is longer in length and involves more speakers, resulting in more complicated speaker information flows compared to Molweni. For each dialogue context, at least 4 out of 6 types (5W1H) of questions are generated. This dataset features a colloquial language style filled with sarcasm, metaphor, humor, etc.

Implementation Details
We implement our model based on the Transformers library (Wolf et al., 2020). The number of information decoupling layers L is chosen from 3 to 5 according to the type of the PrLM in our experiments. For Molweni, we set the batch size to 8, the learning rate to 1.2e-5 and the maximum input sequence length of the Transformer blocks to 384. For FriendsQA, they are 4, 4e-6 and 512 respectively. Note that in FriendsQA, there are dialogue contexts whose lengths (in tokens) are larger than 512. We split those contexts into pieces and choose the answer with the highest span probability p_start * p_end as the final prediction.
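The long-context handling can be sketched as follows; the stride value and helper names are illustrative, not the paper's reported settings.

```python
# Split over-long contexts into overlapping pieces, then keep the span
# with the highest p_start * p_end across all pieces.

def split_context(tokens, max_len=512, stride=128):
    pieces, start = [], 0
    while start < len(tokens):
        pieces.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
        start += stride
    return pieces

def best_span(candidates):
    """candidates: list of (answer_text, p_start, p_end) across pieces."""
    return max(candidates, key=lambda c: c[1] * c[2])[0]

pieces = split_context(list(range(1000)), max_len=512, stride=128)
answer = best_span([("you and I", 0.5, 0.4), ("Chandler Bing", 0.7, 0.6)])
```

Overlapping windows keep answers that would otherwise be cut at a piece boundary recoverable from at least one piece.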

Baseline Models
For FriendsQA, we adopt BERT as the baseline model, following Li and Choi (2020). For Molweni, we follow previous work that also employs BERT as the baseline model. In addition, we also adopt ELECTRA (Clark et al., 2020) as a strong baseline on both datasets to see if our model still holds on top of stronger PrLMs.

Results
Table 1 shows our experimental results on FriendsQA. BERT_ULM+UOP (Li and Choi, 2020) is a method using a pretrain-fine-tune paradigm. They first pre-train BERT on FriendsQA and additional transcripts from Seasons 5-10 of Friends using the well-designed pre-training tasks Utterance-Level Masked LM (ULM) and Utterance Order Prediction (UOP), then fine-tune it on the dialogue MRC task. BERT_graph is a graph-based model that integrates relation knowledge and coreference knowledge using Relational Graph Convolutional Networks (R-GCNs) (Schlichtkrull et al., 2018). Note that this model utilizes additional labeled data on coreference resolution (Chen et al., 2017) and character relations. We adopt the same evaluation metrics as previous works. DAD-Graph is the current SOTA model that utilizes a Graph Convolutional Network (GCN) and the additional discourse annotations in Molweni to explicitly model the discourse structure.
We see from the table that our model outperforms strong baselines and the current SOTA model by a large margin, even under the condition that we do not make use of additional discourse annotations.

Performance Gain Analysis
To get more detailed insights into our proposed method, we analyze the results on the different question types of FriendsQA over the ELECTRA-based model. We also compare our model with the baseline model on these types to see where the performance gains come from. Table 3 shows the results of our model on different question types. Dist. means the distribution of each question type, from which we see that the question types of FriendsQA are nearly uniformly distributed. Performance gains mainly come from the question types Who, When and What. We argue that the speaker information decoupling block is the predominant contributor to the Who question type, since answering this type of question requires the model to have a deep understanding of speaker information flows and to solve problems like coreference resolution, which is exactly what our self-supervised speaker prediction task trains for. For the question type When, the key-utterance information decoupling block contributes the most. The answer to a When question usually comes from a scene-description utterance, hence grabbing key-utterance information helps answer this kind of question. Among these improvements, the question type Who benefits the most from our model, demonstrating the strong capability of the self-supervised speaker prediction task.

Ablation Study
We conduct an ablation study to see the contribution of each module. Table 4 shows the results. Here KIDB and SIDB are the abbreviations of the Key-utterance Information Decoupling Block and the Speaker Information Decoupling Block, respectively.
To further investigate the effectiveness of our self-supervised speaker prediction task, we design a SpeakerEmb model in which we replace the speaker-aware token representations H_T^s with plain speaker representations. The speaker representations are obtained by simply gathering embeddings from a trainable embedding look-up table according to the name of the speaker. Experimental results show that this brings only a slight performance gain, demonstrating that simply adding speaker information is sub-optimal compared to implicitly modeling speaker information flows with our self-supervised speaker prediction task.

Influence of Detaching Operation
We conduct experiments to investigate the influence of the detaching operation mentioned in Section 3.4. As shown in Table 5, removing the detaching operation hurts performance, since without it the encoder can simply attend to the masked speaker names, causing target leakage.

Figure 4 illustrates the model performance with regard to the number of speakers and utterances on FriendsQA. At the beginning, the baseline model has similar performance to our model. However, as the number of speakers and utterances increases, there is a growing performance gap between the baseline model and our model. This observation demonstrates that our SIDB and KIDB have a strong ability to deal with more complex dialogue contexts with larger numbers of speakers and utterances.

Figure 5 (first case). Context: see Figure 1. Question: Who was with Rachel in her dream? Baseline answer: "you and I". Our model's answer: "Chandler Bing".

Case Study
To get more intuitive explanations of our model, we select two cases from FriendsQA in which the baseline model fails to answer (F1 = 0, i.e., no overlap with the gold answer) but our model answers with an exact match. Figure 5 illustrates the two cases; the context of the first one is shown in Figure 1.
In the first case, the baseline model simply predicts that "you and I" were in Rachel's dream, failing to notice that "you" here refers to Chandler. On the contrary, our model is able to capture this information since it is required by the speaker prediction task. In fact, if we mask Rachel in U_9, our model can tell that the masked speaker is Rachel, indicating that it knows it should be Rachel who had a dream and that U_9 is in response to U_8. Similar observations can be made in the second case. The baseline model simply matches the semantic meaning of the question and the context and makes a wrong prediction. In contrast, our model is able to catch the information flow from Rachel to Monica and thus predicts the answer correctly.

Conclusion
In this paper, we propose two novel self- and pseudo-self-supervised prediction tasks on speaker and key-utterance for multi-party dialogue MRC, to implicitly model speaker information flows and capture salient clues in a long and noisy dialogue. Experimental results on two multi-party dialogue MRC benchmarks, FriendsQA and Molweni, justify the effectiveness of our model.

Acknowledgement
This paper was funded by the Chinese National Key Laboratory of Science and Technology on Information System Security; we thank them for their generous support.