Question-Interlocutor Scope Realized Graph Modeling over Key Utterances for Dialogue Reading Comprehension

In this work, we focus on dialogue reading comprehension (DRC), a task that extracts answer spans for questions from dialogues. Dialogue context modeling in DRC is tricky due to complex speaker information and noisy dialogue context. To address these two problems, previous research proposes two self-supervised tasks: guessing who a randomly masked speaker is according to the dialogue, and predicting which utterance in the dialogue contains the answer. Although these tasks are effective, two pressing problems remain: (1) randomly masking speakers regardless of the question cannot map the speaker mentioned in the question to the corresponding speaker in the dialogue, and ignores the speaker-centric nature of utterances. This leads to wrong answers extracted from utterances in unrelated interlocutors' scopes; (2) the single-utterance prediction, which prefers utterances similar to the question, is limited in finding answer-contained utterances that are not similar to the question. To alleviate these problems, we first propose a new key-utterance-extracting method. It performs prediction on units formed by several contiguous utterances, which can recall more answer-contained utterances. Based on the utterances in the extracted units, we then propose Question-Interlocutor Scope Realized Graph (QuISG) modeling. As a graph constructed over the text of utterances, QuISG additionally involves the question and question-mentioned speaker names as nodes. To realize interlocutor scopes, speakers in the dialogue are connected with the words in their corresponding utterances. Experiments on the benchmarks show that our method achieves better or competitive results compared with previous works.


Introduction
Beyond the formal forms of text, dialogues are one of the most frequently used media through which people communicate with others to informally deliver their emotions (Poria et al., 2019), opinions (Cox et al., 2020), and intentions (Qin et al., 2021). Moreover, dialogue is also a crucial information carrier in literature, such as novels and movies (Kociský et al., 2018), for people to understand the characters and plots (Sang et al., 2022) in their reading. Therefore, comprehending dialogues is a key step for machines to act like humans.
Despite the value of dialogues, reading comprehension over dialogues (DRC), which extracts answer spans for independent questions from dialogues, lags behind that of formal texts like news and Wikipedia articles. The reason mainly comes from the distinctive features of dialogues. Specifically, dialogues involve informal oral utterances which are usually short and incomplete, and understanding them thus highly depends on their loosely structured dialogue context. As a high-profile spot in the conversation-related domain, dialogue context modeling is also a major scientific problem in DRC.
In previous works, Li and Zhao (2021) (abbreviated as SelfSuper) point out that dialogue context modeling in DRC faces two challenges: complex speaker information and noisy question-unrelated context. For speaker information, SelfSuper design a self-supervised task guessing who a randomly masked speaker is according to the dialogue context (e.g., masking "Monica Geller" of #10 in Fig. 1). To reduce noise, another task is designed to predict whether an utterance contains the answer.
Although decent performance can be achieved, several pressing problems still exist.
Firstly, speaker guessing is not aware of the speaker information in questions or the interlocutor scope. As random masking is independent of the question, it cannot tell which speaker in the dialogue is related to the speaker mentioned in the question, e.g., Joey Tribbiani to Joey in Q1 of Fig. 1. As for the interlocutor scope, we define it as the utterances said by the corresponding speaker. We point out that utterances have a speaker-centric nature. First, each utterance has target listeners. For example, Utter. #10 in Fig. 1 requires understanding that Joey is a listener, so "you had the night" is making fun of Joey from Monica's scope. Second, an utterance reflects the experience of its speaker. For example, answering Q1 in Fig. 1 requires understanding that "stayed up all night talking" is an experience appearing in Joey's scope. Because it ignores the question-mentioned interlocutor and its scope, SelfSuper provides a wrong answer.
Secondly, answer-contained utterance (denoted as key utterance by SelfSuper) prediction prefers utterances similar to the question, failing to find key utterances that are not similar to the question. The reason is that answers are likely to appear in utterances similar to the question. For example, about 77% of questions have answers in the top-5 utterances most similar to the question according to SimCSE (Gao et al., 2021) in the dev set of FriendsQA (Yang and Choi, 2019). Furthermore, the utterances extracted by the key utterance prediction have over 82% overlap with these top-5 utterances. Therefore, a considerable number of key utterances are ignored, leading to overrated attention to similar utterances, e.g., Q2 in Fig. 1. In fact, many key utterances are likely to appear near question-similar utterances, because contiguous utterances in a local context tend to be on one topic relevant to the question (Xing and Carenini, 2021; Jiang et al., 2023). However, single utterance prediction cannot realize this.
To settle the aforementioned problems, so that more answer-contained utterances can be found and the answering process realizes the question and interlocutor scopes, we propose a new pipeline framework for DRC. We first propose a new key-utterance-extracting method. The method slides a window through the dialogue, and the contiguous utterances in the window are regarded as a unit; the prediction is made on these units. Based on the utterances in predicted units, we then propose Question-Interlocutor Scope Realized Graph (QuISG) modeling. QuISG constructs a graph over the contextualized embeddings of words. The question and the speaker names mentioned in the question are explicitly present in QuISG as nodes. To remind the model of interlocutor scopes, QuISG connects every speaker node in the dialogue with the words from the speaker's scope. We verify our model on two popular DRC benchmarks. Our model achieves decent performance against baselines on both benchmarks, and further experiments indicate the efficacy of our method.

Related Work
Dialogue Reading Comprehension. Unlike traditional Machine Reading Comprehension (Rajpurkar et al., 2016), Dialogue Reading Comprehension (DRC) aims to answer a question according to a given dialogue. There are several related but different types of conversational question answering: CoQA (Reddy et al., 2018) conversationally asks questions after reading Wikipedia articles. QuAC (Choi et al., 2018) forms a dialogue of QA between a student and a teacher about Wikipedia articles. DREAM (Sun et al., 2019) answers multi-choice questions over dialogues from English exams. These works form QA pairs as a conversation between humans and machines. To understand the characteristics of speakers, Sang et al. (2022) propose TVShowGuess, in a multi-choice style, to predict unknown speakers in dialogues.
Conversely, we focus on DRC, which extracts answer spans from a dialogue for an independent question (Yang and Choi, 2019). For DRC, Li and Choi (2020) propose several pretraining and downstream tasks on the utterance level. To consider the coreference of speakers and interpersonal relationships between speakers, Liu et al. (2020) introduce these two types of knowledge from other dialogue-related tasks and construct a graph to model them. Besides, Li et al. (2021) and Ma et al. (2021) model the knowledge of discourse structure of utterances in dialogues. To model the complex speaker information and noisy dialogue context, two self-supervised tasks, i.e., masked-speaker guessing and key utterance prediction, are utilized or enhanced by Li and Zhao (2021), Zhu et al. (2022), and Yang et al. (2023). However, existing work ignores explicitly modeling the question and speaker scopes and suffers from low key-utterance coverage.

Dialogue Modeling with Graph Representations. In many QA tasks (Yang et al., 2018; Talmor et al., 2019), graphs are the main carrier for reasoning (Qiu et al., 2019; Fang et al., 2020; Yasunaga et al., 2021). As for dialogue understanding, graphs remain a hotspot for various purposes. In dialogue emotion recognition, graphs are constructed to consider the interactions between different parties of speakers (Ghosal et al., 2019; Ishiwatari et al., 2020; Shen et al., 2021). In dialogue act classification, graphs model cross-utterance and cross-task information (Qin et al., 2021). In dialogue semantic modeling, Bai et al. (2021) extend AMR (Banarescu et al., 2013) to construct graphs for dialogues. As for DRC, graphs are constructed for knowledge propagation between utterances in the works mentioned above (Liu et al., 2020; Li et al., 2021; Ma et al., 2021).

Task Definition
Given a dialogue consisting of N utterances: D = [utter_1, utter_2, ..., utter_N], the task aims to extract the answer span a for a question q = [qw_1, qw_2, ..., qw_{L_q}] from D, where qw_i is the i-th word in q and L_q is the length of q. In D, each utterance utter_i = {speaker: s_i, text: t_i} contains its corresponding speaker (e.g., s_i = "Chandler Bing") and text content t_i = [tw_1, tw_2, ..., tw_{L_i}], where tw_j is the j-th word in t_i and L_i is the length of t_i. For some unanswerable questions, there is no answer span to be found in D. Under such a circumstance, a is assigned to be null.
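The task instance above can be sketched minimally in code; the dictionary keys and the span-locating helper below are illustrative assumptions, not the benchmarks' actual schema.

```python
# A minimal sketch of a DRC instance; field names are illustrative assumptions.
dialogue = [
    {"speaker": "Joey Tribbiani", "text": "I stayed up all night talking."},
    {"speaker": "Monica Geller", "text": "So you had the night."},
]
question = "What did Joey do all night ?".split()

def find_answer_span(dialogue, answer):
    """Return (utterance index, char start, char end) of the answer span,
    or None for an unanswerable question (a = null)."""
    for i, utter in enumerate(dialogue):
        pos = utter["text"].find(answer)
        if pos != -1:
            return (i, pos, pos + len(answer))
    return None
```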

Conversational Context Encoder
To encode words contextually using pretrained models (PTM), following previous work (Li and Zhao, 2021), we chronologically concatenate the utterances of a conversation into a text sequence C. Holding the conversational context C, the PTM can deeply encode C together with the question q to make it question-aware, by concatenating them as QC = "[CLS] q [SEP] C [SEP]" (C may also go first). Following Li and Zhao (2021), we utilize the ELECTRA discriminator to encode the sequence QC, obtaining contextual representations H_Q for the question tokens and H_C for the dialogue tokens, where L_C is the length of C.
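The concatenation step can be sketched as below; the "speaker: text" serialization of each utterance is an assumption about the exact format, since the paper only specifies chronological concatenation.

```python
def build_input(question_words, utterances):
    """Form the PTM input "[CLS] q [SEP] C [SEP]", where C chronologically
    concatenates the utterances (the "speaker: text" format is an assumption)."""
    C = " ".join(f"{u['speaker']}: {u['text']}" for u in utterances)
    return "[CLS] " + " ".join(question_words) + " [SEP] " + C + " [SEP]"
```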

Key Utterances Extractor
Treating every single utterance as a unit to pair with the question prefers utterances similar to the question. However, the utterance containing the answer is not always such an utterance; it can appear near a similar utterance within several steps due to the high relevance of local dialogue topics.
The key utterance extractor aims to extract more answer-contained utterances. We slide a window along the dialogue, and the utterances in the window are treated as one unit, so that a question-similar utterance and an answer-contained utterance can co-occur and more answer-contained utterances can be recalled.

Training the Extractor
With a window of size m, [utter_i, utter_{i+1}, ..., utter_{i+m}] is grouped as a unit. Mapping the start (st_i) and end (ed_i) positions of the unit in C, the representation of the unit is computed by mean-pooling: u_i = mean(H_C[st_i : ed_i]). Similarly, the representation of the question is computed by mean-pooling H_Q. The correlation score between them is then computed by y_i = sigmoid(Linear([u_i; mean(H_Q)])), where Linear(·) is a linear unit mapping the dimension from R^{2d_h} to R. For a unit, if any utterance in it contains the answer, the label y^k_i of this unit is set to 1, otherwise 0. The training objective of the key utterances extractor on the dialogue D is the binary cross-entropy between y_i and y^k_i over all units.
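The unit scoring and its training objective can be sketched as follows, assuming mean-pooled unit and question representations and a sigmoid over the linear score; the helper names and shapes are our assumptions.

```python
import torch
import torch.nn as nn

def unit_scores(H_C, H_q_mean, spans, linear):
    """Score each window unit against the question.

    H_C:      (L_C, d) contextual token embeddings of the dialogue.
    H_q_mean: (d,) mean-pooled question representation.
    spans:    list of (st_i, ed_i) token spans, one per unit.
    linear:   nn.Linear(2 * d, 1) mapping [unit; question] to a logit.
    """
    logits = []
    for st, ed in spans:
        u = H_C[st:ed + 1].mean(dim=0)              # mean-pooled unit representation
        logits.append(linear(torch.cat([u, H_q_mean])))
    return torch.sigmoid(torch.stack(logits).squeeze(-1))

def extractor_loss(scores, labels):
    """Binary cross-entropy; label y^k_i = 1 iff any utterance in the
    unit contains the answer."""
    return nn.functional.binary_cross_entropy(scores, labels)
```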

Extracting Key Utterances
The extractor predicts whether a unit is related to the question. If y_i > 0.5, the unit is regarded as a question-related unit, and the utterances inside are all regarded as key utterances. To avoid involving too many utterances as key utterances, we rank all the units whose y_i > 0.5 and pick up the top-k units. For a question q, we maintain a key utterance set key to store the extracted key utterances. Specifically, when the i-th unit satisfies the above condition, [utter_i, ..., utter_{i+m}] are all considered to be added into key. If utter_i does not exist in key, key.add(utter_i) is triggered, otherwise skipped. After processing all the qualified units, key sorts the key utterances by sort(key, 1→N), where 1→N denotes chronological order.
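The selection procedure above can be sketched as a small function; thresholds and names follow the text, and the set-based deduplication is our reading of the key.add step.

```python
def extract_key_utterances(unit_scores, m, k=3, threshold=0.5):
    """Select the top-k units with score > threshold; every utterance in a
    selected window [i, i+m] becomes a key utterance, deduplicated and
    returned in chronological (1 -> N) order."""
    qualified = [(s, i) for i, s in enumerate(unit_scores) if s > threshold]
    top = sorted(qualified, reverse=True)[:k]   # rank qualified units, keep top-k
    key = set()
    for _, i in top:
        key.update(range(i, i + m + 1))         # utter_i ... utter_{i+m}
    return sorted(key)
```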
We observe that, in most cases, the key utterances in key are consecutive. When k=3 and m=2, the set is ordered as (utter_{i-m}, ..., utter_i, ..., utter_{i+m}), where utter_i is usually the similar utterance.

Question-Interlocutor Scope Realized Graph Modeling
To guide models to further realize the question, the speakers in the question, and the scopes of speakers in D, we construct a Question-Interlocutor Scope Realized Graph (QuISG) based on key. QuISG is formulated as G = (V, A), where V denotes the set of nodes and A denotes the adjacency matrix of edges. After the construction of QuISG, we utilize a node-type realized graph attention network to process it. We elaborate on QuISG below.

Nodes
We define several types of nodes for key utterances and the question.

Question Node: The question node denotes the questioning word (e.g., "what") of the question. The node representation is initialized by mean-pooling the representations of the question words: v.rep = mean(H_Q[qw_1, ..., qw_{L_q}]). We denote this type of node as v.t=qw.
Question Speaker Node: Considering speakers in the question can help the model realize which speakers and whose interactions the question focuses on. A question speaker node is derived from a speaker name recognized in the question. We use stanza (Qi et al., 2020) to perform NER, recognize person names (e.g., "ross") in the question, and pick up those names appearing in the dialogue as interlocutors. Then, we have v.rep=H_Q[ross] and v.t=qs. Additionally, if a question contains no speaker name or the picked name does not belong to the interlocutors in the dialogue, no question speaker node is involved.
Dialogue Speaker Node: Speakers appearing in the dialogue are crucial for dialogue modeling. We construct the speakers of key utterances as dialogue speaker nodes. As a speaker in the dialogue is identified by a full name (e.g., "Ross Gellar"), we compute the node embedding by mean-pooling the name mentions over all key utterances of the speaker: v.rep=mean(H_C[Ross_1, Gellar_1, ..., Ross_x, Gellar_x]), where x is the number of key utterances whose speaker name is "Ross Gellar". We set v.t=ds.
Dialogue Word Node: As the main body for answer extraction, words from all key utterances are positioned in the graph as dialogue word nodes. The embedding is initialized from the corresponding item of H_C. This type is set to v.t=dw.
Scene Node: In some datasets, a kind of utterance appears at the beginning of a dialogue and briefly describes the scene of the dialogue. If it is a key utterance, we set the words in it as scene nodes. Although we define the scene node, it still acts as a dialogue word node with v.t=dw. The only difference is the way it connects with dialogue speaker nodes, which we state in Sec. 3.4.2.

Edges
Edges connect the defined nodes. The adjacency matrix of edges is initialized as A = O. As QuISG is an undirected graph, A is symmetric. We denote the node of the x-th word in utter_i as v_x. For the word node v_x ∈ utter_i, we connect it with the other word nodes v_y ∈ utter_i (x − k_w ≤ y ≤ x + k_w) within a window of size k_w, i.e., A[v_x, v_y] = 1. For word nodes in other utterances (e.g., v_z ∈ utter_{i+1}), no edge is set between v_x and v_z. To remind the model of the scope of speakers, we connect every word node with the dialogue speaker node v_{s_i} it belongs to, i.e., A[v_x, v_{s_i}] = 1. To realize the question, we connect all word nodes with the question node v_q, i.e., A[v_x, v_q] = 1.
For the speakers mentioned in the question, we fully connect their question speaker nodes to model the interactions between these speakers, e.g., A[v_{qs_m}, v_{qs_n}] = 1. To remind the model which speaker in the dialogue is related, we connect each question speaker node v_{qs_m} with its dialogue speaker node v_{s_i}, i.e., A[v_{qs_m}, v_{s_i}] = 1. Furthermore, question speaker nodes are connected with the question node, e.g., A[v_{qs_m}, v_q] = 1.
If the scene description is selected as a key utterance, it is regarded as an utterance without speaker identification. We treat a scene node as a word node and follow the same edge construction as for word nodes. As the scene description may tell things about speakers, we utilize stanza to recognize speaker names and connect all scene nodes with the corresponding dialogue speaker nodes.
For every node in QuISG, we additionally add a self-connected edge, i.e., A[v, v] = 1.
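The edge rules above can be sketched as a small adjacency builder; the node-id bookkeeping (dicts mapping word nodes to utterances and speakers) is our own scaffolding, not the paper's notation.

```python
import numpy as np

def build_quisg_edges(n_nodes, word_utter, word_speaker, q_node, qs_nodes,
                      qs_to_ds, k_w=2):
    """Build the symmetric QuISG adjacency matrix A (node ids are hypothetical).

    word_utter:   {word_node: utterance_id} for dialogue word nodes.
    word_speaker: {word_node: dialogue_speaker_node}.
    q_node:       question node id; qs_nodes: question speaker node ids.
    qs_to_ds:     {question_speaker_node: dialogue_speaker_node}.
    """
    A = np.zeros((n_nodes, n_nodes), dtype=int)
    words = sorted(word_utter)
    for x in words:
        for y in words:                                    # words within window k_w,
            if word_utter[x] == word_utter[y] and abs(x - y) <= k_w:
                A[x, y] = A[y, x] = 1                      # same utterance only
        A[x, word_speaker[x]] = A[word_speaker[x], x] = 1  # interlocutor scope
        A[x, q_node] = A[q_node, x] = 1                    # question realization
    for m_ in qs_nodes:
        for n_ in qs_nodes:
            A[m_, n_] = A[n_, m_] = 1                      # question speakers interact
        A[m_, qs_to_ds[m_]] = A[qs_to_ds[m_], m_] = 1      # map to dialogue speaker
        A[m_, q_node] = A[q_node, m_] = 1
    np.fill_diagonal(A, 1)                                 # self-connected edges
    return A
```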

Node-Type Realized Graph Attention Network
The Node-Type Realized Graph Attention Network (GAT) is a T-layer stack of graph attention blocks (Velickovic et al., 2017). The input of GAT is a QuISG, and GAT propagates and aggregates messages between nodes through edges.
We initialize the graph representations by h^0_v = v.rep. A graph attention block mainly performs multi-head attention; we exemplify the computation with one head. To measure how important node v_n is to node v_m, the node-type realized attentive weight is computed by:

c_mn = a ([h_{v_m} ∥ r_{v_m.t}] w_q ∥ [h_{v_n} ∥ r_{v_n.t}] w_k)^⊤,
α_mn = exp(LReLU(c_mn)) / Σ_{v_o ∈ N_{v_m}} exp(LReLU(c_mo)), (5)

where r_{v_m.t} ∈ R^{1×4} is a one-hot vector denoting the node type of v_m, and a ∈ R^{1×2d_head}, w_q ∈ R^{(d_head+4)×d_head}, w_k ∈ R^{(d_head+4)×d_head} are trainable parameters. Furthermore, the graph attention block aggregates the weighted message by h^t_{v_m} = Σ_{v_n ∈ N_{v_m}} α_mn h^{t−1}_{v_n} W_o, where W_o ∈ R^{d_head×d_head} is a trainable parameter. By concatenating the weighted messages from all heads, the t-th graph attention block updates the node representation from h^{t−1}_{v_m} to h^t_{v_m}.
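One attention head of this block can be sketched as below. The decomposition of c_mn into two dot products is the standard GAT trick for computing all pairs at once; the exact form of the type-aware projection is reconstructed from the stated parameter shapes and should be read as an assumption.

```python
import torch
import torch.nn.functional as F

def node_type_attention(h, types, adj, w_q, w_k, a):
    """One node-type realized attention head, a sketch of Eq. (5).

    h:     (N, d) node states; types: (N, 4) one-hot node-type vectors
           (qw / qs / ds / dw); adj: (N, N) 0/1 adjacency with self-loops.
    w_q, w_k: (d + 4, d) type-aware projections; a: (2 * d,) attention vector.
    """
    q = torch.cat([h, types], dim=-1) @ w_q      # type-aware "query" projection
    k = torch.cat([h, types], dim=-1) @ w_k      # type-aware "key" projection
    d = q.size(1)
    # c_mn = a . [q_m || k_n] split into two dot products, then masked softmax
    c = (q @ a[:d]).unsqueeze(1) + (k @ a[d:]).unsqueeze(0)
    c = F.leaky_relu(c).masked_fill(adj == 0, float("-inf"))
    alpha = torch.softmax(c, dim=-1)             # attention over neighbours N_{v_m}
    return alpha @ h                             # weighted message aggregation
```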

Answer Extraction
After graph modeling, the nodes in the QuISG are mapped back into the original token sequence. We locate each dialogue word (scene) node v_x at its corresponding token and update the token representation by H_C[utter_i[x]] += h^T_{v_x}; for the speaker token representations H_C[Ross_i, Gellar_i] in key utterances, the mapped dialogue speaker node v_{s_i} updates them likewise with h^T_{v_{s_i}}. Denoting the updated token sequence as H'_C, the start and end logits over tokens, Y_srt, Y_end ∈ R^{1×L_C}, are computed from H'_C with the trainable parameters w_srt and w_end. For the answer span a, we denote its start index and end index as a_st and a_ed. The answer extracting objective J_ax is then the cross-entropy of a_st and a_ed under softmax(Y_srt) and softmax(Y_end). If there are questions without any answers, another header is applied to predict whether a question is answerable. The header computes the probability by p_na = sigmoid(Linear(H'_C[CLS])). By annotating every question with a label q ∈ {0, 1} to indicate answerability, a binary cross-entropy objective J_na is added. In this way, the overall training objective is J = J_ax + 0.5 * J_na.
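The span-extraction objective can be sketched as follows; the per-token scoring with a single weight vector per pointer is our assumption about how Y_srt and Y_end are produced.

```python
import torch
import torch.nn.functional as F

def span_loss(H, w_srt, w_end, a_st, a_ed):
    """Cross-entropy over start/end positions, a sketch of J_ax.

    H: (L_C, d) graph-updated token representations H'_C.
    w_srt, w_end: (d,) scoring vectors (shapes are assumptions).
    a_st, a_ed: gold start and end token indices.
    """
    y_srt = H @ w_srt                  # start logits Y_srt over tokens
    y_end = H @ w_end                  # end logits Y_end over tokens
    j_ax = (F.cross_entropy(y_srt.unsqueeze(0), torch.tensor([a_st]))
            + F.cross_entropy(y_end.unsqueeze(0), torch.tensor([a_ed])))
    return j_ax
```

An answerability loss J_na on p_na would be added on top, giving J = J_ax + 0.5 * J_na as in the text.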

Inference
Following Li and Zhao (2021), we extract the answer span by performing a beam search with a size of 5. We constrain the answer span to lie within one utterance to avoid answers crossing utterances. To further emphasize the importance of key utterances, we construct a scaling vector S ∈ R^{1×L_C}, where tokens belonging to key utterances keep a weight of 1 and tokens outside key utterances are assigned a scale factor 0 ≤ f ≤ 1. The scaling vector is multiplied with Y_srt and Y_end before softmax, and we then use the resulting probabilities for inference.
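The scaling step can be sketched as below, assuming the simple element-wise product on the logits described in the text.

```python
import numpy as np

def scaled_span_probs(y_srt, y_end, key_token_mask, f=0.5):
    """Down-weight logits outside key utterances before softmax, a sketch of
    applying the scaling vector S (0 <= f <= 1)."""
    S = np.where(key_token_mask, 1.0, f)   # 1 inside key utterances, f outside
    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()
    return softmax(y_srt * S), softmax(y_end * S)
```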

Experimental Settings
Datasets. Following Li and Zhao (2021), we conduct experiments on FriendsQA (Yang and Choi, 2019) and Molweni (Li et al., 2020). As our work does not focus on unanswerable questions, we also construct an answerable version of Molweni (Molweni-A) by removing all unanswerable questions.
Compared Methods. We compare our method with existing methods in DRC. ULM+UOP (Li and Choi, 2020) adapts several utterance-level tasks to pretrain and finetune BERT in a multitask setting. KnowledgeGraph (Liu et al., 2020) introduces and structurally models additional knowledge about speakers' coreference and social relations from other related datasets (Yu et al., 2020).

For key utterance extraction, the window size m is set to 2, and the top-3 units are considered. Other hyper-parameters are the same as those in the question-answering training. For question answering, we search the word-node window size k_w in {1, 2, 3} and the number of attention heads in {1, 2, 4}. We set the number of GAT layers to 5 for FriendsQA and 3 for Molweni; f is set to 0.5 for FriendsQA and 0.9 for Molweni. Other hyper-parameters are given in Appendix A. We use the Exact Match (EM) score and F1 score as the metrics.
Results and Discussion

Main Results
Tab. 1 shows the results achieved by our method and other baselines on FriendsQA. The baselines listed in the first three rows are all based on BERT. We can see that SelfSuper achieves better or competitive results compared with ULM+UOP and KnowledgeGraph, which indicates the effectiveness of its self-supervised tasks for speaker and key utterance modeling. When it comes to ELECTRA, the performance reaches a new elevated level, which shows that ELECTRA is more suitable for DRC. Compared with SelfSuper and EKIM, our method achieves significantly better performance. This improvement shows the advantage of both the higher coverage of answer-contained utterances by our extractor and the better graph representations of QuISG, which consider the question and interlocutor scopes.
Results on Molweni are listed in Tab. 2. Our approach still sets a new state of the art, especially with a significant improvement in EM scores. However, the absolute improvement is smaller compared to that on FriendsQA. This is mainly for two reasons. First, the baseline results are close to human performance on Molweni, so the space for improvement is smaller. Second, Molweni contains unanswerable questions, which are not the main focus of our work. To see how unanswerable questions affect the results, we further show the performance of our method and baselines on Molweni-A in Tab. 3, i.e., the subset of Molweni with only answerable questions. We observe that our method still achieves a better EM score against the baselines and a slightly better F1 score, which indicates that our method can better deal with answerable questions. As for unanswerable questions, we believe that better performance can be achieved with related techniques plugged into our method, which we leave to future work.
Comparing the performance of our method on FriendsQA and Molweni, we can observe that the improvement is more significant on FriendsQA. We think the reasons may be that (1) our key utterance extractor can cover more answer-contained utterances in FriendsQA, as will be shown in Fig. 3; and (2) questions mentioning speakers appear more frequently in FriendsQA than in Molweni, so QuISG can help achieve better graph representations on FriendsQA. On all accounts, this further demonstrates that our method alleviates the problems we focus on.

Ablation Study
To demonstrate the importance of our proposed modules, we conduct an ablation study. The results are shown in Tab. 4. We study the effects of node type information (NodeType); the key utterances extractor and its scaling factor on logits (KeyUttExt); the question and question speaker nodes (Q); and the edges between dialogue word nodes and dialogue speaker nodes that model interlocutor scopes (SpkScope). We further remove both KeyUttExt and QuISG, leading to full connections between every two tokens in dialogues, and apply transformer layers to further process dialogues (w/o All). Removing NodeType drops the performance, which demonstrates that minding different node behaviors helps build better graph representations. Our method w/o KeyUttExt decreases the performance, which demonstrates that the key utterance extractor is a crucial module for finding more answer-contained utterances and for guiding our model to pay more attention to the key parts of a dialogue. As the model w/o KeyUttExt shows a larger performance drop on FriendsQA, we think the reason may be that dialogues in FriendsQA are much longer than in Molweni; KeyUttExt can therefore remove more question-unrelated parts of dialogues before graph modeling on FriendsQA. Removing Q or SpkScope also shows a performance decline, which indicates the importance of realizing the question and interlocutor scopes. Replacing KeyUttExt and QuISG with transformer layers even performs worse than ELECTRA, which indicates that further processing of dialogues without speaker- and question-realized modeling is redundant.

Accuracy of Utterance Extraction
As we claim that our method covers more answer-contained utterances than SelfSuper (EKIM has similar results to SelfSuper), in this section we show the recall of answer-contained utterances by different methods. Besides our method and SelfSuper, we further consider retrieval methods from other reading comprehension tasks. As similarity-based seekers are commonly used, we apply the SOTA model SimCSE (Gao et al., 2021) to compute the similarity between utterances and the question. However, directly using the top similar utterances produces an extremely low recall. Therefore, we also add utterances around every picked top utterance as key utterances, as in our method. We consider the top-3 similar utterances and 4 context utterances around them. The results are illustrated in Fig. 3. As shown in Fig. 3, our method recalls more answer-contained utterances than both SelfSuper and the SimCSE-based retrieval. To further show that our method is more suitable for DRC than SimCSE, we run a variant with key utterances extracted by SimCSE. The results are shown in Tab. 5. Our method achieves better performance with higher coverage of answer-contained utterances and fewer key utterances.

Improvement on Questions with Speakers
As QuISG focuses on question speaker information and dialogue interlocutor scope modeling, it is crucial to verify whether it helps answer questions mentioning speaker names. We illustrate the F1 scores of questions involving different speakers in FriendsQA, and of questions with or without mentioned speakers, in Fig. 4. We can see that SelfSuper outperforms our method only on "Rachel" and is slightly better on "Joey" and "Monica". Our method outperforms SelfSuper by a great margin on "Ross", "Phoebe", "Chandler", and the other cast. Furthermore, our method improves the F1 score of speaker-containing questions by a wider margin than that of questions without speakers. This indicates that questions mentioning speakers benefit from our proposed speaker modeling.

Case Study
At the very beginning of the paper, Fig. 1 provides two cases in which SelfSuper fails. On the contrary, thanks to our proposed key utterances extractor and QuISG, our method answers the two questions correctly.

Conclusion
To cover more key utterances and make the model realize the speaker information in the question and the interlocutor scopes in the dialogue for DRC, we propose a new pipeline method. The method first adopts a new key utterances extractor that takes contiguous utterances as a unit for prediction. Based on the utterances of the extracted units, a Question-Interlocutor Scope Realized Graph (QuISG) is constructed. QuISG sets question-mentioned speakers as question speaker nodes and connects each speaker node in the dialogue with the words from its scope. Our proposed method achieves decent performance on the related benchmarks.

Limitation
As our method does not focus on dealing with unanswerable questions, it may not show a great advantage over other methods when there are many unanswerable questions. How to improve the recognition of this type of question, avoid overrating further modeling on them, and therefore give more accurate graph modeling on answerable questions is left to our future work. Besides, our speaker modeling prefers questions focusing on speakers, and it may show limited improvement if a dataset contains few speaker-related questions. However, speakers are key roles in dialogues, and therefore questions about speakers naturally appear frequently in DRC.
The applicability of our key utterance extraction method to other QA fields remains unknown. Extending it to other reading comprehension tasks like NarrativeQA (Kociský et al., 2018) can be future work.
Our method does not involve additional knowledge, such as speakers' coreference and relations (Liu et al., 2020), discourse structures of dialogues (Li et al., 2021; Ma et al., 2021), and decoupled bidirectional information in dialogues (Li et al., 2022). These types of knowledge, which are orthogonal to our work, are key components of dialogues. Therefore, making full use of such additional knowledge together with our graph modeling can be an interesting direction to explore.

A Computation Resource and Other Setup
We use a single NVIDIA GeForce 3090 with 24GB of memory. All experiments require no more than 24GB of memory. Our model takes 10-25 minutes to finish one training epoch.
As for the other hyperparameters used in our experiments, we follow Li and Zhao (2021) to set the learning rate to 4e-6 for FriendsQA and search the learning rate in [1.4e-5, 1.2e-5, 1e-5, 8e-6] for Molweni (Molweni-A). The batch size is set to 4 for FriendsQA and 8 for Molweni (Molweni-A). The number of epochs is set to 3 for FriendsQA and 5 for Molweni (Molweni-A). Evaluation is performed every 1/5 epoch for FriendsQA and every 1/2 epoch for Molweni (Molweni-A). For GAT, the dropout is set to 0.1. During training, the learning rate linearly warms up over the first 0.01 of all steps and then linearly decays to zero. AdamW with an adam epsilon of 1e-6 is used as the optimizer. We conduct 4 runs and pick the best result.
For SimCSE, we use the Transformers version of sup-simcse-roberta-large, which achieves the best performance among all SimCSE variants on Avg. STS.

Figure 1 :
Figure 1: Two questions with related dialogue clips on which the baseline SelfSuper (Li and Zhao, 2021) fails. Utter. #9 is too long, so we omit some parts of the utterance.

Figure 2 :
Figure 2: The overall framework of our proposed model. We first encode the dialogue and the question with pretrained models. The key utterances extractor takes contiguous utterances as a unit to extract key utterances. Based on the extracted key utterances, the question-interlocutor scope realized graph is constructed.

Figure 3 :
Figure 3: Recall of answer coverage using different key utterance extracting methods.

Figure 4 :
Figure 4: F1 scores of answers to the questions with or without speaker names in the dev set of FriendsQA.

Table 1 :
Results on FriendsQA. * denotes significance against SelfSuper with the t-test.
Implementation. Our model is implemented based on the ELECTRA-large-discriminator from Transformers.

Table 5 :
Results of our method and the variant with SimCSE searching for key utterances.