A Joint Model for Dropped Pronoun Recovery and Conversational Discourse Parsing in Chinese Conversational Speech

In this paper, we present a neural model for joint dropped pronoun recovery (DPR) and conversational discourse parsing (CDP) in Chinese conversational speech. We show that DPR and CDP are closely related and that a joint model benefits both tasks. Our model, called DiscProReco, first encodes the tokens in each utterance of a conversation with a directed Graph Convolutional Network (GCN). The token states of each utterance are then aggregated to produce a single utterance state. The utterance states are fed into a biaffine classifier to construct a conversational discourse graph. A second (multi-relational) GCN is then applied to the utterance states to produce a discourse relation-augmented representation of each utterance, which is fused with the token states of the utterance as input to a dropped pronoun recovery layer. The joint model is trained and evaluated on a new Structure Parsing-enhanced Dropped Pronoun Recovery (SPDPR) dataset that we annotated with both types of information. Experimental results on the SPDPR dataset and other benchmarks show that DiscProReco significantly outperforms the state-of-the-art baselines on both tasks.


Introduction
Pronouns are often dropped in Chinese conversations because the identity of the pronoun can be inferred from the context (Kim, 2000; Yang et al., 2015) without making the sentence incomprehensible. The task of dropped pronoun recovery (DPR) aims to locate the position of each dropped pronoun and identify its type. Conversational discourse parsing (CDP) is another important task that aims to analyze the discourse relations among utterances in a conversation, and it plays a vital role in understanding multi-turn conversations.

[Figure 1: A conversation snippet in which baselines that ignore the relation "(B3 expands B2) replies A2" mistakenly recover the dropped pronoun 你 (you) as 我 (I), since utterance B3 is considered semantically similar to A2.]
Existing work regards DPR and CDP as two independent tasks and tackles them separately. In an early attempt at DPR, Yang et al. (2015) employ a Maximum Entropy classifier to predict the position and type of dropped pronouns. Later work, including Yang et al. (2019), attempts to recover dropped pronouns by modeling their referents with deep neural networks. More recently, Yang et al. (2020) attempt to jointly predict all dropped pronouns in a conversation snippet by modeling dependencies between pronouns with general conditional random fields. A major shortcoming of these DPR methods is that they overlook the discourse relations (e.g., reply, question) between conversational utterances when exploiting the context of a dropped pronoun. At the same time, previous CDP methods (Li et al., 2014; Afantenos et al., 2015; Shi and Huang, 2019) first predict the relation for each utterance pair and then construct the discourse structure of the conversation with a decoding algorithm. The effectiveness of these methods is compromised because utterances with dropped pronouns may be incomplete.
To overcome these shortcomings, we propose a novel neural model called DiscProReco that performs DPR and CDP jointly. Figure 1 shows a Chinese conversation snippet between two speakers A and B that illustrates the advantages of such a joint approach. In this example, a pronoun 你 (you) is dropped in utterance B3. It is critical for the DPR model to know that both utterances B2 and B3 are in reply to utterance A2 when recovering this dropped pronoun. Methods that ignore the structure ("(B3 expands B2) replies A2") are more likely to consider utterance B3 semantically similar to A2 and wrongly recover the pronoun as 我 (I).
Given a pro-drop utterance and its context, DiscProReco parses the discourse structure of the conversation and recovers the dropped pronouns in the utterance in four steps: (i) Each utterance is parsed into its dependency structure and fed into a directed GCN to produce syntactic token states. The utterance state is then obtained by aggregating the token states in the utterance. (ii) The utterance states of a conversation are fed into a biaffine classifier to predict the discourse relation between each utterance pair, and the discourse structure of the conversation is constructed. (iii) Taking the discourse structure as input, another (multi-relational) GCN updates the utterance states and fuses them into the token states of each utterance to produce discourse-aware token representations. (iv) Based on this discourse structure-aware context representation, a pronoun recovery module recovers the dropped pronouns in the utterances. During training, all components are jointly optimized through parameter sharing so that CDP and DPR can benefit each other. As there is no public dataset annotated with both dropped pronouns and conversational discourse structures, we also construct the Structure Parsing-enhanced Dropped Pronoun Recovery (SPDPR) corpus, the first corpus annotated with both types of information. Experimental results show that DiscProReco outperforms all baselines on both CDP and DPR.
Contributions: This work makes the following contributions: (i) We propose a unified framework, DiscProReco, to jointly perform CDP and DPR, and show that the two tasks can benefit each other. (ii) We construct a new large-scale dataset, SPDPR (Section 4), which supports fair comparison across different methods and facilitates future research on both DPR and CDP. (iii) We present experimental results showing that the joint learning mechanism of DiscProReco realizes knowledge sharing between its CDP and DPR components and yields improvements on both tasks (Section 5). The code and the SPDPR dataset are available at https://github.com/ningningyang/DiscProReco.

Problem Formulation
We first introduce the problem formulation of the two tasks. Following the practice in (Yang et al., 2015, 2019, 2020), we formulate DPR as a sequence labeling problem: DPR aims to recover the dropped pronouns in an utterance by assigning to each token one of 17 labels indicating the type of pronoun that is dropped before the token (Yang et al., 2015). CDP is the task of constructing the conversational discourse structure by predicting the discourse relations (Xue et al., 2016) among utterances. A discourse relation may characterize one utterance as agreeing with, responding to, or indicating understanding of another utterance in the conversational context.
Let us denote an input pro-drop utterance of n tokens as X = (w_1, w_2, ..., w_n), and its contextual utterances as C = (X_1, X_2, ..., X_m), where the i-th contextual utterance X_i is a sequence of l_i tokens: X_i = (w_{i,1}, ..., w_{i,l_i}). Our task aims to (1) model the distribution P(X_j | X_i, C) to predict the relation between each pair of utterances (X_i, X_j) for CDP, and (2) model Ŷ = argmax_Y P(Y | X, C) to predict the recovered pronoun sequence Ŷ for the input utterance X. Each element of Ŷ is chosen from one of the T possible labels in Y = {y_1, ..., y_{T−1}} ∪ {None} to indicate whether a pronoun is dropped before the corresponding token in utterance X and, if so, the type of the dropped pronoun. The label "None" means no pronoun is dropped before the token.
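As a concrete illustration of this labeling scheme, the sketch below recovers pronouns from a predicted label sequence. The label strings and the tiny label set are hypothetical placeholders for illustration, not the paper's exact 17-type inventory:

```python
# A hypothetical, tiny subset of the T pronoun-type labels; "None" marks
# tokens with no pronoun dropped before them.
NONE = "None"
LABELS = ["我(I)", "你(you)", "他(he)", NONE]

def recover(tokens, labels):
    """Insert the predicted dropped pronoun before each labeled token."""
    out = []
    for tok, lab in zip(tokens, labels):
        if lab != NONE:
            out.append(lab.split("(")[0])  # keep only the Chinese pronoun
        out.append(tok)
    return out

# "(你) 去 过 北京" -- "(you) have been to Beijing"
print(recover(["去", "过", "北京"], ["你(you)", NONE, NONE]))
```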

Model Overview
The architecture of DiscProReco is illustrated in Figure 2. [Figure 2: Overview of DiscProReco, which explores conversational discourse structures to learn effective referent representations that are used to recover dropped pronouns. DiscProReco consists of four components, detailed in Section 3.] Given a pro-drop utterance X and its
context C, DiscProReco first represents the tokens of these utterances as d-dimensional pre-trained word embeddings (Li et al., 2018) and then feeds them into a BiGRU (Chung et al., 2014) network, representing the sequential token states X ∈ R^{n×d} and C ∈ R^{m×l_m×d} as the concatenation of the forward and backward hidden states output by the BiGRU. The syntactic dependency encoding layer then revises the sequential token states by exploiting the syntactic dependencies between tokens in the same utterance using a directed GCN, and generates utterance representations. After that, the biaffine relation prediction layer predicts the relation between each pair of utterances, and the discourse structure is constructed from the utterance nodes and the predicted relations. The discourse structure encoding layer further encodes the inter-utterance discourse structure with a multi-relational GCN and employs the discourse-based utterance representations to revise the syntactic token states. Finally, the pronoun recovery layer explores the referent semantics in the context C and predicts the dropped pronouns in each utterance.

Syntactic Dependency Encoding Layer
As the sequential token states overlook long-distance dependencies among tokens in an utterance, this layer takes in the sequential token states X and C and revises them into syntactic token states H_X and H_C by exploiting the syntactic dependencies between the tokens with a directed GCN. Specifically, for each input utterance in X and C, we first extract syntactic dependencies between the tokens with Stanford's Stanza dependency parser (Qi et al., 2020). Using the output of the dependency parser, we construct a syntactic dependency graph for each utterance in which the nodes represent the tokens and the edges correspond to the extracted syntactic dependencies between the tokens. Following the practices of (Marcheggiani and Titov, 2017; Vashishth et al., 2018), three types of edges are defined in the graph. The node states are initialized with the sequential token states X and C, and message passing is then performed over the constructed graph using the directed GCN (Kipf and Welling, 2017), referred to as SynGCN. The syntactic dependency representation of token w_{i,n} after the (k+1)-th GCN layer is defined as:

h^{k+1}_{w_{i,n}} = ReLU( Σ_{u ∈ N+(w_{i,n})} g^k_e · (W^k_e h^k_u + b^k_e) ),

where W^k_e ∈ R^{d×d} and b^k_e ∈ R^d are the edge-specific parameters, N+(w_{i,n}) = N(w_{i,n}) ∪ {w_{i,n}} is the set of w_{i,n}'s neighbors including itself, and ReLU(·) = max(0, ·) is the Rectified Linear Unit. g^k_e is an edge-wise gating mechanism which incorporates the edge importance as:

g^k_e = σ(ŵ^k_e h^k_u + b̂^k_e),

where ŵ^k_e ∈ R^{1×d} and b̂^k_e ∈ R are independent trainable parameters for each layer, and σ(·) is the sigmoid function. The revised syntactic token states H_X and H_C of the pro-drop utterance and the context are output for subsequent discourse structure prediction and pronoun recovery.
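The message-passing step above can be sketched in numpy as follows; the parameter shapes and the use of the source token's state as the gating input are assumptions based on the cited works, not the authors' released code:

```python
import numpy as np

def syngcn_layer(H, edges, params):
    """One directed-GCN (SynGCN) layer with edge-wise gating.

    H      : (n, d) sequential token states
    edges  : list of (src, dst, etype) dependency arcs, incl. self-loops
    params : etype -> dict with W (d, d), b (d,), w_gate (d,), b_gate (float)
    """
    out = np.zeros_like(H)
    for src, dst, etype in edges:
        p = params[etype]
        # edge-wise gate g_e = sigmoid(w_gate . h + b_gate)
        gate = 1.0 / (1.0 + np.exp(-(H[src] @ p["w_gate"] + p["b_gate"])))
        # gated, edge-type-specific message W_e h + b_e
        out[dst] += gate * (p["W"] @ H[src] + p["b"])
    return np.maximum(0.0, out)  # ReLU
```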

Biaffine Relation Prediction Layer
For conversational discourse parsing, we jointly predict the arc score s^(arc)_{i,j} and the relation score s^(rel)_{i,j} between each pair of utterances using the biaffine attention mechanism proposed in (Dozat and Manning, 2017). Given the syntactic token states H_X and H_C, we average the token states of each utterance X_i to obtain its syntactic utterance representation h_{X_i}.
For a pair of utterances (X_i, X_j) in the conversation snippet, we feed the representations of the two utterances into a biaffine function to predict the score of an arc from X_i to X_j as:

r^(arc-head)_i = MLP^(arc-head)(h_{X_i}),  r^(arc-dep)_j = MLP^(arc-dep)(h_{X_j}),
s^(arc)_{i,j} = r^(arc-head)_i ᵀ U^(arc) r^(arc-dep)_j + r^(arc-head)_i ᵀ u^(arc),

where MLP is a multi-layer perceptron that transforms the original utterance representations h_{X_i} and h_{X_j} into head- or dependent-specific utterance states r^(arc-head)_i and r^(arc-dep)_j, and U^(arc) and u^(arc) are the weight matrix and bias term used to determine the score of an arc.
One distinctive characteristic of conversational discourse parsing is that the head of each dependent utterance must be chosen from the utterances before it. We therefore apply an upper-triangular mask to the arc scores to constrain the predicted arc head:

s̃^(arc)_{i,j} = s^(arc)_{i,j} if i < j, and −∞ otherwise.

We minimize the cross-entropy of the gold head-dependent pairs of utterances as:

loss_arc = − Σ_j log P(head(X_j) | X_j, C),  with P(X_i | X_j, C) = softmax_i( s̃^(arc)_{i,j} ).

After obtaining the predicted directed unlabeled arc between each utterance pair, we compute the score distribution s^(rel)_{i,j} ∈ R^k of each arc X_i → X_j, in which the t-th element indicates the score of the t-th relation, using the arc label prediction function of (Dozat and Manning, 2017). In the training phase, we also minimize the cross-entropy between the gold relation labels and the predicted relations between utterances:

loss_label = − Σ_j log P_rel(rel_j | head(X_j), X_j, C).
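The arc scoring and masking steps can be sketched in numpy as below; the MLP projections are omitted and the variable names are ours:

```python
import numpy as np

def biaffine_arc_scores(R_head, R_dep, U, u):
    """s[i, j]: score of an arc with head utterance i and dependent j,
    i.e. a bilinear term plus a head-bias term (Dozat and Manning, 2017)."""
    return R_head @ U @ R_dep.T + (R_head @ u)[:, None]

def mask_future_heads(S):
    """A dependent's head must precede it in the conversation, so only
    scores with i < j survive (a strict upper-triangular mask)."""
    n = S.shape[0]
    return np.where(np.triu(np.ones((n, n)), k=1) > 0, S, -np.inf)
```

Row-wise softmax over the masked column j then gives the head distribution P(X_i | X_j, C) used in the arc loss.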

Discourse Structure Encoding Layer
After the relations are predicted, we construct the discourse structure as a multi-relational graph in which each node represents an utterance and each edge represents the relation between a pair of utterances. To utilize the discourse information in the dropped pronoun recovery process, we first encode the discourse structure and then use the discourse-based utterance representations to improve the token states that are used to model the pronoun referent. Specifically, we apply a multi-relational GCN (Vashishth et al., 2020), referred to as RelGCN, over the graph to encode discourse structure-based utterance representations R, and use the updated representations to further revise the syntactic token states H_X and H_C into discourse structure-based token states Z_X and Z_C. The node states of the graph are initialized as the average of the token states of the corresponding utterances. The representation of utterance X_i in the (k+1)-th layer is updated by incorporating the discourse relation state h^k_rel as:

r^{k+1}_i = ReLU( Σ_{(X_j, rel) ∈ N(X_i)} W^k_{λ(rel)} φ(r^k_j, h^k_rel) ),

where r^k_j and h^k_rel denote the representations of utterance j and relation rel after the k-th GCN layer, and W^k_{λ(rel)} ∈ R^{d×d} is a relation-type specific parameter. Following the practice of (Vashishth et al., 2020), we take the composition operator φ to be elementwise multiplication in this work. Please note that rather than taking the hard predicted relation by applying an argmax over the distribution, we take the label distribution P_rel(X_j | X_i, C) from the relation prediction layer and compute the weighted sum over relation types to update the utterance representation.
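One RelGCN layer with the soft relation weighting described above can be sketched as follows; treating the composition φ as an elementwise product follows the text, while the loop-based aggregation is just for clarity:

```python
import numpy as np

def relgcn_layer(R, h_rel, edges, P_rel, W):
    """One multi-relational GCN layer with soft relation labels.

    R     : (m, d) utterance states
    h_rel : (k, d) relation embeddings
    edges : list of arcs (i, j) from head X_i to dependent X_j
    P_rel : (i, j) -> (k,) relation distribution from the biaffine layer
    W     : (k, d, d) relation-type specific weight matrices
    """
    out = np.zeros_like(R)
    k = h_rel.shape[0]
    for (i, j) in edges:
        p = P_rel[(i, j)]
        for t in range(k):
            # composition phi(r_i, h_rel) taken as elementwise product,
            # weighted by the soft relation probability p[t]
            out[j] += p[t] * (W[t] @ (R[i] * h_rel[t]))
    return np.maximum(0.0, out)  # ReLU
```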
After encoding the constructed discourse structure with this message-passing process, we obtain the discourse relation-augmented utterance representations R, and then use the updated utterance representations to revise the syntactic token states with a linear feed-forward network:

z_{w_{i,n}} = W_f [h^{k+1}_{w_{i,n}} ; r^{k+1}_i] + b_f,

where h^{k+1}_{w_{i,n}} refers to the token state of w_{i,n} output from the (k+1)-th layer of SynGCN, and r^{k+1}_i refers to the state of the utterance containing the token, output from the (k+1)-th layer of RelGCN. This operation augments the syntactic token states H_X and H_C with the discourse-based utterance representations to obtain discourse context-based token states Z_X = (z_{w_1}, ..., z_{w_n}) and Z_C = (z_{w_{1,1}}, ..., z_{w_{m,l_m}}), which are used to model the referent semantics of the dropped pronoun in the pronoun recovery layer.
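The fusion step can be sketched as a single linear map over the concatenated token and utterance states; concatenation is our assumption, as the paper specifies only a "linear feed-forward network":

```python
import numpy as np

def fuse_token_state(h_token, r_utt, W_f, b_f):
    """Fuse a syntactic token state (from SynGCN) with its utterance's
    discourse-augmented state (from RelGCN) via a linear layer, giving a
    discourse-aware token state z. Concatenation is our assumed input form."""
    return W_f @ np.concatenate([h_token, r_utt]) + b_f
```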

Pronoun Recovery Layer
This layer takes in the revised token representations Z_X and Z_C and attempts to find tokens in the context C that describe the referent of the dropped pronoun in the pro-drop utterance X with an attention mechanism. The referent representation is captured as the weighted sum of the discourse context-based token states:

r_{w_i} = Σ_j a_{i,j} z_{w_j},  z_{w_j} ∈ Z_C,

where a_{i,j} is the attention weight of token w_i over context token w_j. We then concatenate the referent representation r_{w_i} with the syntactic token representation h^{k+1}_{w_i} to predict the dropped pronoun category:

P(y_i | w_i, C) = softmax( W_o [r_{w_i} ; h^{k+1}_{w_i}] + b_o ).

The objective of dropped pronoun recovery is to minimize the cross-entropy between the predicted label distributions and the annotated labels over all sentences:

loss_dp = − Σ_{X ∈ Q} Σ_{i=1}^{l_i} δ(y_i | w_i, C) log P(y_i | w_i, C),

where Q represents all training instances, l_i represents the number of words in the pro-drop utterance, and δ(y_i | w_i, C) represents the annotated label of w_i.
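A minimal sketch of the recovery step; dot-product attention is an assumption here, as the paper does not spell out the attention form:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def recover_label_dist(z_w, Z_ctx, h_w, W_out, b_out):
    """Attend over discourse-aware context token states to build a
    referent representation, then classify the dropped-pronoun label."""
    attn = softmax(Z_ctx @ z_w)            # weights over context tokens
    referent = attn @ Z_ctx                # weighted sum: referent vector
    feats = np.concatenate([referent, h_w])
    return softmax(W_out @ feats + b_out)  # distribution over T labels
```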

Training Objective
We train DiscProReco by jointly optimizing the objectives of discourse relation prediction and dropped pronoun recovery. The total training objective is defined as:

loss = α · (loss_arc + loss_label) + β · loss_dp,   (1)

where α and β are the weights of the CDP and DPR objectives, respectively.
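Eq. (1) amounts to a straightforward weighted sum of the two task losses:

```python
def joint_loss(loss_arc, loss_label, loss_dp, alpha=1.0, beta=1.0):
    """Eq. (1): alpha weights the CDP objective (arc + relation-label
    cross-entropies), beta weights the DPR objective."""
    return alpha * (loss_arc + loss_label) + beta * loss_dp
```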

The SPDPR Dataset
To verify the effectiveness of DiscProReco, we need a conversational corpus annotated with both dropped pronouns and discourse relations. To our knowledge, no such publicly available corpus exists. We therefore constructed the first Structure Parsing-enhanced Dropped Pronoun Recovery (SPDPR) dataset by annotating discourse structure information on a popular dropped pronoun recovery dataset (i.e., Chinese SMS).
The Chinese SMS/Chat dataset consists of 684 multi-party chat files and is a popular benchmark for dropped pronoun recovery (Yang et al., 2015). In this study, we set the size of the context snippet to 8 utterances: the current pro-drop utterance plus the 5 utterances before and the 2 utterances after it. For discourse relation annotation, we asked three linguistic experts to independently choose a head utterance for the current utterance from its context and annotate the discourse relation between them according to a set of 8 pre-defined relations (see Appendix A). The inter-annotator agreement for discourse relation annotation is 0.8362, as measured by Fleiss' kappa. The resulting SPDPR dataset consists of 292,455 tokens and 40,280 utterances, averaging 4,949 utterance pairs per relation, with a minimum of 540 pairs for the least frequent relation and a maximum of 12,252 for the most frequent. The SPDPR dataset also annotates 31,591 dropped pronouns (excluding the "None" category).

Experimental Settings
In this work, 300-dimensional pre-trained embeddings (Li et al., 2018) were input to the BiGRU encoder, and 500-dimensional hidden states were used. For SynGCN and RelGCN, we set the number of GCN layers to 1 and 3 respectively, and apply a dropout rate of 0.5 to both.

Dropped Pronoun Recovery
Datasets and Evaluation Metrics We tested the performance of DiscProReco on DPR on three datasets: (1) the TC section of OntoNotes Release 5.0, a transcription of Chinese telephone conversations released in the CoNLL 2012 Shared Task.
(2) BaiduZhidao, a question answering corpus in which ten types of concrete pronouns were annotated according to pre-defined guidelines. These two benchmarks do not contain discourse structure information and are mainly used to evaluate the effectiveness of our model on the DPR task.
(3) The SPDPR dataset, which contains 684 conversation files annotated with both dropped pronouns and discourse relations. Following the practice in (Yang et al., 2015, 2019), we reserve the same 16.7% of the training instances as the development set, and a separate test set is used to evaluate the models. The statistics of the three datasets are shown in Appendix B. As in existing work (Yang et al., 2015, 2019), we use Precision (P), Recall (R), and F-score (F) as metrics when evaluating dropped pronoun recovery models.

Baselines We compared DiscProReco against existing baselines, including: (1) MEPR (Yang et al., 2015), which leverages a Maximum Entropy classifier to predict the type of dropped pronoun before each token; (2) NRM, which employs two MLPs to predict the position and type of a dropped pronoun separately; (3) Bi-GRU, which utilizes a bidirectional GRU to encode each token in a pro-drop sentence and then makes predictions; (4) NDPR (Yang et al., 2019), which models the referents of dropped pronouns in a large context with a structured attention mechanism; (5) Transformer-GCRF (Yang et al., 2020), which jointly recovers the dropped pronouns in a conversational snippet with general conditional random fields; (6) XLM-RoBERTa-NDPR, which utilizes the pre-trained multilingual masked language model (Conneau et al., 2020) to encode the pro-drop utterance and its context, and then employs the attention mechanism of NDPR to model the referent semantics.
We also compare two variants of DiscProReco: (1) DiscProReco (XLM-R-w/o RelGCN), which replaces the BiGRU encoder with the pre-trained XLM-RoBERTa model, removes the RelGCN layer, and only utilizes SynGCN to encode syntactic token representations for predicting the dropped pronouns.
(2) DiscProReco (XLM-R), which uses the pre-trained XLM-RoBERTa model as an encoder in place of the BiGRU network in our proposed model.

Experimental Results Table 1 reports the results of DiscProReco and the baseline methods on DPR. Note that for the baseline methods, we directly use the numbers originally reported in the corresponding papers. From the results, we observe that our variant DiscProReco (XLM-R-w/o RelGCN) outperforms existing baselines on all three datasets by all evaluation metrics, which proves the effectiveness of our system as a standalone model for recovering dropped pronouns. We attribute this to the ability of our model to consider long-distance syntactic dependencies between tokens in the same utterance. Note that results for the feature-based baseline MEPR (Yang et al., 2015) on OntoNotes and BaiduZhidao are not available because several essential features cannot be obtained. The full DiscProReco still significantly outperforms DiscProReco (XLM-R-w/o RelGCN), achieving 3.26%, 1.40%, and 1.70% absolute improvements in precision, recall, and F-score respectively on the SPDPR corpus. This shows that discourse relations between utterances are crucial for modeling the referents of dropped pronouns and for achieving better performance in dropped pronoun recovery, which is consistent with the observation in (Ghosal et al., 2019). The best results are achieved when our model uses the pre-trained XLM-RoBERTa (i.e., DiscProReco (XLM-R)). Note that discourse relations are not available for the OntoNotes and BaiduZhidao datasets, and thus we do not have joint learning results for these two datasets.

Error Analysis We further investigated typical mistakes made by DiscProReco on DPR. Resolving DPR requires effectively modeling the referent of each dropped pronoun from the context. As illustrated in Figure 3, both DiscProReco and NDPR model the referent from the context.
The former outperforms the latter because it considers the conversation structure: utterance B3 is a reply to A3 rather than an expansion of utterance B1. However, just modeling the referent from the context is insufficient.

[Table 2: Micro-averaged F-score (%) of conversational discourse parsing on two standard benchmarks.]

In Figure 3, the referent of the dropped pronoun
was correctly identified, but the dropped pronoun was mistakenly recovered as 他们 (they). This indicates that the model needs to be augmented with additional knowledge, such as the difference between singular and plural pronouns.

Conversational Discourse Parsing
Datasets and Evaluation Metrics We evaluated the effectiveness of DiscProReco on the CDP task on two datasets: (1) STAC, a standard benchmark for discourse parsing on multi-party dialogue (Asher and Lascarides, 2005). The dataset contains 1,173 dialogues, 12,867 EDUs, and 12,476 relations. As in existing studies, we set aside 10% of the training dialogues as validation data.
(2) SPDPR, constructed in this work, containing 684 dialogues and 39,596 annotated relations. Following (Shi and Huang, 2019), we use the micro-averaged F-score as the evaluation metric.
Baselines We compared DiscProReco with existing baseline methods: (1) MST (Afantenos et al., 2015): an approach that uses local information from two utterances to predict the discourse relation and uses the Maximum Spanning Tree (MST) algorithm to construct the discourse structure; (2) ILP (Perret et al., 2016): the same as MST except that the MST algorithm is replaced with Integer Linear Programming (ILP); (3) Deep+MST: a neural network that encodes discourse representations with a GRU and then uses MST to construct the discourse structure; (4) Deep+ILP: the same as Deep+MST except that the MST algorithm is replaced with ILP; (5) Deep+Greedy: similar to Deep+MST and Deep+ILP except that a greedy decoding algorithm selects the parent of each utterance; (6) Deep Sequential (Shi and Huang, 2019): a deep sequential neural network that predicts discourse relations using both local and global context. To explore the effectiveness of the joint learning scheme, we also compare DiscProReco with a variant, referred to as DiscProReco (w/o DPR), which predicts the discourse relations independently, without recovering the dropped pronouns.

Experimental Results We list the experimental results of our approach and the baselines in Table 2. For the STAC dataset, we report the original results from an existing paper (Shi and Huang, 2019) and apply DiscProReco to the corpus. For the SPDPR dataset, we ran the baseline methods with the same parameter settings. From the results we can see that the variant DiscProReco (w/o DPR) already outperforms the discourse parsing baselines. We attribute this to the effectiveness of the biaffine attention mechanism for dependency parsing (Yan et al., 2020; Ji et al., 2019). The full DiscProReco still significantly outperforms all compared models.
We attribute this to the joint training of the CDP and DPR tasks: the parameter sharing mechanism makes the two tasks benefit each other. Note that results for the joint model are not available on STAC because STAC is not annotated with dropped pronouns.

Interaction between DPR and CDP
We also conducted experiments on SPDPR to study the quantitative interaction between DPR and CDP. First, during training, we optimize DiscProReco with the objective function in Eq. (1) until the CDP task reaches a specific F-score (gradually increased from 30.64 to 50.38); we then fix the CDP components and continue to optimize the DPR components. This experiment explores the influence of the CDP task on the DPR task. Second, we vary the ratio between α and β in Eq. (1) from 0.25 to 1.25 and record the F-scores of DPR and CDP respectively. This experiment studies the interaction between the two tasks as their weights in the objective function change.
The results of these two experiments are shown in Figure 4. According to Figure 4(a), the performance of DPR improves on all evaluation metrics as the F-score of CDP increases, which indicates that exploring the discourse relations between utterances benefits dropped pronoun recovery. Figure 4(b) illustrates the performance of DPR and CDP as the ratio of α to β varies. The results show that the performance of CDP remains stable, while the performance of DPR increases at first and then decreases sharply as the ratio grows, indicating that DiscProReco should pay more attention to DPR during optimization.

Related Work
Dropped pronoun recovery is a critical technique that can benefit many downstream applications (Wang et al., 2016; Su et al., 2019). Yang et al. (2015) first proposed the task and utilized a Maximum Entropy classifier to recover the dropped pronouns in text messages. Giannella et al. (2017) further employed a linear-chain CRF to jointly predict the position and type of the dropped pronouns in a single utterance using hand-crafted features. Leveraging the semantic modeling capability of deep learning, later work such as Yang et al. (2019) introduced neural network methods that recover a dropped pronoun by modeling its semantics from the context. All these methods represent the utterances without considering the relationships between utterances, which are important for identifying referents. Zero pronoun resolution is a closely related line of research to DPR (Chen and Ng, 2016; Yin et al., 2017, 2018). The main difference is that DPR considers both anaphoric and non-anaphoric pronouns and does not attempt to resolve a pronoun to a referent. Existing discourse parsing methods first predict the probability of each discourse relation and then apply a decoding algorithm to construct the discourse structure (Muller et al., 2012; Li et al., 2014; Afantenos et al., 2015; Perret et al., 2016). A deep sequential model (Shi and Huang, 2019) was further presented to predict discourse dependencies using both the local information of two utterances and the global information of the already constructed discourse structure. All these methods perform relation prediction independently. In this work, by contrast, we explore the connection between CDP and DPR and make the two tasks mutually enhance each other.

Conclusion
This paper shows that dropped pronoun recovery and conversational discourse parsing are two strongly related tasks. To make them benefit from each other, we devise a novel framework called DiscProReco that tackles the two tasks simultaneously. The framework is trained in a joint learning paradigm, and the parameters for the two tasks are jointly optimized. To facilitate the study of the problem, we created a large-scale dataset called SPDPR that contains annotations of both dropped pronouns and discourse relations. Experimental results demonstrate that DiscProReco outperforms all baselines on both tasks.

Table 3: Discourse relations and their descriptions.

Different participant:
- Agreement: a participant provides a response to a previous request or suggestion
- Understanding: a participant indicates understanding of a previous utterance
- Directive: a participant asks another participant to do something
- Question: a general request directed at another participant
- Answer: a participant provides the information requested by another participant
- Feedback: a participant responds to another speaker's utterance

Same participant:
- Expansion: a participant provides an elaboration of a previous utterance
- Contingency: a participant continues to say something else

A Discourse Relations
A discourse relation describes how a participant may use an utterance to agree with, respond to, or indicate understanding of another utterance in the conversational context. Following (Xue et al., 2016), each utterance is assumed to relate to only one previous utterance. The relations comprise 2 types between same-participant utterance pairs and 6 types between different-participant utterance pairs, as summarized in Table 3.

B Statistics of DPR Datasets
The statistics of three dropped pronoun recovery benchmarks (i.e., SPDPR, TC section of OntoNotes and BaiduZhidao) are shown in Table 4.