Exophoric Pronoun Resolution in Dialogues with Topic Regularization

Resolving pronouns to their referents has long been studied as a fundamental natural language understanding problem. Previous works on pronoun coreference resolution (PCR) mostly focus on resolving pronouns to mentions in text while ignoring the exophoric scenario. Exophoric pronouns are common in daily communications, where speakers may directly use pronouns to refer to some objects present in the environment without introducing the objects first. Although such objects are not mentioned in the dialogue text, they can often be disambiguated by the general topics of the dialogue. Motivated by this, we propose to jointly leverage the local context and global topics of dialogues to solve the out-of-text PCR problem. Extensive experiments demonstrate the effectiveness of adding topic regularization for resolving exophoric pronouns.


Introduction
Grounding pronouns to objects they refer to is a challenging yet crucial natural language understanding problem. The coreference relationship between a pronoun and its referents is categorized into endophora and exophora based on whether the referred objects appear in text or out of text, and the former case can be further divided into anaphora if the referents appear in the preceding text of the pronoun and cataphora if in the following text (Halliday and Hasan, 1976;Brown and Yule, 1983). Conventional studies on the pronoun coreference resolution (PCR) task in the NLP community mainly focus on anaphora (Hobbs, 1978;NIST, 2003;Pradhan et al., 2012) and some recent work analyzes cataphora in machine translation (Wong et al., 2020), while mostly ignoring the exophoric pronouns. However, in daily dialogues or conversations, speakers may often use exophoric pronouns to refer to objects in the situational context that all speakers and listeners are aware of without introducing them in the first place.

Topic: cooking and eating
Cleaning service? I want to cook but the kitchen is in a mess. When is the house cleaning service scheduled?
It is scheduled at 8 p.m..
Well, I am starving right now. Could you order that like last Friday? This limits the use of current PCR models in many real-world dialogue/conversation scenarios, e.g., text interpretation (Hankamer and Sag, 1976;Yule, 1979) and downstream tasks such as dialogue generation (Kottur et al., 2018;Niu et al., 2019). Figure 1 shows an example of exophora. A person talks with his AI assistant (Siri/Alexa), "Could you order that like last Friday?" In this scenario, "that" is an exophoric pronoun whose referent can not be found in the dialogue text. A smart enough AI system should be able to resolve the pronoun "that" to some food rather than cleaning service based on the context. Such resolution of exophora is a crucial step in natural language understanding for the AI dialogue system to generate meaningful and relevant responses.
Since traditional PCR tasks only focus on endophoric pronouns while ignoring exophoric ones, all existing models struggle when the correct referent is not in the textual context of the target pronoun. For example, most of the human-defined rules (Hobbs, 1978) (e.g., "them" can only refer to plural objects) and features (Ng, 2005) (e.g., the distance between the target pronoun and candidate noun phrase) become either less effective or inapplicable in the exophoric setting. Unlike humandesigned patterns or feature-based methods, the end-to-end coreference models (Lee et al., 2018;Joshi et al., 2019) have the potential of resolving pronouns to external objects as long as the names of objects are provided as candidates. Nonetheless, these models heavily rely on the representation of local context produced by deep models so they always tend to resolve pronouns to the mentions in near text. As Figure 1 shows, the models could easily be distracted by the noun phrase "cleaning service" in text and resolve "that" to the service.
To address the limitations of current models, we propose to take the overall dialogue topics into consideration. For the example in Figure 1, we can judge from the whole dialogue that the topic is about cooking and eating, so it is likely that the person needs some food. If the AI system correctly resolves "that" to the topic-related out-of-text object "meal," this may help the AI assistant to finally give a reasonable response, "I will order the takeaway that you had last Friday." To quantitatively define and evaluate exophora resolution, we leverage the VisPro dataset (Yu et al., 2019), which annotates PCR information on visual dialogues. It is the only PCR dataset with annotations of out-of-text referents to the best of our knowledge. While the original dataset provides images alongside dialogues, we observe that humans can resolve 96% of exophoric pronouns in VisPro with only dialogue texts, which perfectly matches our research goal. Therefore, we perform out-of-text PCR experiments on the texts of VisPro.
In this paper, we define the out-of-text PCR task and present a model, which jointly leverages the local context and global topics to better resolve pronouns to out-of-text objects. The model first identifies the overall dialogue topics and then assign larger scores to objects which are more relevant to the topics. By doing so, it less overfits the local context and learns to resolve pronouns based on global topics. Experimental results prove that the proposed model can significantly boost the performance of resolving exophoric pronouns without sacrificing the performance on in-text PCR. We also conduct an extensive analysis to show the contribution of different components. The data, code, and models are available at: https://github. com/HKUST-KnowComp/Exo-PCR.

Related Works
Coreference resolution is the task of identifying coreference relations among different mentions. As a vital natural language understanding component, a good coreference system could benefit many downstream tasks such as machine translation (Guillou, 2012;Wong et al., 2020), dialog systems (Strube and Müller, 2003), question answering (Dasigi et al., 2019), and summarization (Steinberger et al., 2007). Due to the weak semantic meaning of pronouns (Ehrlich, 1981), grounding pronouns to their referents (PCR) has been specially studied as a more challenging task than the general coreference resolution (Mitkov, 1998;Ng, 2005).
Previous PCR studies (Ng, 2005; mostly focus on resolving pronouns to mentions in the near context. However, in informal text such as daily dialogues, it is common that pronouns may refer to out-of-text objects, which is crucial for dialogue understanding. Such pronouns have long been discussed as "pragmatically controlled anaphora" in linguistics (Hankamer and Sag, 1976;Yule, 1979;Brown and Yule, 1983), but there has been few discussion of exophoric pronouns in the NLP community. Hangyo et al. (2013) deal with exophora of zero pronouns, a special phenomenon in Japanese where an omitted argument of a predicate might refer to the "author" or the "reader" of the document. Aktacs et al. (2018) qualitatively analyze exophoric reference in twitter conversations, where the antecedent of a pronoun could appear in the attached media or the quoted tweet. Unlike previous works, we follow a more general linguistics definition of exohpora (Halliday and Hasan, 1976) and evaluate it quantitatively. One recent work (Yu et al., 2019) annotates a dataset VisPro containing in-text and out-of-text referents for pronouns in Visual Dialog (Das et al., 2017), and solve the PCR task by involving visual information. In this work, we propose to resolve exophora in Vis-Pro with texts as the only input. Our model jointly uses local context and global topic information for exophora resolution, which does not require the support of visual signals and thus can be applied to all scenarios.

The Task
In this section, we introduce details about the dataset construction and the task definition.

Dataset Setting
We construct the exophoric PCR dataset on top of VisPro (Yu et al., 2019), which is the only dataset that provides rich exophoric pronoun annotations to the best of our knowledge. Although the original re-  Figure 2: An example of the task. Pronouns are linked with their in-text and out-of-text referents. Exophoric pronouns, endophoric pronouns, and their referents are marked with different colors. The topic words are predicted by an LDA model. search focus of VisPro is to study the importance of visual information in resolving pronouns in visualrelated dialogues, we observe that in many cases, the dialogue text is enough for humans to make the correct resolution. Take Figure 2 as an example. In the dialogue text, the pronoun "he" is exophoric because the referred person is not mentioned explicitly in the dialogue. Even without the image, we can still guess that the dialogue is about a baseball game from clues like "base" and "batting glove," and thus the pronoun "he" is more likely to refer to "baseball player" rather than other candidates.
Quantitatively, we randomly select 100 exophoric pronouns in the development set of VisPro and find that 96% of them can be correctly resolved without the visual information. Therefore, VisPro can be used as a valid dataset for the exophoric pronoun resolution task. A more detailed analysis is provided in Appendix A.

Task definition
In this work, we focus on resolving pronouns to mentions inside dialogues and objects outside dialogues simultaneously.
Given a pronoun p in a dialogue D, we first select the noun phrases previous to p in dialogue as candidates M for in-text referents. For example, the noun phrases "base" and "a batting glove" are candidates of antecedents for "them" in Figure 2.
To provide candidates for out-of-text referents for each dialogue, (Yu et al., 2019) randomly selects 30 noun phrases from image captions. However, such a setting is impractical when no caption is available (details are discussed in Appendix A). As exophoric pronouns may refer to any object in daily life, we collect all the objects that frequently appear in the situational context of dialogues in VisPro to form an object pool O. The object pool contains 384 common object categories such as "hat" and "glove" shown in Figure 2. The details of the collection are described in Sec 5.4. The goal of the task is to identify the correct antecedents in M and the correct out-of-text objects from O by minimizing the loss: where L i is the loss function for the in-text coreference resolution and L o for the out-of-text resolution. We then define them following the coreferenc loss in the end-to-end in-text coreference models (Lee et al., 2018): in which F (·) is the coreference score of pronouns p with mentions m or objects o, and C m and C o denote the correct referents in M and O, respectively. For instance, for the pronoun "them" in Figure 2, the model is required to not only recognize its antecedent in text to be "a batting glove" but also link it to "glove" in the external object pool.

The Model
The goal of the coreference model is to provide the coreference score F (p, d) between a pronoun p and a candidate d, which can either be a mention m ∈ M or an external object o ∈ O. We divide the coreference score into three parts: the similarity score between p and d based on local context, the global topic relevance score of p, and that of d: . (3) Specifically, F l calculates the similarity between p and d via local context representations, while F g acquires the relevance score between each text span and the global topics.
To capture the topic information of the dialogues, we employ topic prediction as an auxiliary task of the model. The overall model architecture is shown in Figure 3 and details are as follows.

Local Similarity Score
Following (Joshi et al., 2019;Lee et al., 2018;Bahdanau et al., 2015), for each span s, which could be either p, m, or o and contains T words x 1 , x 2 , ..., x T , we first extract word embeddings from pre-trained language models as  Figure 3: There are three main components in the proposed model: local similarity score calculation, global relevance score calculation, and topic prediction. The local score module calculates the similarity between a pronoun p and a candidate span d based on their textual representation. The global score module measures their relevance with the global dialogue topic. To help the topic embedding capture the topic information better, the topic prediction module uses the dialogue embedding to fit the topic vector predicted by LDA as an auxiliary task.
{x 1 , x 2 , ..., x T }. Then, we represent each span with the combination of the embeddings of the first token (x 1 ), the last token (x T ), the weighted sum of embeddings of all tokens in it (x), and the length feature of the span (φ(s)): in whichx .
Here [·, ·] indicates the concatenation operation and NN the feed forward neural network. After acquiring the features of the spans, we then calculate the local similarity score between a pronoun p and a candidate span d as: where denotes the element-wise multiplication.

Global Relevance Score
Although the out-of-text referents of exophoric pronouns are not mentioned in the text, they can be inferred from the dialogue context. As the subject of dialogue context, the dialogue topics play a vital part in exophora resolution. For the daily dialogue example in Figure 1, we can infer from context words such as "cook," "kitchen," and "starving" that the dialogue topic is about cooking and eating, so the exophoric pronoun "that" is more likely to refer to "meal" rather than "cleaning service." Similarly, in the VisPro example in Figure 2, if we only read the sentence containing "he," it is hard to infer the targeting object of "he" to be a baseball player, a tennis player, or a football player. On the contrary, if we consider the whole dialogue as context, we can recognize the topic to be a baseball game, in which a man "wearing a batting glove" is "running towards" a "base." Therefore, we can judge that this man must be a baseball player rather than a football or tennis player, so the exophoric pronoun "he" refers to the out-of-text object "baseball player." Based on the above observations, we leverage the overall dialogue topic to help grounding pronouns to out-of-text objects. To effectively encode the topic information of the whole dialogue, we first obtain the overall embedding e D of a dialogue D with pre-trained language models. For LSTMbased models, we take the average embedding of all sentences as e D . For BERT-based models, we take the embedding of the special token [CLS]. Then we pass it through a feed forward neural network to obtain the dialogue topic embedding: After that, to indicate the relevance between a span s and the global topic of the dialogue, we calculate the topic relevance score as:  In the end, we calculate the final coreference scores of pronouns p with in-text mentions m and out-oftext objects o as: With global relevance scores, models trained with VisPro are able to resolve exophoric pronouns based on dialogue topics. In real-life scenarios such as Figure 1, the key for understanding exophora is also the relevance between out-of-text objects and dialogue context. Thus the ability to resolve exophora with dialogue topics can also be transferred to such realistic cases.

Topic Prediction as Regularization
To help the topic embedding e tp better represent the topic information of the dialogue, we propose to use topic prediction as an auxiliary task.
We first obtain the topic labels of dialogues by the most commonly used unsupervised topic model Latent Dirichlet Allocation (LDA) (Blei et al., 2001). The LDA model extracts n tp topics from dialogues in the training set and represents each topic as a list of words with a high probability to appear under the topic. Table 1 presents some topics of VisPro dialogues extracted by the LDA model. From the topic words, we can summarize that the No.15 topic is about cars in streets and that the No.25 topic discusses a kitchen. The topic label p D of a dialogue D can be defined as: where the j th dimension ofp D represents the probability of the dialogue corresponding to the No.j topic. For instance, the LDA model predicts that the dialogue in Figure 2 belongs to the No.23 topic in Table 1 with 60% possibility and thus the 23 th dimension ofp D is 0.6. Asp D sums up to 1 and each dialogue could associate with several topics, we fitp D by e tp with a L2 loss after a softmax function 1 : We use the topic prediction loss as a regularization term to the total loss: where L i and L o are defined in (2). As a result, the final loss function L can be optimized in an end-to-end manner.

The Experiment
In this section, we introduce the experiment details.

Dataset
We use VisPro (Yu et al., 2019) as the dataset, which contains 4,000 train, 500 development, and 500 test dialogues. The train, development, and test sets of VisPro contain 13,686, 1,726, and 1,781 pronouns with out-of-text referents and 13,986, 1,742, and 1,756 pronouns with in-text antecedents, respectively.

Evaluation Metrics
We use different metrics for in-text and out-of-text PCR due to the different numbers of candidates. For the in-text PCR, each pronoun has 10.3 candidates and 1.6 correct referents on average. Thus we follow the previous work (Yu et al., 2019) to employ Precision (P), Recall (R), and F1 score as the evaluation metrics. For the out-of-text PCR, as all 384 common object nouns are candidates and only one of them is correct, the F1 score is no longer suitable. For example, if the model predicts the correct answer to be the second place out of 384 candidates, it means that model can somehow understand the pronoun, while the F1 metric will count it as wrong. Therefore, we view out-of-text PCR as a ranking problem, where objects that a pronoun refers to should have a higher rank, and evaluate all models by the recall at 1, 5, and 10.

Baselines
We add our global relevance score module and topic prediction module on basis of the following  Figure 4: An example of the object category "baseball player." Each object category contains its synonyms, hypernyms, hyponyms, and corresponding noun phrases.
end-to-end coreference resolution models which only contains the local similarity score module 2 : • End-to-end model with LSTM based on ELMo embedding (Lee et al., 2018) , which extracts features by a BiLSTM upon ELMo embeddings.
• End-to-end model based on SpanBERT embedding (Joshi et al., 2020) , which can better represent text spans.

Implementation
Dataset Processing: To collect common object categories in VisPro, we first map 2,600 noun phrases annotated as out-of-text referents in VisPro to a compact list of 384 object categories by removing all modifiers and merging similar phrases. For instance, pronouns referring to "a male baseball player" or "a local baseball team" are both mapped to the object "baseball player." Moreover, some objects have similar or overlapping meanings with other objects (e.g., "pond" similar to "pool") but only one is labeled as the gold answer of a pronoun. It would be problematic if we directly label all others as wrong. To solve this problem, we use the synonyms, hypernyms, and hyponyms obtained from synset in Wordnet (Miller, 1995) in NLTK (Bird, 2006) as extra information attached to each object category. If a pronoun refers to a particular object in the external object pool, then the synonyms, hypernyms, and hyponyms of the targeting object are masked during the training and testing process. An example of an object category "baseball player" is shown in Figure 4. Note that other person categories which are not a synonym, hypernym or hyponym of "baseball player", such as "tennis player" and "football player", are not masked. Last but not least, we split the pronouns with out-of-text referents by whether a pronoun simultaneously refers to some mentions in the dialogue. If a pronoun has both in-text and out-of-text referents, such as "them" in Figure 1, which refers to "a batting glove" in the dialogue as well as "glove" in the object pool, we denote it as "Discussed" in the dialogue. If a pronoun only has out-of-text referents, such as "he" in Figure 1, which only refers to the object "baseball player," we denote it as "Not Discussed" in the dialogue. While "Not Discussed" pronouns strictly match the definition of exophora, grounding the "Discussed" pronoun to out-of-text objects is also an important step towards linking dialogue text to the environment. In VisPro, 25.02% of all pronouns with out-of-text referents are "not discussed." Training Details: We follow the hyperparameters set in (Joshi et al., 2019). The number of topics n tp is set to 40 for LDA. The topic prediction module in the model contains one hidden layer of size 1,000. Gold mentions are provided for training and testing of the models. During testing, the in-text antecedents are chosen in the same way as (Lee et al., 2018). For the out-of-text part, objects o with scores F (p, o) > 0 are deemed as the prediction of out-of-text referents for the pronoun p and the selected objects are ranked according to the scores. Models are trained for ten epochs, and the best ones are selected based on their performance on the development set.

The Results
From the experimental results in Table 2, we can observe that BERT and SpanBERT based models outperform ELMo-LSTM based models, which is consistent with the observation in (Joshi et al., 2019) mainly because of their stronger context representation ability. On top of them, incorporating global topics improves recall for both exophoric and endophoric pronouns. Last but not least, for in-text PCR, adding topic information only slightly influences the precision while significantly improving the recall. As a result, it also achieves better overall F1 performance.
Further analyzing the performances of models on out-of-text PCR, we observe that the "Not Dis-   Table 3: Recall of "Not Discussed" pronouns in the test set referring to "Infrequent" and "Frequent" objects.
cussed" pronouns are more challenging than the "Discussed" group for all models. This makes sense because if a pronoun refers to some noun phrases in text, the embedding of the pronoun will encode the information of those noun phrases via the language models. For instance, if the representation of "them" in Figure 2 encodes the context "a batting glove," it would be easier to identify the semantically related object "glove" as the out-of-text referent. In contrast, "Not Discussed" pronouns do not have any noun phrase antecedent in the dialogue and are thus more challenging. In such cases, the effect of incorporating global semantics becomes more significant than in "Discussed" cases. In the rest of this section, we present a detailed analysis with the BERT-base + topic model, which achieves the highest performance on "Not Discussed" pronouns and comparable performances on other settings, to show when our model performs well and when it fails.

Influence of Frequency
In the external object pool, the appearances of different objects varies. For instance, "man" appears 3,084 times in the training set, while "monkey" only appears once. To investigate the influence of such imbalance, we split the object list by their occurrence frequency, with occurrence less than 50 times as "Infrequent" objects, which make up 85.1% of list, and the rest as "Frequent" objects. As observed in Table 3, performances on infrequent objects are much lower than frequent ones, which indicates that although the models achieve high scores on frequent objects, they still fail to do well on the majority of relatively rare objects. This observation also shows that the exophoric PCR problem is still far from being solved. Compared to models focusing on local information, the proposed model, which incorporates the overall topics, boosts the performance by a large margin, especially on infrequent pronouns.

Influence of Object Categories
Besides the influence of frequency, we are also interested in how well our model can perform on different object categories. We record the performance of pronouns related to the four most common categories 3 (person, animal, vehicle, and food) in Figure 5, from which we can see that pronouns related to "person" and "animal" are most common and easiest to be resolved, which is consistent with our previous observation that our model performs better on frequent objects than on infrequent ones.

Ablation Study
We present the ablation study in Table 4, from which we can see that all components contribute to the ultimate success. For example, performance drops when removing the topic prediction loss as regularization, which indicates that the topic prediction module can help the embedding of the dialogue to capture the topic information better. Besides that, if we do not mask out the synonyms, hypernyms, and hyponyms of the object categories during training, the performance drops dramatically. It shows the importance of masking possible distractions to provide unique labels during training. Last but not least, one contribution of the proposed model is the joint training of both the in-text and out-of-text PCR and, the results show that removing either of them in the training process will result in a performance drop on both tasks. Similar improvement by joint training is also observed in (Bai et al., 2021), where the in-text PCR task is jointly trained with the character linking task that links the endophoric pronouns in TV show scripts to the characters.  Figure 6: Case study for out-of-text PCR. Target pronouns and correct out-of-text objects with their hints are marked in different colors. Note that we only show the corresponding images here for clarity and that they are not provided to the models.

BERT-base VS BERT-large
tion consistently improves performance on outof-text PCR for all models while achieving comparable scores on the in-text one. Besides, we surprisingly find out that compared to BERT-base and SpanBERT-base, even though BERT-large and SpanBERT-large achieve higher scores on in-text PCR, their performance on the out-of-text PCR slightly drops. An explanation is that they may easily overfit the local context and ignore the global topic information due to their deep model. Figure 6 shows a dialogue about a male surfer. The referents of the pronoun "he" is "Not Discussed" in the dialogue text. The model that can only access the local context cannot identify any object related to the pronoun. In contrast, the model with topic prediction assigns a high probability of 74.3% for the topic of the dialogue to be surfing judging from the related words such as "wave" and "board." Thus it identifies "surfer" as the out-of-text referent for the pronoun. More cases are shown in Appendix B.

Error Analysis
We first quantitatively study the error types of the BERT-base + topic model by randomly selecting 60 mistaken predictions in out-of-text PCR, including 30 cases for the "Not Discussed" pronouns and 30 for the "Discussed" ones. We observe that 1/3 of the cases are also difficult for humans to identify the correct objects without access to the correspond- ing images. This is either because the dialogue text does not contain enough clues to infer the right answer, or multiple answers are reasonable but only one is annotated. For the other 2/3 cases, Figure 7 shows that more than half of errors are still from overfitting to local context and 10% from failure to use the topic information. Other 23% errors come from failure to associate pronouns with infrequent objects as discussed in Section 6.1 and the rest 13% are due to the lack of required knowledge. Error analysis demonstrates that the model can be further improved by avoiding overfitting to the local context and incorporating explicit knowledge. Some erroneous cases are provided in Appendix C.

Conclusion
In this paper, we focus on grounding pronouns in dialogues to out-of-text objects. We propose to incorporate the topics of the dialogues to help the PCR model identify the out-of-text referents better. Experiments show that the proposed model outperforms previous models on both in-text and out-of-text PCR tasks. Detailed analysis is presented to show the strength and limitations of the proposed model. While this work is a first step to explore exophora resolution on one dataset, future work may explore exophora resolution in different domains such as AI chat-bots for home assistants.

A Task Definition Compared to Prior Works
Our experiments are based on the dataset Vis-Pro (Yu et al., 2019), which provides annotation of referents for pronouns in dialogues of the Visual Dialog dataset (Das et al., 2017). Figure 8 illustrates the different settings of our work compared to prior works.
In the original setting of Visual Dialog dataset (Figure 8(a)), each dialogue happens between two people chatting about an image, and each image is accompanied by a descriptive caption. Speaker A only has access to the caption and attempts to imagine the image by asking questions, while speaker B can access both the image and the caption and answers the questions. Thus the pronouns in the dialogues refer to either mention in the dialogue text or noun phrase in captions.
In the setting of VisPro (Figure 8(b)), to simulate the scenario where people use pronouns to directly refer to objects in the environment, the captions are separated from the dialogues. As the captions are descriptions of images, the mentions in captions must correspond to some objects in the images. Thus, when captions are no longer available, the pronouns that refer to noun phrases in captions can be deemed as referring to objects in the images.
Although VisPro first proposed the scenario where pronouns refer to out-of-text objects, it focused on visual-related cases and did not associate such cases with the general definition of exophora. Furthermore, in the definition of the visual pronoun coreference resolution task that Yu et al. (2019) proposed, the candidates of the out-of-text objects are 30 noun phrases randomly selected from captions. This small set of objects contains noun phrases from the corresponding caption as well as captions of other images to provide negative samples. However, such a setting is not so practical. For one thing, the out-of-text candidates are not fixed among different dialogues and the choices for negative samples are random, which makes it hard to compare between multiple dialogues. For another, such noun phrases are hard to obtain in practical cases where no caption for the environment is available, so the model trained under this task cannot be applied to dialogues outside the dataset.
Based on the annotation of VisPro, we design a more practical experiment setting (Figure 8(c)). First, we assume that the visual background of dialogues is not always available, and thus aim to resolve exophoric pronouns based on only the dialogue text. Second, since exophoric pronouns might refer to any object in daily life, we collect all the common objects in VisPro to form a candidate pool of 384 object categories. Since the candidate pool is fixed for all dialogues, we can reasonably compare the performance between different dialogues and models. The model trained under our setting can thus be applied to real-life dialogues.

B Case Study for Out-of-Text PCR
We randomly select some cases from the test split of VisPro and present them in Figure 9. Cases (a)-(d) are "Not Discussed" pronouns which only have out-of-text referents, and cases (e)(f) are "Discussed" pronouns which have both out-of-text and in-text referents. For the "Discussed" pronouns in (e)(f), even though the referred objects are mentioned in the text, the BERT-base model still overfits to distracting words and gives the false prediction "person." On the contrary, our model leverages the topic information and predicts the correct objects. Figure 10 presents some typical erroneous cases. In Figure 10(a), the model predicts that "they" refers to "person" instead of "sheep," which hits three error types. First, the topic model correctly infers that the dialogue is about some animals on grass but the coreference model ignores this information. Second, based on the word "sheared" and knowledge that sheep need to be sheared, humans can infer that the pronoun refers to "sheep." However, the model fails to learn such knowledge from the pre-training of the language model. Last, the prediction of "person" indicates that it overfits to the word "people" in dialogue text even though it says that there are 0 people. Figure 10(b) shows another case where the model fails to recall the knowledge that only a person could wear a ring or a watch and thus fail to infer that "he" refers to a person.