Joint Coreference Resolution and Character Linking for Multiparty Conversation

Character linking, the task of linking mentioned people in conversations to real-world characters, is crucial for understanding conversations. For efficiency of communication, humans often use pronouns (e.g., "she") or personal nouns (e.g., "that girl") rather than named entities (e.g., "Rachel") in spoken language, which makes linking those mentions to real people much more challenging than a regular entity linking task. To address this challenge, we propose to incorporate the richer context provided by the coreference relations among different mentions to help the linking. On the other hand, considering that finding coreference clusters is itself not a trivial task and could benefit from global character information, we propose to solve the two tasks jointly. Specifically, we propose C^2, a joint learning model of Coreference resolution and Character linking. The experimental results demonstrate that C^2 significantly outperforms previous works on both tasks. Further analyses are conducted to examine the contribution of each module in the proposed model and the effect of the hyper-parameters.


Introduction
Understanding conversations has long been one of the ultimate goals of the natural language processing community, and a critical step towards that goal is grounding all mentioned people to the real world. If we can achieve that, we can leverage our knowledge about these people (e.g., things that happened to them before) to better understand the conversation. On the other hand, we can also aggregate the conversation information back into our understanding of these people, which can be used for understanding future conversations that involve the same people. To simulate real conversations and investigate the possibility for models to ground mentioned people, the character linking task was proposed (Chen and Choi, 2016). Specifically, it uses the transcripts of TV shows (i.e., Friends) as the conversations and asks the models to ground all person mentions to characters.

Figure 1: The composition of the mentions in conversations for character grounding: 72% pronouns, 16% personal nouns, and 12% named entities. Over 88% of the mentions are not named entities, which brings exceptional challenges when linking them to character entities.
Even though character linking can be viewed as a special case of entity linking, it is more challenging than the ordinary entity linking task for several reasons. First, ordinary entity linking typically aims at linking named entities to external knowledge bases such as Wikipedia, where rich information (e.g., definitions) is available. For character linking, however, we do not have the support of such a rich knowledge base; all we have are the names of the characters and simple properties about them (e.g., gender). Second, the mentions in ordinary entity linking are mostly concepts and named entities, not pronouns. However, as shown in Figure 1, 88% of the character mentions are pronouns (e.g., "he") or personal nouns (e.g., "that guy"), while only 12% are named entities.
Considering that pronouns have relatively weak semantics by themselves, to effectively ground mentions to the correct characters, we need to fully utilize the context of the whole conversation rather than just the local context in which they appear. One potential solution is to use the coreference relations among different mentions as a bridge to connect this richer context.

Figure 2: An example conversation about Paul the Wine Guy (Monica: "There's nothing to tell! He's just some guy I work with!" Joey: "C'mon, you're going out with the guy! There's gotta be something wrong with him!" Ross: "All right Joey, be nice. So does he have a hump? A hump and a hairpiece?"). Coreference clusters can help to connect the whole conversation to provide a richer context for each mention so that we can better link them to Paul. Meanwhile, the character Paul can also provide global information to help resolve the coreference.

An example is shown in Figure 2. It is difficult to directly link the highlighted mentions to the character Paul based on their local context alone, because the local context of each mention provides only a single piece of information about its referent, e.g., "the person is male" or "the person works with Monica." Given the coreference cluster, the mentions refer to the same person, and these pieces of information can be put together to jointly determine the referent. As a result, it is easier for a model to do character linking with resolved coreference. Similar observations are made in (Chen et al., 2017).
At the same time, we also notice that coreference resolution, especially for pronouns, is not trivial either. As shown by recent literature on the coreference resolution task (Kantor and Globerson, 2019), the task is still challenging for current models, and the key challenge is how to utilize global information about entities, which is exactly what a character linking model can provide. For example, in Figure 2, it is difficult for a coreference model to correctly resolve the last mention "he" in the utterance given by Ross based on its local context, because another major male character (Joey) has joined the conversation, which can distract and mislead the coreference model. However, if the model knows that the mention "he" links to the character Paul and that Paul works with Monica, it is easier to resolve "he" to "some guy" that Monica works with.
Motivated by these observations, we propose to jointly train the coreference resolution and character linking tasks, and we name the joint model C^2. C^2 adopts a transformer-based text encoder and includes a mention-level self-attention (MLSA) module that enables mention-level contextualization. Meanwhile, a joint loss function is designed so that both tasks can be optimized together. The experimental results demonstrate that C^2 significantly outperforms all previous work on both tasks. Specifically, C^2 improves the performance by 15% and 26% on the coreference resolution and character linking tasks, respectively, compared to the previous state-of-the-art model ACNN (Zhou and Choi, 2018). Further ablation and hyper-parameter studies verify the effectiveness of the different components of C^2 and the effect of the hyper-parameters. Our code is available at https://github.com/HKUST-KnowComp/C2.

Problem Formulations and Notations
We first introduce the coreference resolution and character linking tasks as well as the notation used. We are given a conversation, which contains multiple utterances and n character mentions c_1, c_2, ..., c_n, and a pre-defined character set Z, which contains m characters z_1, z_2, ..., z_m. The coreference resolution task is to group all mentions into clusters such that all mentions in the same cluster refer to the same character. The character linking task is to link each mention to its corresponding character.
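As a toy illustration of the two task outputs (the mention strings and character names below are invented for illustration and are not actual dataset fields): a coreference prediction is a partition of the mention indices into clusters, a linking prediction is a map from mentions into Z, and when both are correct, the linking is constant within each cluster.

```python
mentions = ["He", "the guy", "him", "he"]        # c_1 ... c_n, in order of appearance
characters = ["Paul", "Joey", "Ross", "Monica"]  # the pre-defined character set Z

# Coreference resolution: partition the mention indices into clusters.
coref_clusters = [{0, 1, 2, 3}]                  # here, all four refer to the same person

# Character linking: map each mention index to a character in Z.
linking = {i: "Paul" for i in range(len(mentions))}

# Consistency between the two tasks: mentions in one cluster share one character.
consistent = all(
    len({linking[i] for i in cluster}) == 1 for cluster in coref_clusters
)
print(consistent)  # -> True
```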

Model
In this section, we introduce the proposed C^2 framework, which is illustrated in Figure 3. With the conversation and all mentions as input, we first encode them with a shared mention representation encoder, which consists of a pre-trained transformer text encoder and a mention-level self-attention (MLSA) module. After that, we make predictions for both tasks via two separate modules. Finally, a joint loss function is devised so that the model can be trained effectively on both tasks simultaneously. Details are as follows.

Mention Representation
We use pre-trained language models (Devlin et al., 2018; Joshi et al., 2019a) to obtain the contextualized representations for mentions. As speaker information is critical for conversation understanding, we also include it by appending a speaker embedding to each mention. As a result, the initial representation of mention i is

g_i^{(0)} = [t_i^{start}; t_i^{end}; e_i^{speaker}],

where t_i^{start} and t_i^{end} are the contextualized representations of the beginning and end tokens of mention i, and e_i^{speaker} is the speaker embedding of the current speaker. Here, we omit the embeddings of inner tokens because their semantics have already been effectively encoded by the language model. The speaker embeddings are randomly initialized before training.

Sometimes the local context of a mention is not enough to make reasonable predictions, and it is observed that co-occurring mentions can provide document-level context information. To refine the mention representations given the presence of other mentions in the document, we introduce the Mention-Level Self-Attention (MLSA) layer, which consists of n layers of the transformer encoder structure (Vaswani et al., 2017) and is denoted as T. Formally, this iterative mention refinement process can be described by

(g_1^{(i+1)}, ..., g_k^{(i+1)}) = T(g_1^{(i)}, ..., g_k^{(i)}),

where k is the number of mentions in a document and g^{(i)} denotes the mention representations from the i-th layer of MLSA.
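A minimal numpy sketch of the mention representation and one MLSA refinement step may make the data flow concrete. All dimensions are hypothetical, and the layer below is a simplification (single attention head, residual connection, no feed-forward sub-layer or layer norm), not the exact transformer encoder layer used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical dimensions: d_tok for token vectors, d_spk for speaker embeddings.
d_tok, d_spk, k = 8, 4, 5          # k mentions in the document
d = 2 * d_tok + d_spk              # mention vector = [start; end; speaker]

# g^(0)_i = [t_i^start ; t_i^end ; e_i^speaker]
t_start = rng.normal(size=(k, d_tok))
t_end = rng.normal(size=(k, d_tok))
e_speaker = rng.normal(size=(k, d_spk))
g = np.concatenate([t_start, t_end, e_speaker], axis=-1)   # shape (k, d)

def mlsa_layer(g, Wq, Wk_, Wv):
    """One mention-level self-attention step: every mention attends to all
    mentions that co-occur in the same document (simplified transformer layer)."""
    q, k_, v = g @ Wq, g @ Wk_, g @ Wv
    attn = softmax(q @ k_.T / np.sqrt(g.shape[-1]), axis=-1)
    return g + attn @ v                                     # residual update

Wq, Wk_, Wv = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))
for _ in range(2):                  # n = 2 refinement layers, as in the paper
    g = mlsa_layer(g, Wq, Wk_, Wv)

print(g.shape)  # -> (5, 20)
```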

Coreference Resolution
Following previous work (Joshi et al., 2019a), we model coreference resolution as an antecedent-finding problem. For each mention, we aim to find one of the previous mentions that refers to the same person; if no such previous mention exists, the mention should be linked to the dummy antecedent ε. Thus the goal of a coreference model is to learn a distribution P(y_i) over the candidate antecedents Y(i) = {ε, 1, ..., i-1} of each mention i:

P(y_i = j) = exp(s(i, j)) / Σ_{j' ∈ Y(i)} exp(s(i, j')),

where s(i, j) is the score for the antecedent assignment of mention i to j. The score s(i, j) contains two parts: (1) the pairwise plausibility score of the two mentions, s_a(i, j), and (2) the mention scores s_m(i) and s_m(j), which measure the plausibility of each span being a proper mention. Formally, s(i, j) can be expressed as

s(i, j) = s_m(i) + s_m(j) + s_a(i, j),
s_m(i) = FFNN_m(g_i^{(n)}),
s_a(i, j) = FFNN_a([g_i^{(n)}; g_j^{(n)}]),

where g^{(n)} is the last-layer mention representation produced by the MLSA and FFNN denotes a feed-forward neural network.
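A small numpy sketch of the antecedent-finding step, with the dummy antecedent ε conventionally given a fixed score of 0 (as in standard span-ranking coreference models). All parameter shapes are hypothetical; the real model uses the MLSA outputs rather than random vectors.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def ffnn(x, W1, b1, W2, b2):
    return np.maximum(x @ W1 + b1, 0) @ W2 + b2      # 2-layer FFNN with ReLU

d, k = 6, 4                        # hypothetical sizes: d mention dims, k mentions
g = rng.normal(size=(k, d))        # stand-in for final mention representations g^(n)

# Hypothetical parameters for s_m (mention score) and s_a (pairwise score).
Wm1, bm1, Wm2, bm2 = rng.normal(size=(d, d)), np.zeros(d), rng.normal(size=(d, 1)), np.zeros(1)
Wa1, ba1, Wa2, ba2 = rng.normal(size=(2 * d, d)), np.zeros(d), rng.normal(size=(d, 1)), np.zeros(1)

def s_m(i):
    return ffnn(g[i], Wm1, bm1, Wm2, bm2)[0]

def s_a(i, j):
    return ffnn(np.concatenate([g[i], g[j]]), Wa1, ba1, Wa2, ba2)[0]

def antecedent_distribution(i):
    """P(y_i) over {eps, 0, ..., i-1}; the dummy antecedent eps scores 0."""
    scores = [0.0] + [s_m(i) + s_m(j) + s_a(i, j) for j in range(i)]
    return softmax(np.array(scores))

p = antecedent_distribution(3)     # mention 3 chooses among eps and mentions 0..2
```

Training then pushes probability mass onto the gold antecedents of each mention, which is exactly the coreference term of the joint loss described below.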

Character Linking
Character linking is formulated as a multi-class classification problem, following previous work (Zhou and Choi, 2018). Given the mention representation g_i^{(n)}, the linking can be done with a simple feed-forward network, denoted as FFNN_l(·). Specifically, the probability that character entity z is linked with a given mention i can be calculated by

P(z | g_i^{(n)}) = softmax(FFNN_l(g_i^{(n)}))_z,

where the notation (·)_z represents the z-th component of a given vector.
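The linking head is just a classifier over the m characters in Z; a hedged numpy sketch (hypothetical sizes and randomly initialized parameters, standing in for the trained FFNN_l):

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d, m = 6, 7                          # hypothetical: d mention dims, m characters in Z
g_i = rng.normal(size=d)             # stand-in for the final mention representation g_i^(n)

# Hypothetical 2-layer FFNN_l mapping a mention vector to |Z| logits.
W1, b1 = rng.normal(size=(d, d)), np.zeros(d)
W2, b2 = rng.normal(size=(d, m)), np.zeros(m)

logits = np.maximum(g_i @ W1 + b1, 0) @ W2 + b2
p = softmax(logits)                  # P(z | g_i^(n)) for every character z in Z
predicted_character = int(np.argmax(p))
```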

Joint Learning
To jointly optimize coreference resolution and character linking, we design a joint loss over both tasks. For coreference resolution, given the gold clusters, we minimize the negative log-likelihood of the probability that each mention is linked to one of its gold antecedents. The coreference loss L_c is then

L_c = -log Π_{i=1}^{n} Σ_{ŷ ∈ Y(i) ∩ GOLD(i)} P(ŷ),

where GOLD(i) denotes the gold coreference cluster that mention i belongs to. Similarly, for character linking, we minimize the negative log-likelihood of each mention being linked to its correct referent character z_i^*:

L_l = -Σ_{i=1}^{n} log P(z_i^* | g_i^{(n)}).

Finally, the joint loss is the arithmetic average of the coreference loss and the linking loss:

L = (L_c + L_l) / 2.
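The joint objective can be computed as below on toy distributions (invented numbers standing in for model outputs, just to show how the two terms combine):

```python
import numpy as np

# Coreference: for mention i, p_ante[i][j] = P(y_i = j) over its candidate
# antecedents; gold_mask[i] marks which candidates are in GOLD(i).
p_ante = [np.array([0.1, 0.6, 0.3]), np.array([0.2, 0.8])]
gold_mask = [np.array([0, 1, 1]), np.array([1, 0])]

# Linking: p_link[i] = P(z | g_i^(n)) over characters; gold_char[i] is the referent.
p_link = [np.array([0.7, 0.2, 0.1]), np.array([0.1, 0.1, 0.8])]
gold_char = [0, 2]

# L_c: negative log of the probability mass placed on gold antecedents.
L_c = -sum(np.log((p * m).sum()) for p, m in zip(p_ante, gold_mask))
# L_l: negative log-likelihood of the correct character for each mention.
L_l = -sum(np.log(p[z]) for p, z in zip(p_link, gold_char))
# Joint loss: arithmetic average of the two.
L = (L_c + L_l) / 2
print(round(float(L), 4))  # -> 1.1473
```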

Experiments
In this section, we introduce the experimental details to demonstrate the effectiveness of C^2.

Data Description
We use the latest release of the character identification dataset (v2.0) as the experimental dataset, and we follow the standard training, development, and test split provided with it. In the dataset, all mentions are annotated with their referent global entities. For example, in Figure 4, the mention "I" is assigned to ROSS, and the mentions "mom" and "dad" are assigned to JUDY and JACK respectively in the first utterance given by Ross.

Baseline Methods
The effectiveness of the joint learning model is evaluated on both the coreference resolution and character linking tasks. To fairly compare with existing models, only the singular mentions are used following the singular-only setting (S-only) in the previous work (Zhou and Choi, 2018). For the coreference resolution task, we compare with the following methods.
• ACNN: A CNN-based coreference resolution model (Zhou and Choi, 2018) that can also produce mention and mention-cluster embeddings at the same time.
• CorefQA: An approach that reformulates the coreference resolution problem as a question answering problem (Wu et al., 2020) and benefits from text encoders fine-tuned on question answering data.

For the character linking task, we also include ACNN as a baseline method. Considering that existing general entity linking models (Kolitsas et al., 2018; van Hulst et al., 2020; Raiman and Raiman, 2018; Onando Mulang et al., 2020) cannot be applied to the character linking problem because they are not designed to handle pronouns, we propose a text-span classification model with a transformer encoder as another strong baseline for the character linking task.
• ACNN: A model that uses the mention and mention-cluster embeddings as input to do character linking (Zhou and Choi, 2018).
• BERT/SpanBERT: A text-span classification model consisting of a transformer text encoder followed by a feed-forward network.

Evaluation Metrics
We follow the previous work (Zhou and Choi, 2018) for the evaluation metrics. Specifically, for coreference resolution, three evaluation metrics are used: B^3, CEAF_φ4, and BLANC, all of which evaluate the output coreference clusters against the gold clusters. The first two are from the CoNLL-2012 shared task (Pradhan et al., 2012); following Zhou and Choi (2018), we use BLANC (Recasens and Hovy, 2011) in place of MUC (Vilain et al., 1995) because BLANC takes singletons into consideration while MUC does not. For the character linking task, we use the micro and macro F1 scores to evaluate the multi-class classification performance.
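To make the coreference metrics concrete, here is a minimal sketch of B^3, which scores each mention by the overlap between its system cluster and its gold cluster (this is an illustrative implementation, not the official scorer; it assumes the system clusters cover the same mentions as the gold ones):

```python
def b_cubed(gold_clusters, sys_clusters):
    """B^3 precision/recall/F1 over mention clusterings.
    Each clustering is a list of sets of mention ids; singletons count too."""
    gold_of = {m: c for c in gold_clusters for m in c}
    sys_of = {m: c for c in sys_clusters for m in c}
    mentions = list(gold_of)
    prec = sum(len(gold_of[m] & sys_of[m]) / len(sys_of[m]) for m in mentions) / len(mentions)
    rec = sum(len(gold_of[m] & sys_of[m]) / len(gold_of[m]) for m in mentions) / len(mentions)
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

gold = [{1, 2, 3}, {4}]            # gold: one 3-mention cluster plus a singleton
system = [{1, 2}, {3, 4}]          # system: mention 3 and 4 wrongly merged
p, r, f = b_cubed(gold, system)
print(p, r)  # -> 0.75 0.6666666666666666
```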

Implementation Details
In our experiments, we consider four different pre-trained language encoders: BERT-Base, BERT-Large, SpanBERT-Base, and SpanBERT-Large, and we use n = 2 layers of mention-level self-attention (MLSA). The feed-forward networks are implemented as two fully connected layers with ReLU activations. Following the previous work (Zhou and Choi, 2018), the scene-level setting is used, where each scene is regarded as a document for coreference resolution and linking. During training, each mini-batch consists of segments obtained from a single document. The joint learning model is optimized with the Adam optimizer (Kingma and Ba, 2015) with an initial learning rate of 3e-5 and a warm-up rate of 10%. The model is trained for up to 100 epochs with early stopping. All experiments are repeated three times, and the average results are reported.
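One common reading of "warm-up rate of 10%" is a linear warm-up over the first 10% of training steps followed by linear decay to zero; the exact schedule used here is an assumption, but a sketch of that variant is:

```python
def lr_at_step(step, total_steps, base_lr=3e-5, warmup_frac=0.10):
    """Linear warm-up for the first `warmup_frac` of steps, then linear
    decay to zero. Hypothetical schedule; shown only for illustration."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

total = 1000
schedule = [lr_at_step(s, total) for s in range(total)]
```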

Results and Analysis
In this section, we discuss the experimental results and present a detailed analysis.

Coreference Resolution Results
The performance of the coreference resolution models is shown in Table 2. C^2 with SpanBERT-Large achieves the best performance on all evaluation metrics. Compared to the baseline ACNN model, which uses hand-crafted features, C^2 uses a transformer to better encode the contextual information.
Besides that, even though ACNN connects the coreference resolution and character linking tasks in a pipeline and uses the coreference resolution result to help character linking, the character linking result cannot be used to help resolve coreference clusters. In contrast, we treat the two tasks jointly so that they can help each other. CorefQA is currently the best-performing general coreference resolution model on the OntoNotes dataset (Pradhan et al., 2012). However, its performance is limited on the conversation dataset, because the question answering data used to fine-tune its encoder (SQuAD 2.0 (Rajpurkar et al., 2018)) is more similar to OntoNotes than to the multiparty conversation dataset used here, which is typically much more informal. As a result, such fine-tuning only helps on OntoNotes.
The coarse-to-fine (C2F) model (Joshi et al., 2019b) with a transformer encoder was the previous state-of-the-art model on OntoNotes. As shown in Table 2, given the same text encoder, the proposed C^2 model consistently outperforms the C2F model. These results further demonstrate that, with the help of the proposed joint learning framework, the out-of-context character information helps to achieve better mention representations, so that the coreference model can resolve them more easily.

Character Linking Results
As shown in Table 3, the proposed joint learning model also achieves the best performance on the character linking task, and there are mainly two reasons for that. First, the contextualized mention representations obtained from pre-trained language encoders encode the context information better than the representations used in ACNN. Second, with the help of coreference clusters, richer context about the whole conversation is encoded for each mention. For example, when using the same pre-trained language model as the encoder, C^2 always outperforms the baseline classification model. These empirical results confirm that, although BERT and SpanBERT can produce very good vector representations for the mentions based on the local context, the coreference clusters can still provide useful document-level contextual information for linking them to a global character entity.

The Number of MLSA Layers
Another contribution of the proposed C^2 model is the mention-level self-attention (MLSA) module, which iteratively refines the mention representations according to the other mentions that co-occur within the same document. In this section, to show its effect and the influence of the number of iteration layers, we try different numbers of layers and show the performance on the test set in Figure 5. We conducted the experiments with the SpanBERT-Base encoder, with all other hyper-parameters kept the same. The x-axis is the number of layers, and the y-axes are the F1 scores of B^3, CEAF, and BLANC for coreference resolution, and the macro and micro F1 scores for character linking. From the results, we can see that as the number of layers increases from zero to five, the F1 scores on both tasks gradually increase. This trend demonstrates that the model performs better on both tasks with more layers. Meanwhile, the marginal improvement of each additional MLSA layer decreases, which indicates that adding too many layers may not further improve the performance because enough context has already been included. Balancing performance and computational efficiency, we set the number of iteration layers to two in our current model, based on similar observations made on the development set.

Figure 6: Case study. All mentions that are linked to the same character and are in the same coreference cluster are highlighted with the same color. The misclassified mention is marked with a red cross.

Table 4: Three ablation studies are conducted concerning the MLSA layers, the coreference resolution module, and the character linking module.

Ablation Study
In this section, we present ablation studies, shown in Table 4, to clearly demonstrate the effect of the different modules in the proposed framework C^2. First, we remove the mention-level self-attention (MLSA) from the joint learning model, and a clear performance drop is observed on both tasks. Specifically, the performance on coreference resolution is reduced by 1.21 average F1, while the macro and micro F1 scores on character linking decrease by 0.77 and 0.79, respectively. This reduction shows that MLSA indeed helps to achieve better mention representations with the help of co-occurring mentions. Second, we remove the coreference resolution and character linking modules in turn. When the character linking module is removed, the performance on coreference resolution decreases by 1.94 average F1. When the coreference module is removed, the performance of C^2 on character linking drops by 0.83 on the average of micro and macro F1. These results show that modeling coreference resolution and character linking can indeed help each other and improve the performance significantly, and that the proposed joint learning framework achieves that goal.

Case Study
Besides the quantitative evaluation, in this section we present a case study to qualitatively evaluate the strengths and weaknesses of the proposed C^2 model. As shown in Figure 6, we randomly select an example from the development set to show the prediction results of the proposed model on both tasks. To illustrate the coreference resolution and character linking results from the C^2 model, mentions from the same coreference cluster are highlighted with the same color. We also use the same color to indicate which character the mentions refer to. The incorrectly predicted mention is marked with a red cross.

Strengths
For this example, the results on the two tasks are consistent: the mentions that are linked to the same character entity are in the same coreference cluster, and vice versa. Based on this observation and the previous experimental results, we are more confident that the proposed model can effectively solve the two problems at the same time. Besides that, we also notice that the model does not overfit to the popular characters: it correctly resolves mentions referring not only to the main characters but also to characters that appear only a few times, such as MAN 1. Last but not least, the proposed model can resolve a mention to the correct antecedent even when there is a long distance between them in the conversation. For example, the mention "me" in utterance 14 is correctly assigned to the mention "you" in utterance 2, even though there are 11 utterances in between. This shows that by putting the two tasks together, the proposed model can better utilize the whole conversation context. The only error made by the model is incorrectly classifying one mention and, at the same time, putting it into a wrong coreference cluster.

Weaknesses
By analyzing the error case, we notice that the model may have trouble handling mentions that require commonsense knowledge. Humans can successfully resolve the mention "her" to Danielle because they know that Danielle is on the other side of the telephone while Monica is in the house; as a result, Chandler can only be deceiving Danielle, not Monica. The current model, which relies only on the context, cannot tell the difference.

Error Analysis
We use the example in Figure 6 to present an error analysis that compares the performance of our model and the baseline models. In this example, the only mistake made by our model is related to commonsense knowledge, and the baseline models are also unable to make a correct prediction there. For coreference resolution, 3 out of 25 mentions are put into a wrong cluster by the C2F baseline model. The baseline model fails at long-distance antecedent assignments (e.g., the "me" in utterance 14). Our model does better in this case: it successfully predicts the antecedent of the mention "me", even though the corresponding antecedent is far away in utterance 2. This example demonstrates the advantage that our joint model can use global information obtained from character linking to better resolve co-referents that are far away from each other.
For character linking, 2 out of 25 mentions are linked to the wrong characters by the baseline model. We observe that the baseline model cannot consistently make correct linking predictions for characters that appear less frequently, for example, the "He" in utterance 6. In this case, our model performs better mainly because it can use information gathered from the nearby co-referents, which are correctly linked to their corresponding entities, to adjust its linking prediction.

Related Works
Coreference resolution is the task of grouping mentions into clusters such that all mentions in the same cluster refer to the same real-world entity (Pradhan et al., 2012; Zhang et al., 2019a,b; Yu et al., 2019). With the help of higher-order coreference resolution mechanisms and strong pre-trained language models (e.g., SpanBERT (Joshi et al., 2019b)), end-to-end coreference resolution systems have achieved impressive performance on the standard evaluation dataset (Pradhan et al., 2012). Recently, motivated by the success of transfer learning, Wu et al. (2020) propose to model the coreference resolution task as a question answering problem; through careful fine-tuning on a high-quality QA dataset (i.e., SQuAD 2.0 (Rajpurkar et al., 2018)), it achieves state-of-the-art performance on the standard evaluation benchmark. However, as shown by Zhang et al. (2020), current systems are still not perfect; for example, they still cannot effectively handle pronouns, especially in informal language usage scenarios such as conversations. In this paper, we propose to leverage the out-of-context character information to help resolve coreference relations with a joint learning model, which has proven effective in our experiments.

As a traditional NLP task, entity linking (Mihalcea and Csomai, 2007; Ji et al., 2015; Kolitsas et al., 2018; Raiman and Raiman, 2018; Onando Mulang et al., 2020; van Hulst et al., 2020) aims at linking mentions in context to entities in the real world (typically in the form of a knowledge graph). Typically, the mentions are named entities, and the main challenge is disambiguation. As a special case of entity linking, however, the character linking task poses its own challenge in that the majority of the mentions are pronouns.
In the experiments, we have demonstrated that when the local context is not enough, the richer context information provided by the coreference clusters could be very helpful for linking mentions to the correct characters.
In the NLP community, it has long been believed that coreference resolution and entity linking should be able to help each other. For example, Ratinov and Roth (2012) show how to use knowledge from named-entity linking to improve coreference resolution, but do not consider a joint learning approach. Later, Hajishirzi et al. (2013) demonstrate that coreference resolution and entity linking are complementary in terms of reducing the errors in both tasks. Motivated by these observations, a joint model for coreference, typing, and linking is proposed (Durrett and Klein, 2014) to improve the performance on the three tasks at the same time. Compared with previous works, the main contributions of this paper are two-fold: (1) we tackle the challenging character linking problem; (2) we design a novel mention representation encoding method, which is shown to be effective on both the coreference resolution and character linking tasks.

Conclusion
In this paper, we propose to solve the coreference resolution and character linking tasks jointly. The experimental results show that the proposed model C^2 performs better than all previous models on both tasks. Detailed analyses are also conducted to show the contributions of the different modules and the effects of the hyper-parameters.