Ensemble Transfer Learning for Multilingual Coreference Resolution

Entity coreference resolution is an important research problem with many applications, including information extraction and question answering. Coreference resolution for English has been studied extensively. However, there is relatively little work for other languages. A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data. To overcome this challenge, we design a simple but effective ensemble-based framework that combines various transfer learning (TL) techniques. We first train several models using different TL methods. Then, during inference, we compute the unweighted average scores of the models' predictions to extract the final set of predicted clusters. Furthermore, we also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts. Leveraging the idea that coreferential links naturally exist between anchor texts pointing to the same article, our method builds a sizeable distantly-supervised dataset for the target language that consists of tens of thousands of documents. We can pre-train a model on the pseudo-labeled dataset before finetuning it on the final target dataset. Experimental results on two benchmark datasets, OntoNotes and SemEval, confirm the effectiveness of our methods. Our best ensembles consistently outperform the baseline approach of simple training by up to 7.68% in the F1 score. These ensembles also achieve new state-of-the-art results for three languages: Arabic, Dutch, and Spanish.


Introduction
Within-document entity coreference resolution is the process of clustering entity mentions in a document that refer to the same entities (Ji et al., 2005; Luo and Zitouni, 2005; Ng, 2010, 2017). It is an important research problem, with applications in various downstream tasks such as entity linking (Ling et al., 2015; Kundu et al., 2018), question answering (Dhingra et al., 2018), and dialog systems (Gao et al., 2019). Researchers have recently proposed many neural methods for coreference resolution, ranging from span-based end-to-end models (Lee et al., 2017, 2018) to formulating the task as a question answering problem (Wu et al., 2020b). Given enough annotated training data, deep neural networks can learn to extract useful features automatically. As a result, on English benchmarks with abundant labeled training documents, the mentioned neural methods consistently outperform previous handcrafted feature-based techniques (Raghunathan et al., 2010; Lee et al., 2013), achieving new state-of-the-art (SOTA) results. (Data and code will be made available upon publication.)
Compared to the amount of research on English coreference resolution, there is relatively little work for other languages. A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data. For example, the benchmark OntoNotes dataset contains about eight times more documents in English than in Arabic (Pradhan et al., 2012). Some recent studies aim to overcome this challenge by applying standard cross-lingual transfer learning (TL) methods such as continued training or joint training (Kundu et al., 2018; Pražák et al., 2021). In continued training, a model pretrained on a source dataset is further finetuned on a (typically smaller) target dataset (Xia and Van Durme, 2021). In joint training, a model is trained on the concatenation of the source and target datasets (Min, 2021). The mentioned studies only use one transfer method at a time, and they do not explore how to combine multiple TL techniques effectively. This can be sub-optimal since different learning methods can be complementary (Liu et al., 2019; Li et al., 2021). For example, our experimental results to be discussed later show that continued training and joint training are highly complementary. Furthermore, a disadvantage of using a cross-lingual transfer method is the requirement of a labeled coreference resolution dataset in some source language (usually English).
In this work, we propose an effective ensemble-based framework for combining various TL techniques. We first train several coreference models using different TL methods. During inference, we compute the unweighted average scores of the models' predictions to extract the final set of mention clusters. We also propose a low-cost TL method that bootstraps coreference models without using a labeled dataset in some source language. The basic idea is that the coreference relation often holds between anchor texts pointing to the same Wikipedia article. Based on this observation, our TL method builds a sizable distantly-supervised dataset for the target language from Wikipedia. We can then pretrain a model on the pseudo-labeled dataset before finetuning it on the final target dataset. Experimental results on two datasets, OntoNotes and SemEval (Recasens et al., 2010), confirm the effectiveness of our proposed methods. Our best ensembles outperform the baseline approach of simple training by up to 7.68% absolute gain in the F1 score. These ensembles also achieve new SOTA results for three languages: Arabic, Dutch, and Spanish.
In summary, our main contributions include:

• We introduce an ensemble-based framework that combines various TL methods effectively.
• We design a new TL method that leverages Wikipedia to bootstrap coreference models.
• Extensive experimental results show that our proposed methods are highly effective and provide useful insights into entity coreference resolution for non-English languages.

Methods
Figure 1 shows an overview of our framework. During the training stage, we train several coreference resolution models using various TL approaches. For simplicity, we use the same span-based architecture (Section 2.1) for every model to be trained. However, starting from the same architecture, using different learning methods typically results in models with different parameters. In this work, our framework uses two types of TL methods: (a) cross-lingual TL approaches (Section 2.2) and (b) our newly proposed Wikipedia-based approach (Section 2.3). The cross-lingual TL methods require a labeled coreference resolution dataset in some source language, but our Wikipedia-based method does not have that limitation. Our framework is general as it can work with other learning methods (e.g., self-distillation). During inference, we use a simple unweighted averaging method to combine the trained models' predictions (Section 2.4).

Span-based End-to-End Coreference Resolution
In this work, the architecture of every model is based on the popular span-based e2e-coref model (Lee et al., 2017). Given an input document consisting of n tokens, our model first forms a contextualized representation for each input token using a multilingual Transformer encoder such as XLM-R (Conneau et al., 2020). Let X = (x_1, ..., x_n) be the output of the encoder. For each candidate span i, we define its representation g_i as:

$g_i = \left[x_{\mathrm{START}(i)}, x_{\mathrm{END}(i)}, \hat{x}_i, \phi(s_i)\right]$  (1)

where START(i) and END(i) denote the start and end indices of span i, respectively. $\hat{x}_i$ is an attention-weighted sum of the token representations in the span (Lee et al., 2017). $\phi(s_i)$ is a feature vector encoding the size of the span.
To maintain tractability, we only consider spans with up to L tokens. The value of L is selected empirically and set to be 30. All the span representations are fed into a mention scorer s_m(.):

$s_m(i) = \mathrm{FFNN}_m(g_i)$  (2)

where FFNN_m is a feedforward neural network with ReLU activations. Intuitively, s_m(i) indicates whether span i is indeed an entity mention.
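To make the span scoring concrete, below is a minimal PyTorch sketch of Equations 1 and 2. It assumes the encoder outputs and candidate span boundaries are already computed; the module name, dimensions, and helper layers are illustrative, not taken from our implementation.

```python
# Minimal sketch of the span representation (Eq. 1) and mention scorer (Eq. 2).
# Assumes encoder outputs are precomputed; all dimensions are illustrative.
import torch
import torch.nn as nn

class MentionScorer(nn.Module):
    def __init__(self, hidden_dim=768, feat_dim=20, ffnn_dim=150):
        super().__init__()
        self.head_attn = nn.Linear(hidden_dim, 1)        # token-level attention
        self.width_emb = nn.Embedding(30, feat_dim)      # phi(s_i): span-size feature
        span_dim = 3 * hidden_dim + feat_dim             # [x_start; x_end; x_hat; phi]
        self.ffnn_m = nn.Sequential(                     # FFNN_m with ReLU activations
            nn.Linear(span_dim, ffnn_dim), nn.ReLU(), nn.Linear(ffnn_dim, 1))

    def forward(self, x, starts, ends):
        # x: (n, hidden_dim) token representations; starts/ends: (k,) span indices
        g = []
        for s, e in zip(starts.tolist(), ends.tolist()):
            tokens = x[s:e + 1]                              # tokens inside the span
            alpha = torch.softmax(self.head_attn(tokens), 0) # attention weights
            x_hat = (alpha * tokens).sum(0)                  # weighted head representation
            phi = self.width_emb(torch.tensor(e - s))        # span-size feature vector
            g.append(torch.cat([x[s], x[e], x_hat, phi]))    # g_i (Eq. 1)
        g = torch.stack(g)
        return g, self.ffnn_m(g).squeeze(-1)                 # s_m(i) (Eq. 2)
```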
After scoring the spans using FFNN_m, we only keep spans with high mention scores. We denote the set of the unpruned spans as S. Then, for each remaining span i ∈ S, the model predicts a distribution P(j) over its antecedents j ∈ Y(i):

$s(i, j) = s_m(i) + s_m(j) + \mathrm{FFNN}_s\left(\left[g_i, g_j, g_i \circ g_j, \phi(i, j)\right]\right)$
$P(j) = \frac{\exp(s(i, j))}{\sum_{j' \in Y(i)} \exp(s(i, j'))}$  (3)

where Y(i) = {ε, 1, ..., i − 1} is a set consisting of a dummy antecedent ε and all spans that precede i.
The dummy antecedent represents two possible cases: (1) the span i is not an entity mention, or (2) the span i is an entity mention, but it is not coreferential with any remaining preceding span. FFNN_s is a feedforward network, and ∘ denotes elementwise multiplication. φ(i, j) encodes the distance between the two spans i and j. Finally, note that s(i, ε) is fixed to be 0.
Given a labeled document D and a model with parameters θ, we define the mention detection loss:

$L_{\mathrm{mention}}(D, \theta) = -\sum_{i} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]$  (4)

where ŷ_i = sigmoid(s_m(i)), and y_i = 1 if and only if span i is in one of the gold-standard mention clusters. In addition, we also want to maximize the marginal log-likelihood of all correct antecedents implied by the gold-standard clustering:

$L_{\mathrm{cluster}}(D, \theta) = -\sum_{i \in S} \log \sum_{\hat{y} \in Y(i) \cap \mathrm{GOLD}(i)} P(\hat{y})$

where GOLD(i) is the set of gold antecedents for span i. P(ŷ) is calculated using Equation 3. Our final loss combines mention detection and clustering:

$L(D, \theta) = L_{\mathrm{mention}}(D, \theta) + L_{\mathrm{cluster}}(D, \theta)$
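The combined objective can be sketched as follows, in simplified form: the GOLD(i) sets are encoded as a binary mask over each span's candidate antecedents, and the function name and tensor shapes are illustrative.

```python
# Sketch of the combined loss: binary cross-entropy for mention detection plus the
# marginal log-likelihood over gold antecedents. Shapes and masks are illustrative.
import torch
import torch.nn.functional as F

def coref_loss(mention_scores, pair_scores, mention_labels, gold_antecedents):
    # mention_scores: (k,) s_m(i); mention_labels: (k,) float, 1.0 if span i is gold
    # pair_scores: (k, k+1) s(i, j), with column 0 for the dummy antecedent (score 0)
    # gold_antecedents: (k, k+1) binary mask marking GOLD(i); each row marks the
    # dummy column when span i has no gold antecedent, so no row is all zeros
    l_mention = F.binary_cross_entropy_with_logits(mention_scores, mention_labels)
    log_probs = F.log_softmax(pair_scores, dim=-1)      # log P(j) over Y(i) (Eq. 3)
    gold_mass = torch.logsumexp(                        # marginalize over gold antecedents
        log_probs.masked_fill(gold_antecedents == 0, float("-inf")), dim=-1)
    l_cluster = -gold_mass.mean()
    return l_mention + l_cluster                        # final combined loss
```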

Cross-Lingual Transfer Learning
Inspired by previous studies (Xia and Van Durme, 2021; Min, 2021; Pražák et al., 2021), we investigate two different cross-lingual transfer learning methods: continued training and joint training.
Both methods assume the existence of a labeled dataset in some source language. In this work, we use the English OntoNotes dataset (Pradhan et al., 2012) as the source dataset, as it contains nearly 3,500 annotated documents (Table 1).
Continued Training. We first train a coreference resolution model on the source dataset until convergence. After that, we further finetune the pretrained model on a target dataset. More formally, let M(f, θ_0) denote an optimization procedure for f with initial guess θ_0. This optimization procedure can, for example, be the application of some stochastic gradient descent algorithm. Also, let S be the set of all training documents in the source dataset, and let T denote the set of all training documents in the target dataset. Then, the first stage of continued training can be described as:

$\theta_1 = M\left(\sum_{D \in S} L(D, \theta), \theta_0\right)$  (5)

where θ_0 is randomly initialized. The second stage can then be described as:

$\theta_2 = M\left(\sum_{D \in T} L(D, \theta), \theta_1\right)$  (6)

Here, θ_2 is the parameter set of the final model.
Joint Training. We combine both the source and target datasets to train a model. More specifically, using the same notations as above, we can describe joint training by the following equation:

$\theta_1 = M\left(\sum_{D \in S \cup T} L(D, \theta), \theta_0\right)$  (7)

where θ_0 is randomly initialized, and θ_1 is the parameter set of the final model.
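The two recipes differ only in which documents each optimization stage sees, as the following sketch makes explicit. The train() helper stands in for the optimization procedure M; all names here are hypothetical.

```python
# Contrast of the two cross-lingual TL recipes. `train` stands in for the
# optimization procedure M(f, theta_0); `source_docs`/`target_docs` are the
# labeled datasets S and T. All names are hypothetical.

def continued_training(model, source_docs, target_docs, train):
    train(model, source_docs)                # stage 1: fit on the source language (Eq. 5)
    train(model, target_docs)                # stage 2: finetune on the target language (Eq. 6)
    return model

def joint_training(model, source_docs, target_docs, train):
    train(model, source_docs + target_docs)  # single stage on the union of S and T (Eq. 7)
    return model
```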

Bootstrapping using Wikipedia Hyperlinks
While cross-lingual methods such as continued training and joint training are conceptually simple and typically effective (Huang et al., 2020), they require the existence of a labeled dataset in some source language. To overcome this limitation, we propose an inexpensive TL method that bootstraps coreference models by utilizing Wikipedia anchor texts. The basic idea is that two anchor texts pointing to the same Wikipedia page are likely coreferential (see Figure 2 for an example). Our method builds a large distantly-supervised dataset W for the target language by leveraging this observation:

$W = \{D_1, D_2, ..., D_{|W|}\}$  (8)

where D_i is a text document constructed from some Wikipedia page written in the target language. The number of mentions in D_i is the same as the number of anchor texts in the text portion of the original Wikipedia article. We consider two mentions in D_i to be coreferential if and only if their corresponding anchor texts point to the same article.
After constructing W, we follow a two-step process similar to the continued training approach. We first train a coreference resolution model on W until convergence. Then, we finetune the pre-trained model on the final target dataset.
Compared to a manually-labeled dataset, W has several disadvantages. Most notably, entity mentions are not exhaustively marked in Wikipedia documents. For example, in Spanish Wikipedia, pronouns are typically not annotated. Nevertheless, since Wikipedia is one of the largest multilingual repositories of information, W is generally large (see Table 1 for some statistics), and it contains documents on various topics. As such, W can still provide some useful distant supervision signals, and so it can serve as a source dataset in the TL process.

Ensemble-Based Coreference Resolution
During the training stage, we train three different coreference resolution models using the TL approaches described above. At test time, we use a simple unweighted averaging method to combine the models' predictions. More specifically, for a candidate span i with no more than L tokens, we compute its mention score as follows:

$s_{m,\mathrm{ensemble}}(i) = \frac{1}{3}\left(s_{m,1}(i) + s_{m,2}(i) + s_{m,3}(i)\right)$  (9)

where s_{m,1}(i), s_{m,2}(i), and s_{m,3}(i) are the mention scores produced by the three trained models separately (refer to Equation 2). Intuitively, these scores indicate whether span i is an entity mention.
Similar to the process described in Section 2.1, after scoring every span whose length is no more than L using Equation 9, we only keep spans with high mention scores. Then, for each remaining span i, we predict a distribution over its antecedents j ∈ Y(i) as follows:

$P_{\mathrm{ensemble}}(j) = \frac{\exp(s_{\mathrm{ensemble}}(i, j))}{\sum_{j' \in Y(i)} \exp(s_{\mathrm{ensemble}}(i, j'))}, \quad s_{\mathrm{ensemble}}(i, j) = \frac{1}{3}\left(s_1(i, j) + s_2(i, j) + s_3(i, j)\right)$  (10)

where s_1(i, j), s_2(i, j), and s_3(i, j) are the pairwise scores produced by the trained models separately (Eq. 3). We fix s_ensemble(i, ε) to be 0. After computing the antecedent distribution for each remaining span, we can extract the final set of mention clusters. Note that while we consider only three individual TL methods in this work, Equation 9 and Equation 10 can easily be extended for the case when we use more TL methods.
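A sketch of the ensembling step, assuming each member model exposes its mention and pairwise scores as tensors (function names and shapes are illustrative):

```python
# Sketch of the unweighted ensemble (Eqs. 9 and 10): average the members'
# mention and pairwise scores, then take the softmax over each span's candidates.
import torch

def ensemble_mention_scores(member_scores):
    # member_scores: list of (k,) tensors, one s_m(.) per trained model
    return torch.stack(member_scores).mean(dim=0)        # Eq. 9

def ensemble_antecedent_dist(member_pair_scores):
    # member_pair_scores: list of (k, k+1) tensors s(i, j), column 0 = dummy (score 0)
    avg = torch.stack(member_pair_scores).mean(dim=0)    # Eq. 10
    return torch.softmax(avg, dim=-1)                    # P_ensemble(j) over Y(i)
```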
Experiments

Datasets Table 1 shows the basic statistics of all the datasets we used in this work. When using a cross-lingual TL method, we use the English portion of OntoNotes (Pradhan et al., 2012) as the source dataset. We explore three target datasets: OntoNotes Arabic (Pradhan et al., 2012), SemEval Dutch (Recasens et al., 2010), and SemEval Spanish (Recasens et al., 2010). These datasets contain data in three different languages.
OntoNotes does not annotate singleton mentions (i.e., noun phrases not involved in any coreference chain). It only has annotations for non-singleton mentions. SemEval has annotations for singletons.

Wikipedia-based Dataset Construction
To construct a distantly-supervised dataset, we first download a complete Wikipedia dump in the target language. We then extract clean text and hyperlinks from the dump using WikiExtractor.
For each preprocessed article, we cluster its anchor texts based on the destinations of their hyperlinks. We also filter out articles with too few coreference links (e.g., articles that only have singleton mentions).
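The per-article construction step can be sketched as follows, assuming anchors have already been extracted as (start, end, target_page) triples from the WikiExtractor output; the triple format and the minimum-cluster-size threshold are illustrative assumptions, not our exact pipeline.

```python
# Sketch of turning one preprocessed article into a pseudo-labeled document:
# anchors that point to the same target page form one coreference cluster.
# The anchor triples and the filtering threshold are illustrative assumptions.
from collections import defaultdict

def build_pseudo_doc(tokens, anchors, min_links=2):
    # anchors: list of (start_tok, end_tok, target_page) for each hyperlink
    clusters = defaultdict(list)
    for start, end, target in anchors:
        clusters[target].append((start, end))            # group mentions by destination
    # keep only clusters with a genuine coreference link (>= 2 mentions)
    clusters = [spans for spans in clusters.values() if len(spans) >= min_links]
    if not clusters:                                     # drop singleton-only articles
        return None
    return {"tokens": tokens, "clusters": clusters}
```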
General Hyperparameters We use two different learning rates, one for the lower pretrained Transformer encoder and one for the upper layers. For every setting, the lower learning rate is 1e-5, the upper learning rate is 1e-4, and the span length limit L is 30.

Transformer Encoders When the target dataset is OntoNotes Arabic, we use GigaBERT (Lan et al., 2020) as the Transformer encoder. GigaBERT is an English-Arabic bilingual language model pretrained from the English and Arabic Gigaword corpora. When the target dataset is SemEval Dutch or SemEval Spanish, we use the multilingual XLM-RoBERTa (XLM-R) Transformer model (Conneau et al., 2020). More specifically, we use the base version of XLM-R (i.e., xlm-roberta-base).
Span Pruning As described in Section 2.1, after computing a mention score for each span whose length is not more than L, we only keep spans with high scores. More specifically, when working with a dataset from OntoNotes (e.g., OntoNotes Arabic), we only keep up to λn spans with the highest mention scores (Lee et al., 2017). The value of λ is selected empirically and set to be 0.18. When working with any other dataset, we keep every span that has a positive mention score.
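A sketch of the two pruning rules, with λ = 0.18 as stated above; the function signature is illustrative.

```python
# Sketch of the two pruning rules described above: keep the top lambda * n spans
# for OntoNotes-style data, or every span with a positive score otherwise.
import torch

def prune_spans(mention_scores, n_tokens, use_ratio, lam=0.18):
    # mention_scores: (k,) scores s_m(i); n_tokens: document length n
    if use_ratio:                                           # OntoNotes-style pruning
        k = min(int(lam * n_tokens), mention_scores.numel())
        return torch.topk(mention_scores, k).indices        # top lambda*n mention scores
    return (mention_scores > 0).nonzero(as_tuple=True)[0]   # positive-score rule
```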

Overall Results
Table 2 shows the overall performance of different approaches. Our baseline approach is to simply train a model with the architecture described in Section 2.1 using only the target dataset of interest. Overall, the performance of a model trained using the baseline approach is positively correlated with the size of the corresponding target dataset, which is expected. A surprising finding is that our baseline approach already outperforms the previous SOTA method for SemEval Spanish (Xia and Van Durme, 2021) by 20.98% in the F1 score. We speculate that the previous SOTA model for SemEval Spanish is severely undertrained.
Table 2 also shows the results of using different TL methods individually. Each of the TL methods can help improve the coreference resolution performance. While continued training seems to be the most effective approach, it requires the existence of a source dataset (OntoNotes English in this case). On the other hand, our newly proposed Wikipedia-based method can help improve the performance without relying on any labeled source dataset.
Finally, Table 2 also shows the results of using different combinations of learning approaches. Our simple unweighted averaging method is effective across almost all model combinations. In particular, by combining all three TL methods discussed previously, we can outperform the previous SOTA methods by large margins. In addition, even without using any labeled source dataset, the combination [Baseline Approach ⊕ Wikipedia Pre-Training] can still outperform the previous SOTA methods for Arabic and Spanish. This further confirms the usefulness of our Wikipedia-based TL method. Lastly, combining three models trained using the same baseline approach leads to smaller gains than combining the three TL methods. This is expected, as ensemble methods typically work best when the individual learners are diverse (Krogh and Vedelsby, 1994; Melville and Mooney, 2003).

How optimal is our simple unweighted averaging method?
Our averaging approach is equivalent to linear interpolation with equal weights. To analyze the optimality of our method, we compare it to the "best possible" interpolation method.
More specifically, we assume that there is an oracle that can tell us which model in an ensemble gives the most accurate prediction for a particular latent variable. Then, for example, suppose we want to score a span i using an ensemble of three models. If i is an entity mention, the oracle will tell us that the model that returns the highest mention score for i is the most accurate. Thus, we can set the score for i to be max(s_{m,1}(i), s_{m,2}(i), s_{m,3}(i)). Following the same logic, if i is not an entity mention, we will set its score to be min(s_{m,1}(i), s_{m,2}(i), s_{m,3}(i)). The same idea can be applied to compute the linking score s_ensemble(i, j) between i and j.
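For completeness, the oracle scoring rule can be sketched as follows, assuming gold labels are available as a boolean mask (as they are in this analysis setting); the function name is illustrative.

```python
# Sketch of the oracle-guided interpolation used for analysis: with gold labels,
# take the max of the members' scores for true mentions/links and the min otherwise.
import torch

def oracle_scores(member_scores, is_positive):
    # member_scores: list of (k,) tensors, one per model; is_positive: (k,) bool gold mask
    stacked = torch.stack(member_scores)                 # (num_models, k)
    return torch.where(is_positive,
                       stacked.max(dim=0).values,        # gold-positive: most confident yes
                       stacked.min(dim=0).values)        # gold-negative: most confident no
```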
In Table 2, we see a considerable gap between the performance of our simple averaging method and the oracle-guided interpolation method. Therefore, a promising future direction is to experiment with a more context-dependent ensemble method. Nevertheless, our averaging method is simple, and it does not require any further parameter tuning to combine a set of existing models. Finally, the performance of each oracle-guided ensemble is far from perfect, implying that improving the underlying architecture of each model can also be a worthwhile effort.

How effective is our framework in extremely low-resource settings?
We conduct experiments on OntoNotes Arabic where we assume that the training dataset for Arabic only has 10 documents and that we do not have any source dataset (Table 4). In this setting, our ensemble substantially outperforms the baseline approach by up to 5.18% in the F1 score.

Qualitative Analysis
We provide some qualitative analyses to demonstrate the strengths of our ensembles in Table 5.
In the first example, the three highlighted mentions refer to Anna Mas, the director of a center. Our model trained using joint training merged this cluster with a different cluster that refers to a different entity (not shown in the example because of space constraints). In contrast, our models trained using other TL methods did not make that error. As a result, our best ensemble for Spanish predicted the correct cluster for Anna Mas.
The second example is in Dutch. Here, mannelijk muis can be translated as male mice, while hen can be translated as them. Our model trained using continued training failed to extract the mention mannelijk muis. Nevertheless, in the end, our ensemble for Dutch was able to extract the mention and correctly link it to the pronoun hen.

Related Work

Entity Coreference Resolution
Recently, neural models for entity coreference resolution have shown superior performance over approaches using hand-crafted features. Lee et al. (2017) proposed the first end-to-end neural coreference resolution model named e2e-coref. The model uses a bi-directional LSTM and a head-finding attention mechanism to learn mention representations and calculate mention and antecedent scores. Lee et al. (2018) extended the e2e-coref model by introducing a coarse-to-fine pruning mechanism and a higher-order inference mechanism. The model uses ELMo representations (Peters et al., 2018) instead of traditional word embeddings. The model is typically referred to as the c2f-coref model.
Almost all recent studies on entity coreference resolution are influenced by the design of c2f-coref. Joshi et al. (2019) built the c2f-coref system on top of BERT representations (Devlin et al., 2019). Fei et al. (2019) transformed c2f-coref into a policy gradient model that can optimize coreference evaluation metrics directly. Xu and Choi (2020) studied in depth the higher-order inference (HOI) mechanism of c2f-coref. The authors concluded that given a high-performing encoder such as SpanBERT (Joshi et al., 2020), the impact of HOI is negative to marginal. Another line of work aims to simplify and/or reduce the computational complexity of c2f-coref (Xia et al., 2020; Kirstain et al., 2021; Lai et al., 2021; Dobrovolskii, 2021).
The studies mentioned above only trained and evaluated models using English datasets such as OntoNotes English (Pradhan et al., 2012) and the GAP dataset (Webster et al., 2018). On the other hand, there is significantly less work on coreference resolution for other languages. For example, while e2e-coref was introduced in 2017, the first neural coreference resolver for Arabic was only recently proposed in 2020 (Aloraini et al., 2020). For Dutch, many existing systems are still using rule-based (van Cranenburgh, 2019) or traditional learning-based approaches (Hendrickx et al., 2008; De Clercq et al., 2011). Recently, Poot and van Cranenburgh (2020) evaluated the performance of c2f-coref on Dutch datasets of two different domains: literary novels and news/Wikipedia text.
While our models' architecture is based on e2e-coref (Section 2.1), we go beyond just applying the models to a non-English language in this work. We propose new TL approaches that can take advantage of existing source datasets and Wikipedia to improve the final performance.

Transfer Learning for Coreference Resolution
Compared to English datasets, the size of a coreference resolution dataset for a non-English language is typically smaller. Several recent studies aim to overcome this challenge by applying standard cross-lingual TL methods such as continued training or joint training (Kundu et al., 2018; Xia and Van Durme, 2021; Pražák et al., 2021; Min, 2021).
These studies only use one transfer method at a time, and they do not explore how to combine multiple TL techniques effectively. Our experimental results (Section 3.2) show that combining various TL techniques can substantially improve the final coreference resolution performance.
A closely related work by Yang et al. (2012) proposed an adaptive ensemble method to adapt coreference resolution across domains. Their study did not explicitly focus on improving coreference resolution for non-English languages. In addition, they experimented with settings where gold-standard mentions are assumed to be provided. We do not make that assumption: each of our models performs both mention extraction and linking.

Leveraging Wikipedia for Coreference Resolution
There have been studies on leveraging Wikipedia for coreference resolution. Eirew et al. (2021) recently created a large-scale cross-document event coreference dataset from English Wikipedia. For cross-document entity coreference, Singh et al. (2012) created Wikilinks by finding hyperlinks to English Wikipedia from a web crawl and using anchor texts as mentions. Different from these studies, we focus on within-document entity coreference resolution. In addition, we explore coreference resolution for languages beyond English in this work.
Many previous studies leveraged Wikipedia for related tasks such as name tagging (Alotaibi and Lee, 2012; Nothman et al., 2013; Althobaiti et al., 2014) and entity linking (Pan et al., 2017; Wu et al., 2020a; Cao et al., 2021). We leave the extension of our methods to these tasks for future research.
Conclusion

In this work, we propose an ensemble-based framework that combines various TL techniques. We also introduce a low-cost Wikipedia-based TL approach that does not require any labeled source dataset. Our approaches are highly effective, as our best ensembles achieve new SOTA results for three different languages. An interesting future direction is to explore the use of model compression techniques (Hinton et al., 2015; Han et al., 2016; Lai et al., 2020) to reduce the computational complexity of our ensembles.

Limitations
Multilingual language models such as XLM-R (Conneau et al., 2020) and GigaBERT (Lan et al., 2020) are typically pre-trained on large amounts of unlabeled text crawled from the Web. Since these models are optimized to capture the statistical properties of the training data, they tend to pick up on and amplify social stereotypes present in the data (Kurita et al., 2019). Since our coreference resolution models use such pre-trained language models, they may also exhibit social biases present on the Web. Identifying and mitigating social biases in neural models is an active area of research (Zhao et al., 2018; Sheng et al., 2021; Gupta et al., 2022). In the future, we plan to work on removing social biases from coreference resolution models.
Furthermore, while our proposed methods are highly effective, the performance of our best ensembles is still far from perfect. On OntoNotes Arabic, our best system only achieves an F1 score of 66.72%. Such performance may not be acceptable for some downstream tasks (e.g., information extraction from critical clinical notes).
Finally, even though Wikipedia is available in more than 300 languages, there are still very few Wikipedia pages for some very rare languages. Our proposed methods are likely to be less effective for such rare languages.

A Reproducibility Information
In this section, we present the reproducibility information of our paper.
Computing Infrastructure The experiments were conducted on a server with Intel(R) Xeon(R) Gold 5120 CPUs @ 2.20GHz and NVIDIA Tesla V100 GPUs. Each GPU has 16GB of memory.

Number of Model Parameters
When the target dataset is OntoNotes Arabic, we use GigaBERT (Lan et al., 2020) as the Transformer encoder. GigaBERT has about 125M parameters.
Hyperparameters The information about the hyperparameters is available in the main paper.

Expected Validation Performance
For each language, we report the validation performance of the ensemble [Continued Training ⊕ Joint Training ⊕ Wikipedia Pre-Training]. The validation F1 score of the ensemble for Arabic coreference resolution is 66.60%. The total time needed for the evaluation is about 1 minute and 19 seconds.

The validation F1 score of the ensemble for Dutch coreference resolution is 57.81%. The total time needed for the evaluation is about 20 seconds.
The validation F1 score of the ensemble for Spanish coreference resolution is 75.73%. The total time needed for the evaluation is about 1 minute and 42 seconds.

Figure 1: An overview of our framework. We first train several coreference resolution models using different TL approaches. During inference, we use a simple unweighted averaging method to combine the models' predictions.
Figure 2: Since the hyperlinks of Tigre de México and Tigres UANL point to the same Wikipedia page, a person who does not know Spanish can still guess that the two mentions are likely to be coreferential. In fact, the two mentions both refer to Tigres UANL, a Mexican professional football club.

Table 1: Number of documents for each of the datasets.

Table 3: Test F1 (in %) on the target datasets and the previous SOTA on each dataset (to the best of our knowledge).

Table 4: Test F-score (in %) of various approaches on OntoNotes Arabic when we restrict the size of the gold Arabic training dataset to only 10 documents.

Table 5: Examples of mention clusters that were correctly predicted by our ensembles. Blue spans represent coreferential mentions. The first example is in Spanish. The second example is in Dutch.

Example 1 (Spanish): ... el director del centro, Anna Mas, asegurar que el acto pretender "rechazar el agresión y concienciar a el alumno del incremento de este ataque". el director recordar otro dos agresión "por llevar hierro dental o el pelo largo". en mucho ocasión se producir asalto a niño, y el alumno, añadir Mas, "ver como algo normal que les parir por el calle y les quitar el poco dinero que llevar encima" ...

Example 2 (Dutch): ... Het vakblad Hormones and Behavior beschrijven hoe het voldoend zijn dat mannelijk muis een vleug vrouw roken om hen weinig bang te maken van kat en wezel ...