In many NLP applications like search and information extraction for named entities, it is necessary to find all the mentions of a named entity, some of which appear as pronouns (she, his, etc.) or nominals (the professor, the German chancellor, etc.). It is therefore important that coreference resolution systems are able to link these different types of mentions to the correct entity name. We evaluate state-of-the-art coreference resolution systems for the task of resolving all mentions to named entities. Our analysis reveals that standard coreference metrics do not reflect adequately the requirements in this task: they do not penalize systems for not identifying any mentions by name to an entity and they reward systems even if systems find correctly mentions to the same entity but fail to link these to a proper name (she–the student–no name). We introduce new metrics for evaluating named entity coreference that address these discrepancies and show that for the comparisons of competitive systems, standard coreference evaluations could give misleading results for this task. We are, however, able to confirm that the state-of-the art system according to traditional evaluations also performs vastly better than other systems on the named entity coreference task.
We propose an end-to-end coreference resolution system obtained by adapting neural models that have recently improved the state-of-the-art on the OntoNotes benchmark to make them applicable to other paradigms for this task. We report the performances of our system on ANCOR, a corpus of transcribed oral French, for which it constitutes a new baseline with proper evaluation.
We explore different approaches to explicit entity modelling in language models (LM). We independently replicate two existing models in a controlled setup, introduce a simplified variant of one of the models and analyze their performance in direct comparison. Our results suggest that today’s models are limited as several stochastic variables make learning difficult. We show that the most challenging point in the systems is the decision if the next token is an entity token. The low precision and recall for this variable will lead to severe cascading errors. Our own simplified approach dispenses with the need for latent variables and improves the performance in the entity yes/no decision. A standard well-tuned baseline RNN-LM with a larger number of hidden units outperforms all entity-enabled LMs in terms of perplexity.
Clustering unlinkable entity mentions across documents in multiple languages (cross-lingual NIL Clustering) is an important task as part of Entity Discovery and Linking (EDL). This task has been largely neglected by the EDL community because it is challenging to outperform simple edit distance or other heuristics based baselines. We propose a novel approach based on encoding the orthographic similarity of the mentions using a Recurrent Neural Network (RNN) architecture. Our model adapts a training procedure from the one-shot facial recognition literature in order to achieve this. We also perform several exploratory probing tasks on our name encodings in order to determine what specific types of information are likely to be encoded by our model. Experiments show our approach provides up to a 6.6% absolute CEAFm F-Score improvement over state-of-the-art methods and successfully captures phonological relations across languages.
In the present paper, we deal with incongruences in English-German multilingual coreference annotation and present automated methods to discover them. More specifically, we automatically detect full coreference chains in parallel texts and analyse discrepancies in their annotations. In doing so, we wish to find out whether the discrepancies rather derive from language typological constraints, from the translation or the actual annotation process. The results of our study contribute to the referential analysis of similarities and differences across languages and support evaluation of cross-lingual coreference annotation. They are also useful for cross-lingual coreference resolution systems and contrastive linguistic studies.
In this paper, we present a cross-lingual neural coreference resolution system for a less-resourced language such as Basque. To begin with, we build the first neural coreference resolution system for Basque, training it with the relatively small EPEC-KORREF corpus (45,000 words). Next, a cross-lingual coreference resolution system is designed. With this approach, the system learns from a bigger English corpus, using cross-lingual embeddings, to perform the coreference resolution for Basque. The cross-lingual system obtains slightly better results (40.93 F1 CoNLL) than the monolingual system (39.12 F1 CoNLL), without using any Basque language corpus to train it.