pdf
bib
Proceedings of the Eighth Workshop on Computational Models of Reference, Anaphora and Coreference
Maciej Ogrodniczuk
|
Michal Novak
|
Massimo Poesio
|
Sameer Pradhan
|
Vincent Ng
pdf
bib
abs
Referential ambiguity and clarification requests: comparing human and LLM behaviour
Chris Madge
|
Matthew Purver
|
Massimo Poesio
In this work we examine LLMs’ ability to ask clarification questions in task-oriented dialogues that follow the asynchronous instruction-giver/instruction-follower format. We present a new corpus that combines two existing annotations of the Minecraft Dialogue Corpus — one for reference and ambiguity in reference, and one for SDRT including clarifications — into a single common format providing the necessary information to experiment with clarifications and their relation to ambiguity. With this corpus we compare LLM actions with original human-generated clarification questions, examining how both humans and LLMs act in the case of ambiguity. We find that there is only a weak link between ambiguity and humans producing clarification questions in these dialogues, and low correlation between humans and LLMs. Humans hardly ever produce clarification questions for referential ambiguity, but often do so for task-based uncertainty. Conversely, LLMs produce more clarification questions for referential ambiguity, but less so for task uncertainty. We question if LLMs’ ability to ask clarification questions is predicated on their recent ability to simulate reasoning, and test this with different reasoning approaches, finding that reasoning does appear to increase question frequency and relevancy.
pdf
bib
abs
Coreference in simplified German: Linguistic features and challenges of automatic annotation
Sarah Jablotschkin
|
Ekaterina Lapshinova-Koltunski
|
Heike Zinsmeister
In this paper, we analyse coreference annotation of the German language, focussing on the phenomenon of simplification, that is, the tendency to use words and constructions that are assumed to be easier to perceive, understand, or produce. Simplification is one of the tools used by language users in order to optimise communication. We are interested in how simplification is reflected in coreference in two different language products exposed to the phenomenon of simplification: simultaneous interpreting and Easy German. For this, we automatically annotate simplified texts with coreference. We then evaluate the outputs of automatic annotation. In addition, we look into quantitative distributions of some coreference features. Our findings show that although the language products under analysis diverge in terms of the factors driving simplification, they share some specific coreference features. We also show that this specificity may cause annotation errors in simplified language, e.g. in non-nominal or split antecedents.
pdf
bib
abs
Revisiting the Givenness Hierarchy. A Corpus-Based Evaluation
Christian Chiarcos
Gundel et al.’s Givenness Hierarchy remains one of the most influential frameworks of Information Status to this date, and has been employed in different technical contexts to account for context-sensitive and hearer-tailored language in human-machine interaction and natural language processing, as well as a topic of linguistic inquiry. At the same time, the data basis upon which this theory has been developed remains relatively thin. Although its applicability to a broad array of languages has been repeatedly confirmed, the empirical evidence presented for certain phenomena, in particular with respect to demonstrative determiners and demonstrative pronouns, did not always reach conventional levels of statistical significance. In this paper, we provide an empirical, corpus-based re-assessment of two seminal papers for the Givenness Hierarchy, Gundel et al. (1990) and Gundel et al. (1993), where we aim to replicate their findings on the basis of corpora with coreference annotation for their original sample of languages, i.e., Arabic, Chinese, English, Japanese, Korean, Russian and Spanish. We describe the operationalization of Gundel et al.’s ‘cognitive statuses’, their approximation by means of anaphoric relations, and the preprocessing of diverse and heterogeneous corpora, and we evaluate Gundel et al.’s claims. Our contribution is three-fold: we evaluate the Givenness Hierarchy against quantitative data at a scale that allows statistical significance to be assessed; we discuss challenges and problems encountered in the process, in the preprocessing and in the interpretation of the diverse corpora; and we provide two generalizations: a procedure for bootstrapping Givenness Hierarchies for other languages, and possible cross-linguistically applicable tendencies in the systems of referring expressions.
pdf
bib
abs
Mention detection with LLMs in pair-programming dialogue
Cecilia Domingo
|
Paul Piwek
|
Svetlana Stoyanchev
|
Michel Wermelinger
We tackle the task of mention detection for pair-programming dialogue, a setting which adds several challenges to the task due to the characteristics of natural dialogue, the dynamic environment of the dialogue task, and the domain-specific vocabulary and structures. We compare recent variants of the Llama and GPT families and explore different prompt and context engineering approaches. While aspects like hesitations and references to read-out code and variable names made the task challenging, GPT 4.1 approximated human performance when we provided few-shot examples similar to the inference text and corrected formatting errors.
pdf
bib
abs
The Elephant in the Coreference Room: Resolving Coreference in Full-Length French Fiction Works
Antoine Bourgois
|
Thierry Poibeau
While coreference resolution is attracting more interest than ever from computational literature researchers, representative datasets of fully annotated long documents remain surprisingly scarce. In this paper, we introduce a new annotated corpus of three full-length French novels, totaling over 285,000 tokens. Unlike previous datasets focused on shorter texts, our corpus addresses the challenges posed by long, complex literary works, enabling evaluation of coreference models in the context of long reference chains. We present a modular coreference resolution pipeline that allows for fine-grained error analysis. We show that our approach is competitive and scales effectively to long documents. Finally, we demonstrate its usefulness to infer the gender of fictional characters, showcasing its relevance for both literary analysis and downstream NLP tasks.
pdf
bib
abs
Towards Adding Arabic to CorefUD
Dima Taji
|
Daniel Zeman
Training models that can perform well on various NLP tasks requires large amounts of data, which becomes even more apparent with more nuanced tasks such as anaphora and coreference resolution. This paper presents the automatic creation of an Arabic CorefUD dataset through the automatic conversion of the existing gold-annotated OntoNotes.
pdf
bib
abs
Exploring Coreference Resolution in Glosses of German Sign Language
Yuzheng Bao
|
Haixia Chai
In recent years, research on sign languages has attracted increasing attention in the NLP community and requires more effort from a linguistic perspective. In this paper, we explore coreference resolution in German Sign Language (GSL) primarily through gloss-based analysis. Specifically, in GSL glosses, we conduct a linguistic analysis of coreference, add coreference annotations based on one video, and evaluate the ability of two large language models to resolve coreference. We gain valuable insights into coreference resolution in GSL, which pave the way for future research.
pdf
bib
abs
Impact of ASR Transcriptions on French Spoken Coreference Resolution
Kirill Milintsevich
This study introduces a new ASR-transcribed coreference corpus for French and explores the transferability of coreference resolution models from human-transcribed to ASR-transcribed data. Given the challenges posed by differences in text characteristics and errors introduced by ASR systems, we evaluate model performance using newly constructed parallel human-ASR silver training and gold validation datasets. Our findings show a decline in performance on ASR data for models trained on manual transcriptions. However, combining silver ASR data with gold manual data enhances model robustness. Through detailed error analysis, we observe that models emphasizing recall are more resilient to ASR-induced errors compared to those focusing on precision. The resulting ASR corpus, along with all related materials, is freely available under the CC BY-NC-SA 4.0 license at: https://github.com/ina-foss/french-asr-coreference.
pdf
bib
abs
Findings of the Fourth Shared Task on Multilingual Coreference Resolution: Can LLMs Dethrone Traditional Approaches?
Michal Novák
|
Miloslav Konopik
|
Anna Nedoluzhko
|
Martin Popel
|
Ondrej Prazak
|
Jakub Sido
|
Milan Straka
|
Zdeněk Žabokrtský
|
Daniel Zeman
The paper presents an overview of the fourth edition of the Shared Task on Multilingual Coreference Resolution, organized as part of the CODI-CRAC 2025 workshop. As in the previous editions, participants were challenged to develop systems that identify mentions and cluster them according to identity coreference. A key innovation of this year’s task was the introduction of a dedicated Large Language Model (LLM) track, featuring a simplified plaintext format designed to be more suitable for LLMs than the original CoNLL-U representation. The task also expanded its coverage with three new datasets in two additional languages, using version 1.3 of CorefUD – a harmonized multilingual collection of 22 datasets in 17 languages. In total, nine systems participated, including four LLM-based approaches (two fine-tuned and two using few-shot adaptation). While traditional systems still kept the lead, LLMs showed clear potential, suggesting they may soon challenge established approaches in future editions.
pdf
bib
abs
GLaRef@CRAC2025: Should we transform coreference resolution into a text generation task?
Olga Seminck
|
Antoine Bourgois
|
Yoann Dupont
|
Mathieu Dehouck
|
Marine Delaborde
We present the submissions of our team to the Unconstrained and LLM tracks of the Computational Models of Reference, Anaphora and Coreference (CRAC2025) shared task, where we finished in fifth and first place, respectively, with similar scores (average CoNLL-F1 scores of 61.57 and 62.96 on the test set) but very large differences in computational cost. Indeed, the classical pair-wise resolution system submitted to the Unconstrained track obtained similar performance but with less than 10% of the computational cost. Reflecting on this fact, we point out problems that we ran into using generative AI to perform coreference resolution. We explain how the framework of text generation stands in the way of a reliable text-global coreference representation. Nonetheless, we realize there are many potential improvements to our LLM system; we discuss them at the end of this article.
pdf
bib
abs
CorPipe at CRAC 2025: Evaluating Multilingual Encoders for Multilingual Coreference Resolution
Milan Straka
We present CorPipe 25, the winning entry to the CRAC 2025 Shared Task on Multilingual Coreference Resolution. This fourth iteration of the shared task introduces a new LLM track alongside the original unconstrained track, features reduced development and test sets to lower computational requirements, and includes additional datasets. CorPipe 25 represents a complete reimplementation of our previous systems, migrating from TensorFlow to PyTorch. Our system significantly outperforms all other submissions in both the LLM and unconstrained tracks by a substantial margin of 8 percentage points. The source code and trained models are publicly available at https://github.com/ufal/crac2025-corpipe.
pdf
bib
abs
Fine-Tuned Llama for Multilingual Text-to-Text Coreference Resolution
Jakub Hejman
|
Ondrej Prazak
|
Miloslav Konopík
This paper describes our approach to the CRAC 2025 Shared Task on Multilingual Coreference Resolution. We compete in the LLM track, where the systems are limited to generative text-to-text approaches. Our system is based on Llama 3.1-8B, fine-tuned to tag the document with coreference annotations. We have made one significant modification to the text format provided by the organizers: The model relies on the syntactic head for mention span representation. Additionally, we use joint pre-training, and we train the model to generate empty nodes. We provide an in-depth analysis of the performance of our models, which reveals several implementation problems. Although our system ended up in last place, we achieved the best performance on 10 datasets out of 22 within the track. By fixing the discovered problems in the post-evaluation phase, we improved our results substantially, outperforming all the systems in the LLM track and even some unconstrained track systems.
pdf
bib
abs
Few-Shot Coreference Resolution with Semantic Difficulty Metrics and In-Context Learning
Nguyen Xuan Phuc
|
Dang Van Thin
This paper presents our submission to the CRAC 2025 Shared Task on Multilingual Coreference Resolution in the LLM track. We propose a prompt-based few-shot coreference resolution system where the final inference is performed by Grok-3 using in-context learning. The core of our methodology is a difficulty-aware sample selection pipeline that leverages Gemini Flash 2.0 to compute semantic difficulty metrics, including mention dissimilarity and pronoun ambiguity. By identifying and selecting the most challenging training samples for each language, we construct highly informative prompts to guide Grok-3 in predicting coreference chains and reconstructing zero anaphora. Our approach secured 3rd place in the CRAC 2025 shared task.
pdf
bib
abs
Few-Shot Multilingual Coreference Resolution Using Long-Context Large Language Models
Moiz Sajid
|
Muhammad Fraz
|
Seemab Latif
|
Zuhair Zafar
In this work, we present our system, which ranked second in the CRAC 2025 Shared Task on Multilingual Coreference Resolution (LLM Track). For multilingual coreference resolution, our system mainly uses long-context large language models (LLMs) in a few-shot in-context learning setting. Among the various approaches we explored, few-shot prompting proved to be the most effective, particularly due to the complexity of the task and the availability of high-quality data with referential relationships provided as part of the competition. We employed Gemini 2.5 Pro, one of the best available closed-source long-context LLMs at the time of submission. Our system achieved a CoNLL F1 score of 61.74 on the mini-testset, demonstrating that performance improves significantly with the number of few-shot examples provided, thanks to the model’s extended context window. While this approach comes with trade-offs in terms of inference cost and response latency, it highlights the potential of long-context LLMs for tackling multilingual coreference without task-specific fine-tuning. Although direct comparisons with traditional supervised systems are not straightforward, our findings provide valuable insights and open avenues for future work, particularly in expanding support for low-resource languages.