MCECR: A Novel Dataset for Multilingual Cross-Document Event Coreference Resolution

Amir Pouran Ben Veyseh, Viet Lai, Chien Nguyen, Franck Dernoncourt, Thien Nguyen


Abstract
Event coreference resolution (ECR) is a critical information extraction task in natural language processing that aims to identify and link event mentions across multiple documents. Despite recent progress, existing ECR datasets focus primarily on within-document event coreference and on English text, so cross-document ECR datasets for languages other than English are lacking. To address this issue, this work presents the first multilingual dataset for cross-document ECR, called MCECR (Multilingual Cross-Document Event Coreference Resolution), which manually annotates a diverse collection of documents for event mentions and coreference in five languages: English, Spanish, Hindi, Turkish, and Ukrainian. Using articles sampled from Wikinews across various topics as seeds, we retrieve related news articles through the Google search engine to increase the number of non-singleton event clusters. In total, we annotate 5,802 news articles, providing a substantial and varied dataset for multilingual ECR in both within-document and cross-document scenarios. Extensive analysis of the proposed dataset reveals the challenging nature of multilingual event coreference resolution, promoting MCECR as a strong benchmark dataset for future research in this area.
Anthology ID:
2024.findings-naacl.245
Volume:
Findings of the Association for Computational Linguistics: NAACL 2024
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Kevin Duh, Helena Gomez, Steven Bethard
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
3869–3880
URL:
https://aclanthology.org/2024.findings-naacl.245
Cite (ACL):
Amir Pouran Ben Veyseh, Viet Lai, Chien Nguyen, Franck Dernoncourt, and Thien Nguyen. 2024. MCECR: A Novel Dataset for Multilingual Cross-Document Event Coreference Resolution. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 3869–3880, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
MCECR: A Novel Dataset for Multilingual Cross-Document Event Coreference Resolution (Pouran Ben Veyseh et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-naacl.245.pdf
Copyright:
2024.findings-naacl.245.copyright.pdf