Entity Insertion in Multilingual Linked Corpora: The Case of Wikipedia

Tomás Feith, Akhil Arora, Martin Gerlach, Debjit Paul, Robert West


Abstract
Links are a fundamental part of information networks, turning isolated pieces of knowledge into a network of information that is much richer than the sum of its parts. However, adding a new link to the network is not trivial: it requires not only the identification of a suitable pair of source and target entities but also the understanding of the content of the source to locate a suitable position for the link in the text. The latter problem has not been addressed effectively, particularly in the absence of text spans in the source that could serve as anchors to insert a link to the target entity. To bridge this gap, we introduce and operationalize the task of entity insertion in information networks. Focusing on the case of Wikipedia, we empirically show that this problem is both relevant and challenging for editors. We compile a benchmark dataset in 105 languages and develop a framework for entity insertion called LocEI (Localized Entity Insertion) and its multilingual variant XLocEI. We show that XLocEI outperforms all baseline models (including state-of-the-art prompt-based ranking with LLMs such as GPT-4) and that it can be applied in a zero-shot manner on languages not seen during training with minimal performance drop. These findings are important for applying entity insertion models in practice, e.g., to support editors in adding links across the more than 300 language versions of Wikipedia.
Anthology ID:
2024.emnlp-main.1268
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
22796–22819
URL:
https://aclanthology.org/2024.emnlp-main.1268
Cite (ACL):
Tomás Feith, Akhil Arora, Martin Gerlach, Debjit Paul, and Robert West. 2024. Entity Insertion in Multilingual Linked Corpora: The Case of Wikipedia. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 22796–22819, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Entity Insertion in Multilingual Linked Corpora: The Case of Wikipedia (Feith et al., EMNLP 2024)
PDF:
https://aclanthology.org/2024.emnlp-main.1268.pdf
Software:
2024.emnlp-main.1268.software.zip