ParaNames 1.0: Creating an Entity Name Corpus for 400+ Languages Using Wikidata

Jonne Sälevä, Constantine Lignos


Abstract
We introduce ParaNames, a massively multilingual parallel name resource consisting of 140 million names spanning over 400 languages. Names are provided for 16.8 million entities, and each entity is mapped from a complex type hierarchy to a standard type (PER/LOC/ORG). Using Wikidata as a source, we create the largest resource of this type to date. We describe our approach to filtering and standardizing the data to provide the best quality possible. ParaNames is useful for multilingual language processing, both in defining tasks for name translation/transliteration and as supplementary data for tasks such as named entity recognition and linking. We demonstrate the usefulness of ParaNames on two tasks. First, we perform canonical name translation between English and 17 other languages. Second, we use it as a gazetteer for multilingual named entity recognition, obtaining performance improvements on all 10 languages evaluated.
Anthology ID:
2024.lrec-main.1103
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
Publisher:
ELRA and ICCL
Pages:
12599–12610
URL:
https://aclanthology.org/2024.lrec-main.1103
Cite (ACL):
Jonne Sälevä and Constantine Lignos. 2024. ParaNames 1.0: Creating an Entity Name Corpus for 400+ Languages Using Wikidata. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 12599–12610, Torino, Italia. ELRA and ICCL.
Cite (Informal):
ParaNames 1.0: Creating an Entity Name Corpus for 400+ Languages Using Wikidata (Sälevä & Lignos, LREC-COLING 2024)
PDF:
https://aclanthology.org/2024.lrec-main.1103.pdf