Resource of Wikipedias in 31 Languages Categorized into Fine-Grained Named Entities

Satoshi Sekine, Kouta Nakayama, Masako Nomoto, Maya Ando, Asuka Sumida, Koji Matsuda


Abstract
This paper describes a resource of Wikipedias in 31 languages categorized into Extended Named Entity (ENE), which has 219 fine-grained NE categories. We first categorized 920K Japanese Wikipedia pages according to the ENE scheme using machine learning, followed by manual validation. We then organized a shared task of Wikipedia categorization in 30 languages. The training data were derived from the Japanese categorization and the language links, and the task was to categorize the Wikipedia pages in the 30 languages that have no language links from Japanese Wikipedia (20M pages in total). Thirteen groups with 24 systems participated in the 2020 and 2021 tasks, sharing their outputs for resource-building. The Japanese categorization accuracy was 98.5%, and the best performance among the 30 languages ranged from 80 to 93 in F-measure. Using ensemble learning, we created outputs with an average F-measure of 86.8, which is 1.7 points better than the best single systems. The total size of the resource is 32.5M pages, including the training data. We call this resource creation scheme “Resource by Collaborative Contribution (RbCC)”. We also constructed structuring tasks (attribute extraction and link prediction) using RbCC under our ongoing project, “SHINRA.”
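The two mechanisms the abstract mentions, building training data from language links and combining system outputs, can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which ensemble method was used, so plain majority voting is an assumption here, and the function names, page ids, and example labels are all hypothetical.

```python
from collections import Counter


def project_labels(ja_labels, language_links):
    """Copy categories from Japanese pages to linked target-language pages.

    ja_labels: dict of Japanese page id -> set of category labels
    language_links: dict of Japanese page id -> target-language page id
    Japanese pages without a categorization are skipped.
    """
    projected = {}
    for ja_id, target_id in language_links.items():
        if ja_id in ja_labels:
            projected[target_id] = set(ja_labels[ja_id])
    return projected


def majority_vote(system_outputs):
    """Combine several systems' predictions by simple majority voting.

    system_outputs: list of dicts, each mapping page id -> predicted label.
    Assumption: one label per page per system; the real task may allow
    multiple labels, and the real ensemble may be weighted or learned.
    """
    votes = {}
    for output in system_outputs:
        for page_id, label in output.items():
            votes.setdefault(page_id, Counter())[label] += 1
    # Pick the most frequent label for each page.
    return {page_id: counts.most_common(1)[0][0] for page_id, counts in votes.items()}


if __name__ == "__main__":
    # Hypothetical ids and labels, for illustration only.
    ja_labels = {"ja1": {"Person"}, "ja2": {"City"}}
    links_to_en = {"ja1": "en10", "ja3": "en30"}
    print(project_labels(ja_labels, links_to_en))

    outputs = [
        {"p1": "Person", "p2": "City"},
        {"p1": "Person", "p2": "Company"},
        {"p1": "City", "p2": "Company"},
    ]
    print(majority_vote(outputs))
```

In this sketch, pages reachable through a language link receive the Japanese page's categories as training labels, while pages without such a link are the ones left for systems to predict.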
Anthology ID:
2022.coling-1.331
Volume:
Proceedings of the 29th International Conference on Computational Linguistics
Month:
October
Year:
2022
Address:
Gyeongju, Republic of Korea
Venue:
COLING
Publisher:
International Committee on Computational Linguistics
Pages:
3769–3777
URL:
https://aclanthology.org/2022.coling-1.331
Cite (ACL):
Satoshi Sekine, Kouta Nakayama, Masako Nomoto, Maya Ando, Asuka Sumida, and Koji Matsuda. 2022. Resource of Wikipedias in 31 Languages Categorized into Fine-Grained Named Entities. In Proceedings of the 29th International Conference on Computational Linguistics, pages 3769–3777, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
Cite (Informal):
Resource of Wikipedias in 31 Languages Categorized into Fine-Grained Named Entities (Sekine et al., COLING 2022)
PDF:
https://aclanthology.org/2022.coling-1.331.pdf
Data
FIGER