A Dataset for Named Entity Recognition and Entity Linking in Chinese Historical Newspapers

Baptiste Blouin, Cécile Armand, Christian Henriot


Abstract
In this study, we present a novel historical Chinese dataset for named entity recognition, entity linking, coreference and entity relations. We use data from Chinese newspapers published between 1872 and 1949 and multilingual bibliographic resources from the same period. The period and the language are the main strengths of the present work: the resource covers different styles and language uses, and it is the largest manually annotated historical Chinese NER dataset from this transitional period. After detailing the selection and annotation process, we present the first results obtained from this dataset. Texts and annotations are freely downloadable from the GitHub repository.
Anthology ID:
2024.lrec-main.35
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
Publisher:
ELRA and ICCL
Pages:
385–394
URL:
https://aclanthology.org/2024.lrec-main.35
Cite (ACL):
Baptiste Blouin, Cécile Armand, and Christian Henriot. 2024. A Dataset for Named Entity Recognition and Entity Linking in Chinese Historical Newspapers. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 385–394, Torino, Italia. ELRA and ICCL.
Cite (Informal):
A Dataset for Named Entity Recognition and Entity Linking in Chinese Historical Newspapers (Blouin et al., LREC-COLING 2024)
PDF:
https://aclanthology.org/2024.lrec-main.35.pdf