PPORTAL_ner: An Annotated Corpus of Portuguese Literary Entities

Mariana O. Silva, Mirella M. Moro


Abstract
The intersection of natural language processing (NLP) and literary analysis has yielded valuable insights and applications across various languages. However, the scarcity of labeled data tailored for Portuguese literary texts poses a notable challenge. To address this gap, we present the PPORTAL_ner corpus, an annotated dataset that simplifies the development of Named Entity Recognition (NER) models specifically adapted for Portuguese literary works. Our corpus includes annotations of PER, LOC, GPE, ORG, and DATE entities within a diverse set of 25 literary texts. Annotation of the corpus involved a two-step process: initial pre-annotation using a pre-trained spaCy model followed by correction and refinement using the Prodigy annotation tool. With a total of 125,059 tokens and 5,266 annotated entities, the PPORTAL_ner corpus significantly enriches the landscape of resources available for computational literary analysis in Portuguese. This paper details the annotation methodology, guidelines, and dataset statistics while also evaluating four NER models over the PPORTAL_ner corpus. Our evaluation reveals that fine-tuning on domain-specific data significantly improves NER model performance, demonstrating the value of the PPORTAL_ner corpus for developing domain-specific language models.
Anthology ID:
2024.lrec-main.1132
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
Publisher:
ELRA and ICCL
Pages:
12927–12937
URL:
https://aclanthology.org/2024.lrec-main.1132
Cite (ACL):
Mariana O. Silva and Mirella M. Moro. 2024. PPORTAL_ner: An Annotated Corpus of Portuguese Literary Entities. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 12927–12937, Torino, Italia. ELRA and ICCL.
Cite (Informal):
PPORTAL_ner: An Annotated Corpus of Portuguese Literary Entities (Silva & Moro, LREC-COLING 2024)
PDF:
https://aclanthology.org/2024.lrec-main.1132.pdf
Optional supplementary material:
 2024.lrec-main.1132.OptionalSupplementaryMaterial.zip