Mirella M. Moro
2024
Evaluating Pre-training Strategies for Literary Named Entity Recognition in Portuguese
Mariana O. Silva
|
Mirella M. Moro
Proceedings of the 16th International Conference on Computational Processing of Portuguese - Vol. 1
PPORTAL_ner: An Annotated Corpus of Portuguese Literary Entities
Mariana O. Silva
|
Mirella M. Moro
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
The intersection of natural language processing (NLP) and literary analysis has yielded valuable insights and applications across various languages. However, the scarcity of labeled data tailored for Portuguese literary texts poses a notable challenge. To address this gap, we present the PPORTAL_ner corpus, an annotated dataset that simplifies the development of Named Entity Recognition (NER) models specifically adapted for Portuguese literary works. Our corpus includes annotations of PER, LOC, GPE, ORG, and DATE entities within a diverse set of 25 literary texts. Annotation of the corpus involved a two-step process: initial pre-annotation using a pre-trained spaCy model followed by correction and refinement using the Prodigy annotation tool. With a total of 125,059 tokens and 5,266 annotated entities, PPORTAL_ner corpus significantly enriches the landscape of resources available for computational literary analysis in Portuguese. This paper details the annotation methodology, guidelines, and dataset statistics while also evaluating four NER models over the PPORTAL_ner corpus. Our evaluation analysis reveals that fine-tuning on domain-specific data significantly improves NER model performance, demonstrating the value of the PPORTAL_ner corpus for developing domain-specific language models.
Search