Data Augmentation for Named Entity Recognition in Domain-Specific Scenarios in Portuguese

Higor Moreira, Patricia Ferreira da Silva, Luciana Bencke, Viviane Moreira


Abstract
Named Entity Recognition (NER) is an important task in Natural Language Processing. Achieving good results in this task usually requires a large amount of labeled data to train models, which is especially hard to obtain for domain-specific datasets and low-resource languages. Data augmentation can mitigate the high cost of human annotation. In this work, we evaluate data augmentation techniques for NER, focusing on domain-specific datasets in Portuguese. We applied augmentation techniques based on rules, back-translation, and large language models to four datasets of varying sizes and used the augmented data to train Transformer-based NER models. The results showed that most techniques improved over the baseline, with the best results achieved using PP-LLM, SR, and MR.
Anthology ID:
2026.propor-1.25
Volume:
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1
Month:
April
Year:
2026
Address:
Salvador, Brazil
Editors:
Marlo Souza, Iria de-Dios-Flores, Diana Santos, Larissa Freitas, Jackson Wilke da Cruz Souza, Eugénio Ribeiro
Venue:
PROPOR
Publisher:
Association for Computational Linguistics
Pages:
250–259
URL:
https://aclanthology.org/2026.propor-1.25/
Cite (ACL):
Higor Moreira, Patricia Ferreira da Silva, Luciana Bencke, and Viviane Moreira. 2026. Data Augmentation for Named Entity Recognition in Domain-Specific Scenarios in Portuguese. In Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1, pages 250–259, Salvador, Brazil. Association for Computational Linguistics.
Cite (Informal):
Data Augmentation for Named Entity Recognition in Domain-Specific Scenarios in Portuguese (Moreira et al., PROPOR 2026)
PDF:
https://aclanthology.org/2026.propor-1.25.pdf