Luciana Bencke

2026

Data Augmentation for Named Entity Recognition in Domain-Specific Scenarios in Portuguese
Higor Moreira | Patricia Ferreira da Silva | Luciana Bencke | Viviane Moreira
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1

Named Entity Recognition (NER) is an important task of Natural Language Processing. Achieving good results in this task usually requires a large amount of labeled data to train models. This is especially difficult in domain-specific datasets and low-resourced languages. To mitigate the high cost of human-annotated data, data augmentation can be used. In this work, we evaluate Data Augmentation techniques for NER, focusing on domain-specific datasets in Portuguese.We employed augmentation techniques based on rules, back-translation, and large language models on four datasets of varying sizes to train Transformer-based NER models.The results showed that most techniques improved over the baseline, with the best results achieved using PP-LLM, SR, and MR.

2024

pdf bib abs

InferBR: A Natural Language Inference Dataset in Portuguese
Luciana Bencke | Francielle Vasconcellos Pereira | Moniele Kunrath Santos | Viviane Moreira
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Natural Language Inference semantic concepts are central to all aspects of natural language meaning. Portuguese has few NLI-annotated datasets created through automatic translation followed by manual checking. The manual creation of NLI datasets is complex and requires many efforts that are sometimes unavailable. Thus, investments to produce good quality synthetic instances that could be used to train machine learning models for NLI are welcome. This work produced InferBR, an NLI dataset for Portuguese. We relied on a semiautomatic process to generate premises and an automatic process to generate hypotheses. The dataset was manually revised, showing that 97.4% of the sentence pairs had good quality, and nearly 100% of the instances had the correct label assigned. The model trained with InferBR is better at recognizing entailment classes in the other Portuguese datasets than the reverse. Because of its diversity and many unique sentences, InferBR can potentially be further augmented. In addition to the dataset, a key contribution is our proposed generation processes for premises and hypotheses that can easily be adapted to other languages and tasks.

Co-authors

Venues

Fix author