Mariana O. Silva


2024

Evaluating Pre-training Strategies for Literary Named Entity Recognition in Portuguese
Mariana O. Silva | Mirella M. Moro
Proceedings of the 16th International Conference on Computational Processing of Portuguese - Vol. 1

PPORTAL_ner: An Annotated Corpus of Portuguese Literary Entities
Mariana O. Silva | Mirella M. Moro
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

The intersection of natural language processing (NLP) and literary analysis has yielded valuable insights and applications across various languages. However, the scarcity of labeled data tailored for Portuguese literary texts poses a notable challenge. To address this gap, we present the PPORTAL_ner corpus, an annotated dataset that simplifies the development of Named Entity Recognition (NER) models specifically adapted for Portuguese literary works. Our corpus includes annotations of PER, LOC, GPE, ORG, and DATE entities within a diverse set of 25 literary texts. Annotation of the corpus involved a two-step process: initial pre-annotation using a pre-trained spaCy model, followed by correction and refinement using the Prodigy annotation tool. With a total of 125,059 tokens and 5,266 annotated entities, the PPORTAL_ner corpus significantly enriches the landscape of resources available for computational literary analysis in Portuguese. This paper details the annotation methodology, guidelines, and dataset statistics while also evaluating four NER models on the PPORTAL_ner corpus. Our evaluation reveals that fine-tuning on domain-specific data significantly improves NER model performance, demonstrating the value of the PPORTAL_ner corpus for developing domain-specific language models.
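The abstract above does not say which metric was used to compare the four NER models. As a minimal sketch, assuming the common choice of exact-match, entity-level precision, recall, and F1, the comparison of predicted spans against gold spans could look like the following. The function name `entity_f1`, the span layout, and the toy spans are all hypothetical and are not taken from the paper or the corpus.

```python
from typing import Set, Tuple

# Hypothetical span layout: (start_token, end_token, label).
Span = Tuple[int, int, str]


def entity_f1(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match entity-level precision, recall, and F1.

    A predicted span counts as correct only if its boundaries and its
    label both match a gold span. This is an assumed metric, not
    necessarily the one reported for PPORTAL_ner.
    """
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy spans using the corpus label set (PER, LOC, GPE, ORG, DATE).
gold = {(0, 2, "PER"), (5, 6, "LOC"), (9, 10, "DATE")}
pred = {(0, 2, "PER"), (5, 6, "GPE"), (9, 10, "DATE")}  # LOC mislabeled as GPE

precision, recall, f1 = entity_f1(gold, pred)
```

In this toy case the mislabeled LOC/GPE span counts as both a false positive and a false negative, which is why confusable labels such as LOC and GPE tend to weigh heavily on exact-match scores.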

Unsupervised Grouping of Public Procurement Similar Items: Which Text Representation Should I Use?
Pedro P. V. Brum | Mariana O. Silva | Gabriel P. Oliveira | Lucas G. L. Costa | Anisio Lacerda | Gisele Pappa
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

In public procurement, establishing reference prices is essential to guide competitors in setting product prices. Grouping purchased products, which are not standardized by default, is necessary to estimate reference prices. Text clustering techniques can be used to group similar items based on their descriptions, enabling the definition of reference prices for specific products or services. However, selecting an appropriate representation for text is challenging. This paper introduces a framework for text cleaning, extraction, and representation. We test eight distinct sentence representations tailored for public procurement item descriptions. Among these representations, we propose an approach that captures the most important components of item descriptions. Through extensive evaluation on a dataset comprising over 2 million items, our findings show that using sophisticated supervised methods to derive vectors for unsupervised tasks offers little advantage over leveraging unsupervised methods. Our results also highlight that domain-specific contextual knowledge is crucial for representation improvement.
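The idea of grouping similar item descriptions and deriving a reference price per group can be illustrated with a small sketch. It assumes a plain bag-of-words representation, a greedy cosine-similarity threshold in place of a real clustering algorithm, and the median as the reference price. None of these choices is taken from the paper, and the item descriptions and prices are invented for illustration.

```python
from collections import Counter
from math import sqrt
from statistics import median


def bag_of_words(text: str) -> Counter:
    """Lowercased whitespace tokens with their counts (assumed representation)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def group_items(items, threshold=0.5):
    """Greedy grouping: attach each item to the first group whose
    representative (its first member) is similar enough, else open a new group."""
    groups = []
    for description, price in items:
        vector = bag_of_words(description)
        for group in groups:
            if cosine(vector, group["rep"]) >= threshold:
                group["items"].append((description, price))
                break
        else:
            groups.append({"rep": vector, "items": [(description, price)]})
    return groups


# Hypothetical procurement items: (description, unit price).
items = [
    ("blue ballpoint pen box 50 units", 30.0),
    ("ballpoint pen blue box 50 units", 34.0),
    ("office chair with armrest", 400.0),
    ("office chair armrest black", 420.0),
]

groups = group_items(items)
# One reference price per group, here taken as the median of member prices.
reference_prices = [median(price for _, price in g["items"]) for g in groups]
```

Swapping `bag_of_words` for another representation while keeping the grouping step fixed is the kind of comparison the abstract describes, although the paper's actual framework and clustering method may differ from this sketch.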