Viviane Moreira

2026

Data Augmentation for Named Entity Recognition in Domain-Specific Scenarios in Portuguese
Higor Moreira | Patricia Ferreira da Silva | Luciana Bencke | Viviane Moreira
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1

Named Entity Recognition (NER) is an important task of Natural Language Processing. Achieving good results in this task usually requires a large amount of labeled data to train models. This is especially difficult in domain-specific datasets and low-resourced languages. To mitigate the high cost of human-annotated data, data augmentation can be used. In this work, we evaluate Data Augmentation techniques for NER, focusing on domain-specific datasets in Portuguese. We employed augmentation techniques based on rules, back-translation, and large language models on four datasets of varying sizes to train Transformer-based NER models. The results showed that most techniques improved over the baseline, with the best results achieved using PP-LLM, SR, and MR.

2024

Evaluation of Question Answer Generation for Portuguese: Insights and Datasets
Felipe Paula | Cassiana Roberta Lizzoni Michelin | Viviane Moreira
Findings of the Association for Computational Linguistics: EMNLP 2024

Automatic question generation is an increasingly important task that can be applied in different settings, including educational purposes, data augmentation for question-answering (QA), and conversational systems. More specifically, we focus on question answer generation (QAG), which produces question-answer pairs given an input context. We adapt and apply QAG approaches to generate question-answer pairs for different domains and assess their capacity to generate accurate, diverse, and abundant question-answer pairs. Our analyses combine both qualitative and quantitative evaluations that allow insights into the quality and types of errors made by QAG methods. We also look into strategies for error filtering and their effects. Our work concentrates on Portuguese, a widely spoken language that is underrepresented in natural language processing research. To address the pressing need for resources, we generate and make available human-curated extractive QA datasets in three diverse domains.

InferBR: A Natural Language Inference Dataset in Portuguese
Luciana Bencke | Francielle Vasconcellos Pereira | Moniele Kunrath Santos | Viviane Moreira
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Natural Language Inference semantic concepts are central to all aspects of natural language meaning. Portuguese has few NLI-annotated datasets created through automatic translation followed by manual checking. The manual creation of NLI datasets is complex and requires considerable effort that is not always available. Thus, investments to produce good-quality synthetic instances that could be used to train machine learning models for NLI are welcome. This work produced InferBR, an NLI dataset for Portuguese. We relied on a semiautomatic process to generate premises and an automatic process to generate hypotheses. The dataset was manually revised, showing that 97.4% of the sentence pairs had good quality, and nearly 100% of the instances had the correct label assigned. The model trained with InferBR is better at recognizing entailment classes in the other Portuguese datasets than the reverse. Because of its diversity and many unique sentences, InferBR can potentially be further augmented. In addition to the dataset, a key contribution is our proposed generation processes for premises and hypotheses that can easily be adapted to other languages and tasks.

2023

Team INF-UFRGS at SemEval-2023 Task 7: Supervised Contrastive Learning for Pair-level Sentence Classification and Evidence Retrieval
Abel Corrêa Dias | Filipe Dias | Higor Moreira | Viviane Moreira | João Luiz Comba
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

This paper describes the EvidenceSCL system submitted by our team (INF-UFRGS) to SemEval-2023 Task 7: Multi-Evidence Natural Language Inference for Clinical Trial Data (NLI4CT). NLI4CT is divided into two tasks, one for determining the inference relation between a pair of statements in clinical trials and a second for retrieving a set of supporting facts from the premises necessary to justify the label predicted in the first task. Our approach uses pair-level supervised contrastive learning to classify pairs of sentences. We trained EvidenceSCL on two datasets created from NLI4CT and additional data from other NLI datasets. We show that our approach can address both goals of NLI4CT, and although it reached an intermediate position, there is room for improvement in the technique.

2022

INF-UFRGS at SemEval-2022 Task 5: analyzing the performance of multimodal models
Gustavo Lorentz | Viviane Moreira
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

This paper describes the INF-UFRGS submission to SemEval-2022 Task 5: Multimodal Automatic Misogyny Identification (MAMI). Unprecedented levels of harassment came with the ever-growing use of the internet as a means of worldwide communication. The goal of the task is to improve the quality of existing methods for misogyny identification, many of which require dedicated personnel, hence the need for automation. We experimented with five existing models, including ViLBERT and Visual BERT (both unimodally and multimodally pretrained) and MMBT. The datasets consist of memes with captions in English. The results show that all models achieved Macro-F1 scores above 0.64. ViLBERT was the best performer with a score of 0.698.

UFRGSent at SemEval-2022 Task 10: Structured Sentiment Analysis using a Question Answering Model
Lucas Pessutto | Viviane Moreira
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

This paper describes the system submitted by our team (UFRGSent) to SemEval-2022 Task 10: Structured Sentiment Analysis. We propose a multilingual approach that relies on a Question Answering model to find tuples consisting of aspect, opinion, and holder. The approach starts from general questions and uses the extracted tuple elements to find the remaining components. Finally, we employ an aspect sentiment classification model to classify the polarity of the entire tuple. Although our method reached a mid-rank position in the SemEval competition, we show that the question-answering approach can achieve good coverage when retrieving sentiment tuples, leaving room for improvements in the technique.

2020

Offensive Video Detection: Dataset and Baseline Results
Cleber Alcântara | Viviane Moreira | Diego Feijo
Proceedings of the Twelfth Language Resources and Evaluation Conference

Web users produce and publish high volumes of data of various types, such as text, images, and videos. The platforms try to restrain their users from publishing offensive content to keep a friendly and respectful environment and rely on moderators to filter the posts. However, this method is insufficient due to the high volume of publications. The identification of offensive material can be performed automatically using machine learning, which needs annotated datasets. Among the published datasets on this topic, the Portuguese language is underrepresented, and videos are little explored. We investigated the problem of offensive video detection by assembling and publishing a dataset of videos in Portuguese containing mostly textual features. We ran experiments using popular machine learning classifiers used in this domain and reported our findings, alongside multiple evaluation metrics. We found that using word embedding with Deep Learning classifiers achieved the best results on average. CNN architectures, Naive Bayes, and Random Forest ranked top among different experiments. Transfer Learning models outperformed Classic algorithms when processing video transcriptions, but scored lower using other feature sets. These findings can be used as a baseline for future work on this subject.

This work focuses on Portuguese Named Entity Recognition (NER) in the Geology domain. The only domain-specific dataset in the Portuguese language annotated for NER is the GeoCorpus. Our approach relies on BiLSTM-CRF neural networks (a widely used type of network for this area of research) that use vector and tensor embedding representations. Three types of embedding models were used (Word Embeddings, Flair Embeddings, and Stacked Embeddings) under two versions (domain-specific and generalized). The domain-specific Flair Embeddings model was originally trained with a generalized context in mind, but was then fine-tuned with domain-specific Oil and Gas corpora, as there were simply not enough domain corpora to properly train such a model. Each of these embeddings was evaluated separately, as well as stacked with another embedding. Finally, we achieved state-of-the-art results for this domain with one of our embeddings, and we performed an error analysis on the language model that achieved the best results. Furthermore, we investigated the effects of domain-specific versus generalized embeddings.

2019

Summarizing Legal Rulings: Comparative Experiments
Diego Feijo | Viviane Moreira
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

In the context of text summarization, texts in the legal domain have peculiarities related to their length and to their specialized vocabulary. Recent neural network-based approaches can achieve high-quality scores for text summarization. However, these approaches have been used mostly for generating very short abstracts for news articles. Thus, their applicability to the legal domain remains an open issue. In this work, we experimented with ten extractive and four abstractive models in a real dataset of legal rulings. These models were compared with an extractive baseline based on heuristics to select the most relevant parts of the text. Our results show that abstractive approaches significantly outperform extractive methods in terms of ROUGE scores.

Comparable corpora have been used as an alternative to parallel corpora as resources for computational tasks that involve domain-specific natural language processing. One way to gather documents related to a specific topic of interest is to traverse a portion of the web graph in a targeted way, using focused crawling algorithms. In this paper, we compare several focused crawling algorithms by using them to collect comparable corpora on a specific domain. Then, we compare the evaluation of the focused crawling algorithms to the performance of linguistic processes executed after training with the corresponding generated corpora. Also, we propose a novel approach for focused crawling, exploiting the expressive power of multiword expressions.

2011

Identification and Treatment of Multiword Expressions Applied to Information Retrieval
Otavio Acosta | Aline Villavicencio | Viviane Moreira
Proceedings of the Workshop on Multiword Expressions: from Parsing and Generation to the Real World
