Viviane P. Moreira


2026

Questions and answers are among the most fundamental forms of human communication. Question Answering (QA) is the task of correctly generating answers based on a context. To assess success on this task, generated answers are typically evaluated with traditional metrics such as BLEU, ROUGE, and METEOR. However, these metrics often fail to reflect the actual quality of the outputs. More recently, new evaluation metrics and the LLM-as-a-judge paradigm have also been applied to QA evaluation. To gain a deeper understanding of the capabilities and limitations of QA metrics, this work performs a comparative analysis of both traditional and more recent approaches to QA evaluation. Experiments were conducted on the Pirá dataset (in Portuguese) using four LLMs to generate answers. Additionally, human evaluation was performed to assess aspects such as correctness, completeness, clarity, and relevance of the generated content. We demonstrate that lexical metrics are limited for evaluating QA. We also observed that human evaluators favor models that provide higher information density, even when this contradicts prompt constraints, whereas lexical metrics penalize this verbosity. This divergence confirms that traditional metrics are insufficient for capturing the trade-off between instruction adherence and the semantic richness valued by native speakers.
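The verbosity penalty described above follows directly from how lexical metrics are computed. As an illustration (not the paper's evaluation code), the minimal unigram-overlap F1 below, a simplified ROUGE-1 variant with made-up example sentences, shows how extra correct-but-unreferenced content lowers precision and thus the score:

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 (a minimal ROUGE-1-style score)."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "the amazon rainforest regulates the climate"
concise = "the amazon rainforest regulates the climate"
verbose = ("the amazon rainforest regulates the climate and also hosts "
           "an enormous diversity of plant and animal species")

print(rouge1_f1(reference, concise))  # 1.0
print(rouge1_f1(reference, verbose))  # lower: added detail hurts precision
```

A human judge might prefer the verbose answer for its information density, while this metric ranks it strictly below the concise one, which is exactly the divergence observed in the study.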
Negation plays a fundamental role in human communication and logical reasoning, yet it remains underrepresented in natural language inference (NLI) datasets. This work investigates the impact of targeted data augmentation using negation cues on the main NLI datasets for Portuguese (InferBR, ASSIN, and ASSIN2). By synthetically generating new instances with negated hypotheses, we create more diverse training and test sets. A BERT-based model was fine-tuned and tested on the combined datasets and augmented data. The results show that the model was heavily influenced by the bias in the use of negation, and that increasing data diversity improves the model's handling of negation.
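To make the augmentation idea concrete, here is a deliberately naive sketch (not the paper's actual procedure) that inserts the Portuguese negation cue "não" into a hypothesis and flips an entailment label to contradiction; the token position and the label flip are simplifying assumptions, since correct cue placement and label assignment would require syntactic analysis:

```python
NEGATION_CUE = "não"

def negate_hypothesis(premise: str, hypothesis: str, label: str):
    """Naively negate a Portuguese hypothesis by inserting the cue
    after the first token (a rough stand-in for pre-verbal position)
    and flipping entailment to contradiction. Illustrative only."""
    tokens = hypothesis.split()
    negated = " ".join(tokens[:1] + [NEGATION_CUE] + tokens[1:])
    new_label = "contradiction" if label == "entailment" else label
    return premise, negated, new_label

example = negate_hypothesis(
    "Um homem corre na praia.",   # premise
    "Alguém está correndo.",      # hypothesis entailed by the premise
    "entailment",
)
print(example)  # negated hypothesis: "Alguém não está correndo."
```

Even a simple generator like this diversifies where and how negation appears, which is the property the augmented training sets exploit.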

2024

2020

This paper describes the system submitted by our team (BabelEnconding) to SemEval-2020 Task 3: Predicting the Graded Effect of Context in Word Similarity. We propose an approach that relies on translation and multilingual language models to compute the contextual similarity between pairs of words. Our hypothesis is that evidence from additional languages can improve the correlation with the human-generated scores. BabelEnconding was applied to both subtasks, ranked among the top 3 in six out of eight task/language combinations, and was the highest-scoring system three times.
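The core idea of pooling evidence across languages can be sketched as follows. This is an illustration with toy vectors, not the submitted system: the embeddings below are stand-ins for contextual vectors a multilingual language model would produce for a word pair, one similarity score per language, combined here with a simple mean (the actual combination strategy is an assumption):

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy contextual embeddings for one word pair in three languages
# (hypothetical values; a real system would obtain these from a
# multilingual LM after translating the context).
scores_by_language = {
    "en": cosine([0.9, 0.1, 0.0], [0.8, 0.2, 0.1]),
    "pt": cosine([0.7, 0.3, 0.1], [0.6, 0.4, 0.0]),
    "fr": cosine([0.8, 0.2, 0.2], [0.9, 0.1, 0.1]),
}

# Aggregate the per-language evidence into a single similarity score.
combined = sum(scores_by_language.values()) / len(scores_by_language)
print(round(combined, 3))
```

Averaging per-language scores is one straightforward way to let agreement across languages reinforce a similarity judgment while noise in any single language is dampened.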