Viviane P. Moreira
2026
The Inadequacy of Automatic Evaluation Metrics in Question Answering: A Case-Study in Portuguese
Júlia da Rocha Junqueira | Viviane P. Moreira
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1
Questions and answers are among the most fundamental forms of human communication. Question Answering (QA) is the task of generating correct answers from a given context. To assess success on this task, answers are typically evaluated with traditional metrics such as BLEU, ROUGE, and METEOR. However, these metrics often fail to reflect the actual quality of the outputs. More recently, new evaluation metrics and the LLM-as-a-judge paradigm have also been applied to QA evaluation. To gain a deeper understanding of the capabilities and limitations of QA metrics, this work performs a comparative analysis of both traditional and more recent approaches to QA evaluation. Experiments were conducted on the Pirá dataset (in Portuguese) using four LLMs to generate answers. Additionally, human evaluation was performed to assess aspects such as correctness, completeness, clarity, and relevance of the generated content. We demonstrate that lexical metrics are limited in evaluating QA. We also observed that human evaluators favor models that provide higher information density, even when this contradicts prompt constraints, whereas lexical metrics penalize this verbosity. This divergence confirms that traditional metrics are insufficient for capturing the trade-off between instruction adherence and the semantic richness valued by native speakers.
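The limitation described in the abstract can be illustrated with a toy token-overlap score. The sketch below uses a simple unigram F1 (a stand-in, not the exact BLEU/ROUGE/METEOR implementations used in the paper) to show how a correct but verbose answer is penalized relative to a terse one:

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between a predicted and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

reference = "no Oceano Atlantico"
concise = "no Oceano Atlantico"  # matches the gold answer exactly
verbose = ("a bacia fica localizada no Oceano Atlantico "
           "na margem continental do Brasil")  # also correct, but penalized

print(token_f1(concise, reference))  # 1.0
print(token_f1(verbose, reference))  # well below 1.0, despite being correct
```

Semantic metrics or an LLM judge would rate both answers as correct; the lexical score drops purely because of the extra (accurate) tokens.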
Negation-Aware Data Augmentation for Portuguese Natural Language Inference
Maria Cecília M. Corrêa | Felipe S. F. Paula | Matheus Westhelle | Viviane P. Moreira
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1
Negation plays a fundamental role in human communication and logical reasoning, yet it remains underrepresented in natural language inference (NLI) datasets. This work investigates the impact of targeted data augmentation using negation cues on the main NLI datasets for Portuguese (InferBR, ASSIN, and ASSIN2). By synthetically generating new instances with negated hypotheses, we create more diverse training and test sets. A BERT-based model was fine-tuned and tested on the combined datasets and augmented data. The results show that the model was heavily influenced by negation-related bias in the data, and that increased data diversity improves the model’s handling of negation.
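A minimal sketch of the augmentation idea: negate the hypothesis and adjust the label. The rule-based negation below (inserting "não" before a verb from a tiny hard-coded list) and the label-flipping rule are deliberate simplifications for illustration, not the paper's actual generation procedure:

```python
# Toy negation-based augmentation for Portuguese NLI pairs.
VERBS = {"é", "está", "tem", "foi", "comprou", "gosta"}

def negate_hypothesis(hypothesis: str) -> str:
    """Insert 'não' before the first recognized verb form."""
    tokens = hypothesis.split()
    for i, tok in enumerate(tokens):
        if tok.lower() in VERBS:
            return " ".join(tokens[:i] + ["não"] + tokens[i:])
    return hypothesis  # no known verb found: leave unchanged

def augment(premise: str, hypothesis: str, label: str):
    """Create an extra instance with a negated hypothesis.

    Simplifying assumption: negating an entailed hypothesis yields a
    contradiction; other labels are kept as-is here.
    """
    negated = negate_hypothesis(hypothesis)
    new_label = "contradiction" if label == "entailment" else label
    return premise, negated, new_label

print(augment("O menino comprou um livro.",
              "O menino comprou algo.", "entailment"))
# ('O menino comprou um livro.', 'O menino não comprou algo.', 'contradiction')
```

In practice the generated instances would be filtered for grammaticality and label validity before being mixed into training and test sets.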
2024
Beyond Single Models: Leveraging LLM Ensembles for Human Value Detection in Text
Diego Dimer Rodrigues | Mariana Recamonde-Mendoza | Viviane P. Moreira
Proceedings of the 15th Brazilian Symposium in Information and Human Language Technology
2020
BabelEnconding at SemEval-2020 Task 3: Contextual Similarity as a Combination of Multilingualism and Language Models
Lucas Rafael Costella Pessutto | Tiago de Melo | Viviane P. Moreira | Altigran da Silva
Proceedings of the Fourteenth Workshop on Semantic Evaluation
This paper describes the system submitted by our team (BabelEnconding) to SemEval-2020 Task 3: Predicting the Graded Effect of Context in Word Similarity. We propose an approach that relies on translation and multilingual language models to compute the contextual similarity between pairs of words. Our hypothesis is that evidence from additional languages can improve the correlation with the human-generated scores. BabelEnconding was applied to both subtasks, ranked among the top-3 in six out of eight task/language combinations, and was the highest-scoring system three times.
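The combination step can be sketched as averaging per-language similarity scores. The vectors below are illustrative stand-ins for contextual embeddings produced by a multilingual language model over the original and translated contexts; the aggregation by plain averaging is an assumption for this sketch, not the system's exact scoring function:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def multilingual_similarity(embeddings_by_language):
    """Average contextual similarity over several languages.

    Maps each language code to the pair of contextual embeddings of the
    two target words in that language's (translated) context.
    """
    scores = [cosine(u, v) for u, v in embeddings_by_language.values()]
    return sum(scores) / len(scores)

# Illustrative 2-d vectors standing in for language-model embeddings.
pairs = {
    "en": ([1.0, 0.0], [0.8, 0.6]),
    "pt": ([0.9, 0.1], [0.7, 0.5]),
}
print(round(multilingual_similarity(pairs), 3))
```

The intuition is that noise in any single language's embeddings is averaged out, moving the combined score closer to the human judgments.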