Viviane P. Moreira


2026

Questions and answers are among the most fundamental forms of human communication. Question Answering (QA) is the task of correctly generating answers based on a context. To assess success on this task, generated answers are typically evaluated with traditional metrics such as BLEU, ROUGE, and METEOR. However, these metrics often fail to reflect the actual quality of the outputs. More recently, new evaluation metrics and the LLM-as-a-judge paradigm have also been applied to QA evaluation. To gain a deeper understanding of the capabilities and limitations of QA metrics, this work performs a comparative analysis of both traditional and more recent approaches to QA evaluation. Experiments were conducted on the Pirá dataset (in Portuguese) using four LLMs to generate answers. Additionally, human evaluation was performed to assess aspects such as correctness, completeness, clarity, and relevance of the generated content. We demonstrate that lexical metrics are limited for evaluating QA. We also observed that human evaluators favor models that provide higher information density, even when this contradicts prompt constraints, whereas lexical metrics penalize this verbosity. This divergence confirms that traditional metrics are insufficient for capturing the trade-off between instruction adherence and the semantic richness valued by native speakers.
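The verbosity penalty described above follows directly from how lexical metrics are computed. As an illustration (not the paper's evaluation code), the minimal unigram-overlap F1 below, a simplified ROUGE-1 variant with made-up example sentences, shows how extra correct-but-unreferenced content lowers precision and thus the score:

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 (a minimal ROUGE-1-style score)."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "the amazon rainforest regulates the climate"
concise = "the amazon rainforest regulates the climate"
verbose = ("the amazon rainforest regulates the climate and also hosts "
           "an enormous diversity of plant and animal species")

print(rouge1_f1(reference, concise))  # 1.0
print(rouge1_f1(reference, verbose))  # lower: added detail hurts precision
```

A human judge might prefer the verbose answer for its information density, while this metric ranks it strictly below the concise one, which is exactly the divergence observed in the study.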
Negation plays a fundamental role in human communication and logical reasoning, yet it remains underrepresented in natural language inference (NLI) datasets. This work investigates the impact of targeted data augmentation using negation cues on the main NLI datasets for Portuguese (InferBR, ASSIN, and ASSIN2). By synthetically generating new instances with negated hypotheses, we create more diverse training and test sets. A BERT-based model was fine-tuned and tested on the combined datasets and augmented data. The results show that the model was heavily influenced by the bias in the use of negation, and that increasing data diversity improves the model's handling of negation.
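To make the augmentation idea concrete, here is a deliberately naive sketch (not the paper's actual procedure) that inserts the Portuguese negation cue "não" into a hypothesis and flips an entailment label to contradiction; the token position and the label flip are simplifying assumptions, since correct cue placement and label assignment would require syntactic analysis:

```python
NEGATION_CUE = "não"

def negate_hypothesis(premise: str, hypothesis: str, label: str):
    """Naively negate a Portuguese hypothesis by inserting the cue
    after the first token (a rough stand-in for pre-verbal position)
    and flipping entailment to contradiction. Illustrative only."""
    tokens = hypothesis.split()
    negated = " ".join(tokens[:1] + [NEGATION_CUE] + tokens[1:])
    new_label = "contradiction" if label == "entailment" else label
    return premise, negated, new_label

example = negate_hypothesis(
    "Um homem corre na praia.",   # premise
    "Alguém está correndo.",      # hypothesis entailed by the premise
    "entailment",
)
print(example)  # negated hypothesis: "Alguém não está correndo."
```

Even a simple generator like this diversifies where and how negation appears, which is the property the augmented training sets exploit.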

2024

2020

This paper describes the system submitted by our team (BabelEnconding) to SemEval-2020 Task 3: Predicting the Graded Effect of Context in Word Similarity. We propose an approach that relies on translation and multilingual language models to compute the contextual similarity between pairs of words. Our hypothesis is that evidence from additional languages can improve the correlation with the human-generated scores. BabelEnconding was applied to both subtasks, ranked among the top 3 in six out of eight task/language combinations, and was the highest-scoring system three times.
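The core idea of pooling evidence across languages can be sketched as follows. This is an illustration with toy vectors, not the submitted system: the embeddings below are stand-ins for contextual vectors a multilingual language model would produce for a word pair, one similarity score per language, combined here with a simple mean (the actual combination strategy is an assumption):

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy contextual embeddings for one word pair in three languages
# (hypothetical values; a real system would obtain these from a
# multilingual LM after translating the context).
scores_by_language = {
    "en": cosine([0.9, 0.1, 0.0], [0.8, 0.2, 0.1]),
    "pt": cosine([0.7, 0.3, 0.1], [0.6, 0.4, 0.0]),
    "fr": cosine([0.8, 0.2, 0.2], [0.9, 0.1, 0.1]),
}

# Aggregate the per-language evidence into a single similarity score.
combined = sum(scores_by_language.values()) / len(scores_by_language)
print(round(combined, 3))
```

Averaging per-language scores is one straightforward way to let agreement across languages reinforce a similarity judgment while noise in any single language is dampened.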