Júlia da Rocha Junqueira
2026
The Inadequacy of Automatic Evaluation Metrics in Question Answering: A Case-Study in Portuguese
Júlia da Rocha Junqueira | Viviane P. Moreira
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1
Questions and answers are among the most fundamental forms of human communication. Question Answering (QA) is the task of generating correct answers given a context. To assess task success, generated answers are typically evaluated with traditional lexical metrics such as BLEU, ROUGE, and METEOR. However, these metrics often fail to reflect the actual quality of the outputs. More recently, newer evaluation metrics and the LLM-as-a-judge paradigm have also been applied to QA evaluation. To better understand the capabilities and limitations of QA metrics, this work performs a comparative analysis of traditional and more recent evaluation approaches. Experiments were conducted on the Pirá dataset (in Portuguese) using four LLMs to generate answers. Additionally, human evaluation assessed the correctness, completeness, clarity, and relevance of the generated content. We demonstrate that lexical metrics are limited for evaluating QA. We also observe that human evaluators favor models with higher information density, even when this contradicts prompt constraints, whereas lexical metrics penalize such verbosity. This divergence confirms that traditional metrics are insufficient for capturing the trade-off between instruction adherence and the semantic richness valued by native speakers.
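For concreteness, the lexical metrics the abstract names can be computed with the Hugging Face `evaluate` library. The sketch below scores a paraphrased Portuguese answer against a reference; the answer/reference pair is an invented illustration, not a Pirá instance, and is not the paper's evaluation code.

```python
# Minimal sketch: scoring a generated answer with BLEU, ROUGE, and METEOR
# via the Hugging Face `evaluate` library. The example strings are invented
# for illustration and are not drawn from the Pirá dataset.
import evaluate

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
meteor = evaluate.load("meteor")  # downloads NLTK resources on first use

predictions = ["O Oceano Atlântico cobre cerca de 20% da superfície da Terra."]
references = ["O Atlântico cobre aproximadamente 20% da superfície terrestre."]

# All three metrics rely on n-gram overlap, so a correct but reworded
# answer can still score low, which is the core limitation discussed above.
print(bleu.compute(predictions=predictions, references=[references]))
print(rouge.compute(predictions=predictions, references=references))
print(meteor.compute(predictions=predictions, references=references))
```

Because the scores are driven by surface overlap, a longer, information-dense answer that a human rater would prefer tends to lose n-gram precision against a short gold reference, which is one way the divergence described in the abstract can arise.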
LARI Dataset: A Native Portuguese Question Answering Dataset from Brasileiras em PLN
Júlia da Rocha Junqueira | Larissa A. de Freitas | Ulisses Brisolara Corrêa
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1
Recent advances in natural language processing have revolutionized Question Answering (QA). However, for languages like Portuguese, progress is often hindered by the lack of native training resources. To address this gap, this paper introduces LARI, a new dataset designed to benchmark and enhance QA in Portuguese. Our methodology combines the capabilities of the Sabiá-7B model, fine-tuned via QLoRA on a domain-specific corpus, with human validation. We used the book Natural Language Processing – Concepts, Techniques, and Applications in Portuguese (2nd Edition) as a case study for content extraction. The generated instances underwent expert human evaluation, achieving an average quality score of 4.47 out of 5. The final dataset, comprising 464 context-question-answer triples, is made publicly available to the community, offering a valuable resource for future research in low-resource settings.
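As a rough illustration of the fine-tuning recipe the abstract describes, the sketch below sets up QLoRA-style training with `transformers`, `bitsandbytes`, and `peft`. The model identifier, LoRA rank, and target modules are assumptions for illustration, not the paper's reported configuration.

```python
# Minimal sketch of QLoRA-style fine-tuning: load the base model in 4-bit
# NF4 quantization, then attach low-rank adapters so only a small fraction
# of parameters is trained. Hyperparameters here are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "maritaca-ai/sabia-7b"  # assumed Hugging Face id for Sabiá-7B

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,                                 # adapter rank (assumed)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed; varies by architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # confirms only the adapters are trainable
```

The quantized base weights stay frozen; training then proceeds on context-question-answer triples formatted as instruction prompts, with the resulting adapter weights being small enough to share separately from the base model.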