Gisele L. Pappa


2026

Dense retrieval is a critical component of Retrieval-Augmented Generation (RAG) systems and is highly sensitive to document representations. In consumer complaint settings, raw interaction texts are often lengthy and noisy, which limits retrieval effectiveness. This paper investigates whether schema-guided structured summaries can improve dense retrieval in RAG. We compare embeddings derived from raw interaction texts and from LLM-generated structured summaries in a controlled evaluation on Portuguese-language consumer complaints. Summary-based retrieval achieves a Recall@1 of 0.527, compared to 0.001 when indexing raw interactions, and reaches Recall@10 of 0.610, demonstrating gains of more than two orders of magnitude. These results show that structured summaries enable more effective and reliable retrieval at low cutoffs, making them particularly suitable for RAG pipelines.
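As an illustration of the Recall@k metric reported above, the following is a minimal sketch, assuming each query has exactly one relevant document and that ranking uses cosine similarity over precomputed embeddings. The function names, the toy vectors, and the single-relevant-document setup are assumptions made for illustration; the abstract does not specify the paper's actual evaluation protocol or embedding model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def recall_at_k(query_vecs, doc_vecs, relevant_idx, k):
    """Fraction of queries whose single relevant document appears in the top-k results.

    relevant_idx[i] is the index (into doc_vecs) of the document relevant to query i.
    This one-relevant-document-per-query setup is an assumption of this sketch.
    """
    hits = 0
    for q, target in zip(query_vecs, relevant_idx):
        scores = [cosine(q, d) for d in doc_vecs]
        # Rank document indices by descending similarity to the query.
        ranked = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(query_vecs)


# Toy embeddings (hypothetical): three indexed documents and two queries.
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[0.9, 0.1], [0.1, 0.9]]
relevant = [0, 2]

r1 = recall_at_k(queries, docs, relevant, k=1)
r2 = recall_at_k(queries, docs, relevant, k=2)
```

In a comparison like the one described, the same function would be run twice over the same queries: once with `doc_vecs` built from raw interaction texts and once with `doc_vecs` built from the structured summaries.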
We present ConsumerBR, a large-scale corpus of consumer complaints and company responses in Brazilian Portuguese, compiled from publicly available data on the Consumidor.gov.br platform. The corpus comprises over 3.1 million consumer–company interactions collected between 2021 and 2025 and combines anonymized textual content with rich structured metadata, including temporal information, complaint outcomes, and consumer satisfaction indicators. We describe a data collection strategy tailored to the platform’s dynamic interface, a preprocessing pipeline that includes response clustering to identify template-based replies, and a hybrid anonymization approach designed to mitigate privacy risks. We also provide a detailed statistical characterization of the corpus, highlighting its scale, coverage, and distributional properties. ConsumerBR is publicly available for research purposes and supports a wide range of applications, including complaint analysis, sentiment modeling, dialogue and response generation, and preference-based evaluation.

2024

Material facts (MFs) are crucial and obligatory disclosures that can significantly influence asset values. Following their release, financial analysts take on the meticulous and highly specialized task of writing analyses that shed light on their impact on company assets, a challenge compounded by the number of MFs released daily. Generative AI, with its demonstrated ability to produce coherent text, emerges as a promising solution to this task. However, while these analyses must incorporate the MF, they must also go beyond it, enriching it with vital background information, valuable and grounded recommendations, prospects, potential risks, and their underlying reasoning. In this paper, we approach this task as an instance of controllable text generation, aiming to ensure adherence to the MF and other pivotal attributes as control elements. We first explore language models' capacity to handle this task by embedding those elements into prompts and querying popular chatbots. A bilingual proof of concept underscores both the potential and the challenges of applying generative AI techniques to this task.