Gisele L. Pappa
2026
Structured Summaries for Retrieval-Augmented Generation in Portuguese-Language Consumer Complaints
Rafael Sant'Ana | Pedro Garcia | Luis A. Duarte | Mariana O. Silva | Adriano C. M. Pereira | Gisele L. Pappa
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1
Dense retrieval is a critical component of Retrieval-Augmented Generation (RAG) systems and is highly sensitive to document representations. In consumer complaint settings, raw interaction texts are often lengthy and noisy, which limits retrieval effectiveness. This paper investigates whether schema-guided structured summaries can improve dense retrieval in RAG. We compare embeddings derived from raw interaction texts and from LLM-generated structured summaries in a controlled evaluation on Portuguese-language consumer complaints. Summary-based retrieval achieves a Recall@1 of 0.527, compared to 0.001 when indexing raw interactions, a gain of more than two orders of magnitude, and reaches a Recall@10 of 0.610. These results show that structured summaries enable more effective and reliable retrieval at low cutoffs, making them particularly suitable for RAG pipelines.
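For readers unfamiliar with the metric, Recall@k as reported in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: it assumes each query has exactly one relevant document (the abstract does not state this), and the function name, query ids, and document ids are all hypothetical.

```python
from typing import Dict, Sequence


def recall_at_k(
    ranked: Dict[str, Sequence[str]],
    relevant: Dict[str, str],
    k: int,
) -> float:
    """Fraction of queries whose relevant document appears in the top-k results.

    Assumption (not from the paper): one relevant document per query.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        return 0.0
    hits = 0
    for query_id, doc_id in relevant.items():
        # A query with no ranking at all counts as a miss.
        top_k = list(ranked.get(query_id, []))[:k]
        if doc_id in top_k:
            hits += 1
    return hits / len(relevant)


# Toy data, purely illustrative: four queries, one relevant document each.
ranked_results = {
    "q1": ["d1", "d2", "d3"],  # relevant document at rank 1
    "q2": ["d9", "d2", "d4"],  # relevant document at rank 2
    "q3": ["d7", "d8", "d3"],  # relevant document at rank 3
    "q4": ["d5", "d6", "d0"],  # relevant document not retrieved
}
relevant_docs = {"q1": "d1", "q2": "d2", "q3": "d3", "q4": "d4"}

for cutoff in (1, 2, 3):
    print(f"Recall@{cutoff} = {recall_at_k(ranked_results, relevant_docs, cutoff):.2f}")
```

Under this definition, a Recall@1 of 0.527 would mean that the correct document is ranked first for roughly half of the queries.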
ConsumerBR: A Large-Scale Corpus of Consumer Complaints in Brazilian Portuguese
Luis A. Duarte | Pedro Giacomin | Vitória Bispo | Mariana O. Silva | Adriano C. M. Pereira | Gisele L. Pappa
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1
We present ConsumerBR, a large-scale corpus of consumer complaints and company responses in Brazilian Portuguese, compiled from publicly available data on the Consumidor.gov.br platform. The corpus comprises over 3.1 million consumer–company interactions collected between 2021 and 2025 and combines anonymized textual content with rich structured metadata, including temporal information, complaint outcomes, and consumer satisfaction indicators. We describe a data collection strategy tailored to the platform’s dynamic interface, a preprocessing pipeline that includes response clustering to identify template-based replies, and a hybrid anonymization approach designed to mitigate privacy risks. We also provide a detailed statistical characterization of the corpus, highlighting its scale, coverage, and distributional properties. ConsumerBR is publicly available for research purposes and supports a wide range of applications, including complaint analysis, sentiment modeling, dialogue and response generation, and preference-based evaluation.
2024
Analysis of Material Facts on Financial Assets: A Generative AI Approach
Gabriel Assis | Daniela Vianna | Gisele L. Pappa | Alexandre Plastino | Wagner Meira Jr | Altigran Soares da Silva | Aline Paes
Proceedings of the Joint Workshop of the 7th Financial Technology and Natural Language Processing, the 5th Knowledge Discovery from Unstructured Data in Financial Services, and the 4th Workshop on Economics and Natural Language Processing
Material facts (MFs) are mandatory disclosures that can significantly influence asset values. Following their release, financial analysts undertake the meticulous and highly specialized task of writing analyses that shed light on their impact on company assets, a challenge compounded by the volume of MFs released daily. Generative AI, with its demonstrated ability to produce coherent text, emerges as a promising solution to this task. However, while these analyses must incorporate the MF, they must also go beyond it, enriching it with vital background information, valuable and grounded recommendations, prospects, potential risks, and their underlying reasoning. In this paper, we approach this task as an instance of controllable text generation, aiming to ensure adherence to the MF and other pivotal attributes as control elements. We first explore language models' capacity to manage this task by embedding those elements into prompts and engaging popular chatbots. A bilingual proof of concept underscores both the potential and the challenges of applying generative AI techniques to this task.