Social-RAG: A Retrieval-Augmented Generation Pipeline for Computational Social Science Research on Telegram

Leonardo Nascimento; Eric Brasil; Arthur Lima; Gabriel Andrade; Ricardo José Andrade; Tarssio Barreto

Abstract

Digital trace data have expanded empirical opportunities in the social sciences while intensifying the methodological challenge of scale: researchers increasingly face corpora too large and fast-moving to read exhaustively without sacrificing interpretive rigor. This article presents Social-RAG, a modular Retrieval-Augmented Generation (RAG) architecture designed to support scalable qualitative inquiry over large text corpora while preserving evidence traceability, auditability, and researcher control. Our empirical basis consists of messages from public Telegram groups and channels, organized into two thematic subsets: vaccine-related discourse and debates surrounding Brazil’s Lei Rouanet cultural funding policy. We detail key design decisions, including a “one post = one chunk” indexing strategy, semantic retrieval over vector embeddings with efficient approximate nearest neighbor (ANN) search, an Adaptive-K dynamic cutoff for context selection, Maximal Marginal Relevance (MMR) re-ranking for diversity, and structured analytical instructions that constrain generation to retrieved evidence. We evaluate system behavior using two complementary question blocks, hermeneutic (narrative) and factual, and compare outputs across three language models with distinct deployment profiles (a local open-weight model, a cloud open-weight model, and a commercial closed model), using an LLM-as-judge protocol with explicit qualitative criteria. Results show consistent behavior across both thematic corpora and highlight a key trade-off: the two larger models (the cloud open-weight model and the commercial closed model) perform similarly and robustly in both narrative and factual tasks when evidential discipline is maintained, whereas the smaller local model remains useful for exploratory narrative synthesis but is less reliable for strict factual extraction and attribution. We conclude by discussing methodological implications, limitations, and future directions, with a focus on scalability and extensibility to new data types and analytical problems.
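
The abstract names MMR re-ranking as the diversity step but does not give its implementation. As an illustration only, the following is a minimal sketch of the standard MMR formulation (greedy selection that trades off query relevance against similarity to already-selected items). The function names, the `lam` trade-off parameter and its default, and the toy vectors are assumptions made for this example, not details taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_rerank(query_vec, doc_vecs, k, lam=0.7):
    """Return indices of up to k documents chosen greedily by MMR.

    lam is a hypothetical trade-off weight: 1.0 ranks by relevance only,
    lower values penalize similarity to documents already selected.
    """
    relevance = [cosine(query_vec, d) for d in doc_vecs]
    selected = []
    candidates = list(range(len(doc_vecs)))

    while candidates and len(selected) < k:

        def score(i):
            # Redundancy: highest similarity to anything already picked.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance[i] - (1 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)

    return selected


# Toy embeddings (assumed for illustration): posts 0 and 1 are near-duplicates.
query = [1.0, 0.0]
posts = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]

relevance_only = mmr_rerank(query, posts, k=2, lam=1.0)
diverse = mmr_rerank(query, posts, k=2, lam=0.3)
```

With `lam=1.0` the sketch reduces to plain relevance ranking and keeps both near-duplicate posts; lowering `lam` lets the less similar post displace the duplicate, which is the diversity effect the re-ranking step is meant to provide.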

Anthology ID: 2026.propor-2.34
Volume: Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 2
Month: April
Year: 2026
Address: Salvador, Brazil
Editors: Marlo Souza, Iria de-Dios-Flores, Diana Santos, Larissa Freitas, Jackson Wilke da Cruz Souza, Eugénio Ribeiro
Venue: PROPOR
Publisher: Association for Computational Linguistics
Pages: 255–265
URL: https://aclanthology.org/2026.propor-2.34/
Cite (ACL): Leonardo Nascimento, Eric Brasil, Arthur Lima, Gabriel Andrade, Ricardo José Andrade, and Tarssio Barreto. 2026. Social-RAG: A Retrieval-Augmented Generation Pipeline for Computational Social Science Research on Telegram. In Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 2, pages 255–265, Salvador, Brazil. Association for Computational Linguistics.
Cite (Informal): Social-RAG: A Retrieval-Augmented Generation Pipeline for Computational Social Science Research on Telegram (Nascimento et al., PROPOR 2026)
PDF: https://aclanthology.org/2026.propor-2.34.pdf