RacismoBR: A Manually Annotated Dataset for Racist Discourse Detection in Brazilian Portuguese

João Vítor Vaz, Fabrício Benevenuto, Marcos André Gonçalves


Abstract
Racist discourse on social media appears both as explicit attacks and in subtle, context-dependent forms, and it remains a challenge for Natural Language Processing. We introduce RacismoBR, a culturally grounded dataset for detecting racist discourse in Brazilian Portuguese, manually annotated exclusively by Black researchers to ensure sociolinguistic validity and epistemic representativeness. We conduct a controlled evaluation of binary racism classification on our dataset across three modeling paradigms: classical machine learning, supervised Transformer-based (Small) Language Models, and Large Language Models under few-shot in-context learning. Results show that GPT-4.1 and BERTimbau yield the highest Macro-F1 scores; however, Wilcoxon signed-rank tests reveal no statistically significant differences across models, mostly due to high variability. Across paradigms, classifiers consistently display higher precision for non-racist content and higher recall for racist content. A qualitative analysis highlights persistent difficulties with implicit, euphemized, and context-dependent racism. These findings indicate that culturally grounded annotation plays a more decisive role than architectural sophistication alone in advancing racism detection.
Anthology ID:
2026.propor-1.76
Volume:
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1
Month:
April
Year:
2026
Address:
Salvador, Brazil
Editors:
Marlo Souza, Iria de-Dios-Flores, Diana Santos, Larissa Freitas, Jackson Wilke da Cruz Souza, Eugénio Ribeiro
Venue:
PROPOR
Publisher:
Association for Computational Linguistics
Pages:
770–779
URL:
https://aclanthology.org/2026.propor-1.76/
Cite (ACL):
João Vítor Vaz, Fabrício Benevenuto, and Marcos André Gonçalves. 2026. RacismoBR: A Manually Annotated Dataset for Racist Discourse Detection in Brazilian Portuguese. In Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1, pages 770–779, Salvador, Brazil. Association for Computational Linguistics.
Cite (Informal):
RacismoBR: A Manually Annotated Dataset for Racist Discourse Detection in Brazilian Portuguese (Vaz et al., PROPOR 2026)
PDF:
https://aclanthology.org/2026.propor-1.76.pdf