João Vítor Vaz
2026
RacismoBR: A Manually Annotated Dataset for Racist Discourse Detection in Brazilian Portuguese
João Vítor Vaz | Fabrício Benevenuto | Marcos André Gonçalves
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1
Racist discourse on social media appears both as explicit attacks and in subtle, context-dependent forms, and it remains a challenge for Natural Language Processing. We introduce RacismoBR, a culturally grounded dataset for detecting racist discourse in Brazilian Portuguese, manually annotated exclusively by Black researchers to ensure sociolinguistic validity and epistemic representativeness. We conduct a controlled evaluation of binary racism classification on our dataset across several modeling paradigms: classical machine learning, supervised Transformer-based (Small) Language Models, and Large Language Models under in-context, few-shot learning. Results show that GPT-4.1 and BERTimbau yield the highest Macro-F1 scores; however, Wilcoxon signed-rank tests reveal no statistically significant differences across models, mostly due to high variability. Across paradigms, classifiers consistently display higher precision for non-racist content and higher recall for racist content. A qualitative analysis highlights persistent difficulties with implicit, euphemized, and context-dependent racism. These findings indicate that culturally grounded annotation plays a more decisive role than architectural sophistication alone in advancing racism detection.
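As an illustration of the metrics named in the abstract, the following minimal sketch computes per-class precision, recall, and Macro-F1 for a binary task. The function names, the label encoding (1 = racist, 0 = non-racist), and the toy labels are assumptions made for this example and are not taken from the paper or the dataset.

```python
from typing import List, Sequence, Tuple


def per_class_scores(
    y_true: Sequence[int], y_pred: Sequence[int], label: int
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for one class, treating `label` as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def macro_f1(
    y_true: Sequence[int], y_pred: Sequence[int], labels: Sequence[int] = (0, 1)
) -> float:
    """Unweighted mean of the per-class F1 scores."""
    return sum(per_class_scores(y_true, y_pred, lab)[2] for lab in labels) / len(labels)


# Hypothetical toy data: 1 = racist, 0 = non-racist (assumed encoding).
y_true: List[int] = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred: List[int] = [1, 1, 1, 1, 1, 0, 0, 0]

racist_scores = per_class_scores(y_true, y_pred, label=1)
non_racist_scores = per_class_scores(y_true, y_pred, label=0)
overall = macro_f1(y_true, y_pred)
```

In this toy case the classifier over-predicts the racist class, so recall is higher for the racist class and precision is higher for the non-racist class, which is the same kind of asymmetry the abstract describes. Because Macro-F1 averages the two per-class F1 scores without weighting, the minority class counts as much as the majority class.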