Carlos Henrique Santos Barros


2026

Automatic pun detection remains challenging because it depends on lexical ambiguity and contextual interaction, neither of which is explicitly captured by linear text representations. In Portuguese, TF-IDF-based ensemble methods provide competitive and interpretable baselines but remain limited by surface-level features. This work investigates whether corpus-based graph information can complement such methods. Three graph representations are constructed from the Puntuguese corpus: a Co-occurrence graph, a PPMI-weighted graph, and a Pun-Context graph. In the proposed pipeline, each graph is converted into low-dimensional node embeddings with TruncatedSVD, which are then aggregated into document-level features and concatenated with TF-IDF representations in a soft-voting ensemble. Experimental results on the test set show that graph-based enrichment does not uniformly improve performance: the Pun-Context and PPMI graphs yield the strongest graph-augmented results, whereas combining all three graphs degrades performance. These findings indicate that the usefulness of graph-based information depends strongly on how lexical relations are encoded and aggregated at the document level.
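As a concrete illustration, the following minimal Python sketch traces this pipeline for the PPMI-weighted graph using scikit-learn. The toy corpus, the window size, the mean-pooling aggregation, and the choice of base classifiers (LogisticRegression and RandomForestClassifier) are illustrative assumptions; the abstract does not specify these details.

# Minimal sketch of the graph-augmented pipeline described above, assuming a
# window-based co-occurrence graph and mean-pooled node embeddings. The corpus,
# window size, and hyperparameters are illustrative, not those of the thesis.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["a pun plays on two senses", "plain text has one sense",
        "two senses collide in a pun", "one sense only here"]
labels = [1, 0, 1, 0]

# 1. Build a symmetric word co-occurrence matrix over a sliding window.
vocab = sorted({w for d in docs for w in d.split()})
idx = {w: i for i, w in enumerate(vocab)}
cooc = np.zeros((len(vocab), len(vocab)))
for d in docs:
    toks = d.split()
    for i, w in enumerate(toks):
        for v in toks[max(0, i - 2):i] + toks[i + 1:i + 3]:  # window = 2
            cooc[idx[w], idx[v]] += 1

# 2. Re-weight edges with positive PPMI: max(0, log p(w,v) / (p(w) p(v))).
total = cooc.sum()
pw = cooc.sum(axis=1, keepdims=True) / total
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log((cooc / total) / (pw @ pw.T))
ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)

# 3. Factorize the graph into low-dimensional node embeddings.
node_emb = TruncatedSVD(n_components=3, random_state=0).fit_transform(ppmi)

# 4. Aggregate node embeddings into document features (mean pooling assumed).
doc_graph = np.array([
    np.mean([node_emb[idx[w]] for w in d.split()], axis=0) for d in docs])

# 5. Concatenate TF-IDF with graph features and train a soft-voting ensemble.
tfidf = TfidfVectorizer().fit_transform(docs).toarray()
X = np.hstack([tfidf, doc_graph])
ensemble = VotingClassifier(
    [("lr", LogisticRegression(max_iter=1000)),
     ("rf", RandomForestClassifier(random_state=0))],
    voting="soft").fit(X, labels)
print(ensemble.predict(X))

The same skeleton would apply to the Co-occurrence and Pun-Context graphs by swapping the edge-weighting step (step 2) for the corresponding graph construction.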
Supervised models trained on community-labeled data have shown promise in Health Question Answering (HQA), but relying on “likes” as a proxy for clinical usefulness remains controversial. This work investigates the alignment between automated predictions and human perception in Portuguese HQA. Using a subset of the SaudeBR-QA corpus, we compare the predictions of a Random Forest classifier against judgments from a controlled evaluation conducted by laypeople and healthcare professionals. Our results reveal a recurring divergence that we term Superficiality Bias: human evaluators frequently validate very brief answers, whereas the classifier often labels these cases as non-useful under its learned criteria. Rather than indicating that the model is inherently more clinically accurate, this pattern suggests a misalignment between community feedback and feature-driven utility judgments. We argue that crowd-based labels in medical domains should be treated cautiously and complemented with more rigorous annotation protocols.
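The divergence analysis can be sketched as follows. Answer length is used here as a deliberately simple stand-in for the classifier's learned features, and the toy answers, labels, and 25-character brevity threshold are illustrative assumptions, not the actual SaudeBR-QA data or annotation protocol.

# Hedged sketch of the divergence analysis described above: a Random Forest
# trained on like-derived labels is compared against human judgments, and
# disagreements on very brief answers are flagged as candidate instances of
# Superficiality Bias. All data and thresholds below are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def length_features(answers):
    # Character and token counts; the real system would use richer features.
    return np.array([[len(a), len(a.split())] for a in answers])

# Training set labeled by community "likes": brief answers tend to get few.
train = ["See a doctor immediately and describe all of your symptoms.",
         "Ok.", "Drink water, rest, and return if the fever persists.",
         "Maybe."]
like_labels = [1, 0, 1, 0]
clf = RandomForestClassifier(random_state=0).fit(
    length_features(train), like_labels)

# Controlled evaluation subset: human evaluators judged both answers useful.
eval_answers = ["Yes.", "Take it after meals twice a day for seven days."]
human_labels = [1, 1]
preds = clf.predict(length_features(eval_answers))

# Flag disagreements on very brief answers: candidate Superficiality Bias cases.
for ans, pred, human in zip(eval_answers, preds, human_labels):
    if len(ans) < 25 and human == 1 and pred == 0:
        print("Superficiality Bias candidate:", repr(ans))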