Rafael Anchiêta


2024

Puntuguese: A Corpus of Puns in Portuguese with Micro-edits
Marcio Lima Inacio | Gabriela Wick-Pedro | Renata Ramisch | Luís Espírito Santo | Xiomara S. Q. Chacon | Roney Santos | Rogério Sousa | Rafael Anchiêta | Hugo Goncalo Oliveira
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Humor is an intricate part of verbal communication, and dealing with this kind of phenomenon is essential to building systems that can process language at large with all of its complexities. In this paper, we introduce Puntuguese, a new corpus of punning humor in Portuguese, motivated by previous work showing that currently available corpora for this language are still unfit for Machine Learning due to data leakage. Puntuguese comprises 4,903 manually gathered punning one-liners in Brazilian and European Portuguese. To create negative examples that differ exclusively in terms of funniness, we carried out a micro-editing process in which all jokes were edited by fluent Portuguese speakers to make the texts unfunny. Finally, we ran experiments on Humor Recognition, showing that Puntuguese is considerably more difficult than the previous corpus, achieving an F1-Score of 68.9%. With this new dataset, we hope to enable research not only in NLP but also in other fields that are interested in studying humor; thus, the data is publicly available.

A Reproducibility Analysis of Portuguese Computational Processing Conferences: A Case of Study
Daniel Leal | Anthony Luz | Rafael Anchiêta
Proceedings of the 16th International Conference on Computational Processing of Portuguese - Vol. 1

2021

A Semi-Supervised Approach to Detect Toxic Comments
Ghivvago Damas Saraiva | Rafael Anchiêta | Francisco Assis Ricarte Neto | Raimundo Moura
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

Toxic comments contain forms of non-acceptable language targeted towards groups or individuals. These types of comments have become a serious concern for government organizations, online communities, and social media platforms. Although there are some approaches to handling non-acceptable language, most of them focus on supervised learning and the English language. In this paper, we approach toxic comment detection with a semi-supervised strategy over a heterogeneous graph. We evaluate the approach on a Portuguese-language dataset of toxic comments, outperforming several graph-based methods and achieving competitive results compared to transformer architectures.

2020

Semantically Inspired AMR Alignment for the Portuguese Language
Rafael Anchiêta | Thiago Pardo
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Abstract Meaning Representation (AMR) is a graph-based semantic formalism where the nodes are concepts and the edges are relations among them. Most AMR parsing methods require alignment between the nodes of the graph and the words of the sentence. However, this alignment is not provided by manual annotations, and the available automatic aligners focus only on the English language and do not perform well for other languages. To fill this gap, we developed an alignment method for the Portuguese language based on more semantically matched word-concept pairs. We performed both intrinsic and extrinsic evaluations and showed that our alignment approach outperforms the alignment strategies developed for English, improving AMR parsers and achieving competitive results with a parser designed for the Portuguese language.

2018

Towards AMR-BR: A SemBank for Brazilian Portuguese Language
Rafael Anchiêta | Thiago Pardo
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

Improving Opinion Summarization by Assessing Sentence Importance in On-line Reviews
Rafael Anchiêta | Rogerio Figueredo Sousa | Raimundo Moura | Thiago Pardo
Proceedings of the 11th Brazilian Symposium in Information and Human Language Technology