Thiago A. S. Pardo


2026

Enhanced Universal Dependencies (EUD) provide a more informative syntactic representation than Basic Universal Dependencies by relaxing tree constraints to allow for graph structures. While conversion rules from basic to enhanced relations have been established for Portuguese, they were previously evaluated only on journalistic text using gold-standard basic syntactic trees. This paper evaluates the robustness of these rules in diverse scenarios ("in the wild"), encompassing other text genres and domains, as well as realistic parsing pipelines that rely on automatically generated basic syntax. Our results demonstrate that Portuguese-specific rules consistently outperform universal rules. However, the reliance on automatic basic syntax significantly impacts performance. This degradation is particularly severe when the domain of the input text differs from the training data of the basic parser. We also provide a detailed error analysis, identifying specific difficult linguistic phenomena and scenarios.
This Ph.D. dissertation advances the state-of-the-art in Natural Language Processing (NLP) for Portuguese by proposing new and innovative data resources and explainable methods for hate speech detection and automated fact-checking. The thesis introduces several benchmark datasets for Brazilian Portuguese, HateBR, HateBRXplain, HateBRMoralXplain, MFTCXplain, MOL, and FactNews, which have been widely adopted by the research community and address critical gaps in the availability of high-quality annotated resources for Portuguese. In addition, this dissertation proposes novel post-hoc and self-explaining NLP methods: Sentence-Level Factual Reasoning (SELFAR), Social Stereotype Analysis (SSA), Contextual Bag-of-Words with Interpretable Input and Feature Optimization (B+M), Supervised Rational Attention (SRA), and Supervised Moral Rational Attention (SMRA). Across multiple tasks and datasets in Portuguese, these methods outperform baselines while improving interpretability and robustness, demonstrating that explainability and performance can be jointly optimized. Finally, this thesis has achieved significant national and international impact, being cited by leading universities and research institutes worldwide and fostering new M.Sc. and Ph.D. research projects in Brazil. Its scientific and social contributions have also been recognized with multiple prestigious national and international awards, including the Google LARA, the Maria Carolina Monard Best Thesis Award in Artificial Intelligence, the Trevisan Prize for Students “AI for Good” from Bocconi University for rigorous computer science research in AI with social impact, and the Diversity and Inclusion Award from the Association for Computational Linguistics (ACL). Lastly, this thesis has received two nominations for the Brazilian Computer Society Thesis Awards in Computer Science, and in Multimedia, Hypermedia, and Web.
This paper reports on an effort to recover the classical morphosyntactically annotated corpus MacMorpho and realign it with the current version of the Universal Dependencies framework. We introduce a knowledge-rich approach grounded in a syntactic parser and on a specially designed tagset compatibility strategy in order to generate a "silver-standard" resource: the MacMorpho-UD-2.17. We evaluate this resource through multiple complementary methods, providing evidence for the quality of both our approach and the resulting annotation.

2025

Enhanced Universal Dependencies (EUD) serve as a crucial link between syntax and semantics. Beyond basic syntactic dependencies, EUD provides valuable refined logical connections for downstream tasks such as semantic role labeling, coreference resolution, information extraction, and question answering. The original EUD framework defines six types of relationships, but this paper introduces an extension designed to address subject propagation in pro-drop languages. This “Extended EUD” proposal increases the number of relationships that may be annotated in sentences, improving linguistic representation. Additionally, we report our experiments on a corpus of Portuguese (a pro-drop language), which we make publicly available to the research community.

2024

Most existing fact-checking systems are unable to explain their decisions by providing relevant rationales (justifications) for their predictions. It highlights a lack of transparency that poses significant risks, such as the prevalence of unexpected biases, which may increase political polarization due to limitations in impartiality. To address this critical gap, we introduce SEntence-Level FActual Reasoning (SELFAR), aimed at improving explainable fact-checking. SELFAR relies on fact extraction and verification by predicting the news source reliability and factuality (veracity) of news articles or claims at the sentence level, generating post-hoc explanations using SHAP/LIME and zero-shot prompts. Our experiments show that unreliable news stories predominantly consist of subjective statements, in contrast to reliable ones. Consequently, predicting unreliable news articles at the sentence level by analyzing impartiality and subjectivity is a promising approach for fact extraction and improving explainable fact-checking. Furthermore, LIME outperforms SHAP in explaining predictions on reliability. Additionally, while zero-shot prompts provide highly readable explanations and achieve an accuracy of 0.71 in predicting factuality, their tendency to hallucinate remains a challenge. Lastly, this paper also presents the first study on explainable fact-checking in the Portuguese language.

2023

2020

This paper describes the submission by the NILC Computational Linguistics research group of the University of São Paulo/Brazil to the RDF-to-Text task for English at the WebNLG+ challenge. The success of the current pretrained models like BERT or GPT-2 in text-to-text generation tasks is well-known, however, its application/success on data-totext generation has not been well-studied and proven. This way, we explore how good a pretrained model, in particular BART, performs on the data-to-text generation task. The results obtained were worse than the baseline and other systems in almost all automatic measures. However, the human evaluation shows better results for our system. Besides, results suggest that BART may generate paraphrases of reference texts.

2015

2013

2007