Norton Trevisan Roman

Also published as: Norton T. Roman, Norton Trevisan Roman

2026

Lexical and Orthographic Variation in Portuguese Financial Tweets: Annotation, Analysis, and Implications for Embedding Models
Ariani Di Felippo | Norton Trevisan Roman | Bryan K. S. Barbosa | Gabriela Pinheiro de Oliveira | Clarissa Lenina Scandarolli
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1

Twitter/X remains a key source of user-generated content, requiring Natural Language Processing tools capable of handling non-canonical language. This study presents a manual annotation of lexical and orthographic phenomena in DANTEStocks, a corpus of financial tweets in Brazilian Portuguese, using a hierarchical typology to capture both creative uses and deviations from the standard norm. Results show that orthographic variation is strongly influenced by creative forms, mainly driven by platform- and domain-specific innovations. Standard norm variation is systematic, mostly involving predictable omissions of diacritics and the cedilla, and most tokens exhibit only one phenomenon, reflecting stable and largely independent patterns of variation in this Twitter subgenre. The identified variant forms enabled the construction of a lexicon for evaluating embedding models. We assessed how BERTimbau, Word2Vec, and FastText handle lexical variation in raw, unnormalized data, showing that the lexicon reduces out-of-vocabulary rates and improves coverage. These results highlight model robustness and the value of curated lexical resources in complementing both fixed and data-driven vocabularies.

2025

pdf bib abs

Classifying Emotions in Tweets from the Financial Market: A BERT-based Approach
Wesley Pompeu Carvalho | Norton Trevisan Roman
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era

Behavioural finance emphasizes the relevance of investor sentiment and emotions in the pricing of financial assets. However, little research has examined how discrete emotions can be detected in text related to this domain, with extant work focusing mostly in sentiment instead. This study approaches this problem by describing a framework for emotion classification in tweets related to the stock market, written in Brazilian Portuguese. Emotion classifiers were then developed, based on Plutchik’s psychoevolutionary theory, by fine-tuning BERTimbau, a pre-trained BERT-based language model for Brazilian Portuguese, and applying it to an existing corpus of tweets, from the stock market domain, previously annotated with emotions. Each of Plutchik’s four emotional axes was modelled as a ternary classification problem. For each axis, 30 independent training iterations were executed using a repeated holdout strategy with different train/test splits in each iteration. In every iteration, hyperparameter tuning was performed via 10-fold stratified cross-validation on the training set to identify the best configuration. A final model was then retrained using the selected hyperparameters and evaluated on a hold-out test set, generating a distribution of macro-F1 scores in out-of-sample data. The results demonstrated statistically significant improvements over a stratified random baseline (Welch’s t-test, << 0.001 across all axes), with macro-F1 scores ranging from 0.50 to 0.61. These findings point to the feasibility of using transformer-based models to capture emotional nuance in financial texts written in Portuguese and provide a reproducible framework for future research.

pdf bib abs

It’s about What and How you say it: A Corpus with Stance and Sentiment Annotation for COVID-19 Vaccines Posts on X/Twitter by Brazilian Political Elites
Lorena Barberia | Pedro Schmalz | Norton Trevisan Roman | Belinda Lombard | Tatiane Moraes de Sousa
Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities

This paper details the development of a corpus with posts in Brazilian Portuguese published by Brazilian political elites on X (formerly Twitter) regarding COVID-19 vaccines. The corpus consists of 9,045 posts annotated for relevance, stance and sentiment towards COVID-19 vaccines and vaccination during the first three years of the COVID-19 pandemic (2020-2022).Nine annotators, working in three groups, classified relevance, stance, and sentiment in messages posted between 2020 and 2022 by local political elites. The annotators underwent extensive training, and weekly meetings were conducted to ensure intra-group annotation consistency. The analysis revealed fair to moderate inter-annotator agreement (Average Krippendorf’s alpha of 0.94 for relevance, 0,67 for sentiment and 0,70 for stance). This work makes four significant contributions to the literature. First, it addresses the scarcity of corpora in Brazilian Portuguese, particularly on COVID-19 or vaccines in general. Second, it provides a reliable annotation scheme for sentiment and stance classification, distinguishing both tasks, thereby improving classification precision. Third, it offers a corpus annotated with stance and sentiment according to this scheme, demonstrating how these tasks differ and how conflating them may lead to inconsistencies in corpus construction, as a results of confounding these phenomena — a recurring issue in NLP research beyond studies focusing on vaccines. And fourth, this annotated corpus may serve as the gold standard for fine-tuning and evaluating supervised machine learning models for relevance, sentiment and stance analysis of X posts on similar domains.

AgreeCalc: Uma Ferramenta para Análise da Concordância entre Múltiplos Anotadores (AgreeCalc: A Tool for the Analysis of Agreement Between Multiple Annotators) [in Portuguese]
Alexandre Rossi Alvares | Norton Trevisan Roman
Proceedings of the 9th Brazilian Symposium in Information and Human Language Technology

pdf bib

Introducing a Corpus of Human-Authored Dialogue Summaries in Portuguese
Norton Trevisan Roman | Paul Piwek | Ariadne M. B. Rizzoni Carvalho | Alexandre Rossi Alvares
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013